Publications:
"Answering XML Queries over Heterogeneous Data Sources", accepted for publication in the VLDB conference, 2001. This paper details the query processing framework of Agora. We focus on query translation from the new XML query language to be standardized by the World Wide Web Consortium (W3C) into SQL. (PostScript, PDF). "Agora: Living with XML and Relational", software demonstration at VLDB 2000 (PostScript, PDF). A detailed version of the demo (as you can find it on the VLDB digital antology) is provided below. "Pushing XML Queries inside Relational Databases", Technical Report no. 4211, INRIA, 2001 (PostScript, PDF). This report investigates the usage of Agora as a XML layer on top of relational DBMSs.
Data Integration: Agora is the first data integration system
with a global XML schema adopting the "local-as-view" approach, in which the global schema is defined
as a view over the local schemas.
Data Sources: relational (from RDBMS to everything that can
be wrapped as a table), and DOM-compliant (most importantly XML or anything
wrapped as).
Query Language: We had started with Quilt
,
whose syntax became the basis of the XML query language being standardized
by the W3C; this standard language is called XQuery.
We are adapting to the official XQuery syntax - really easy :-)
Query Processing Algorithms: all relational. Yes, relational
! Check our technical report for more details.
If you are interested in good surveys on data integration and a comparison of the "local-as-view" approach with alternative approaches, there are two important surveys: one by Jeffrey Ullmann, and another one by Alon Levy.You can have a look at what the World Wide Web Consortium Working Group on XML Query Languages (aka W3C XML QWG) is doing. Normally, they're bubbling with activity. Most importantly, they are working on an algebra, and on a data model for XML documents, to be used together to interpret what an XQuery query means.
Who is responsible for all this:
Ioana Manolescu Daniela Florescu Donald Kossmann Dan Olteanu Florian Xhumari Lucian Precup
We demonstrate a data integration system capable of answering XML queries
over both relational and XML data, by using relational query processing
techniques.
The query language for XML queries is Quilt;
we query our relational data sources under an XML representation (the user
can supply one, or we use a simple default representation).
<!DOCTYPE medications -->
<!ELEMENT table (tuple)*>
<!ATTLIST table tableName (#PCDATA)>
<!ELEMENT tuple (generic_name, brand_name, class_mechanism, prescription,
gen_equivalent, preparation, storage, prescribed_for, dosing, interaction,
side_effect)>
<!ELEMENT generic_name (#PCDATA)>
<!ELEMENT brand_name (#PCDATA)>
<!ELEMENT class_mechanism (#PCDATA)>
<!ELEMENT prescription (#PCDATA)>
<!ELEMENT gen_equivalent (#PCDATA)>
<!ELEMENT preparation (#PCDATA)>
<!ELEMENT storage (#PCDATA)>
<!ELEMENT prescribed_for (#PCDATA)>
<!ELEMENT dosing (#PCDATA)>
<!ELEMENT interaction (#PCDATA)>
<!ELEMENT side_effect (#PCDATA)>
<!--]>
As it can be seen, the default representation models the table/tuple/attribute
structure by a simple three-levels hierarchy.
<!ELEMENT article (power_points?, paragraph*)>
<!ATTLIST article name CDATA #REQUIRED>
<!ELEMENT power_points (point)+>
<!ELEMENT point (#PCDATA)>
<!ELEMENT paragraph (#PCDATA|title)*>
<!ELEMENT title (#PCDATA)>
<!ELEMENT related_terms (term)+>
<!ELEMENT term EMPTY>
<!ATTLIST term name CDATA #REQUIRED>
<!ELEMENT related_med_procedures (procedure)+>
<!ELEMENT procedure EMPTY>
<!ATTLIST procedure name CDATA #REQUIRED>
<!ELEMENT related_medications (medication)+>
<!ELEMENT medication EMPTY>
<!ATTLIST medication name CDATA #REQUIRED>
<!--]>
for $d in document("diseases.xml")//diseases,
$a in $d/article,
$p in $a/paragraph,
$r in $d/related_medications/,
$n in $a@name,
$rn in $r/medication,
$m in document("medication.xml")//tuple,
$b in $m/brand_name
where contains($p, "contagious") and content($rn)=content($b)
return <result>
fullText($d),
fullText($m)
</result>
The first block of for makes the variables iterate over elements from the "diseases.xml" document. Thus, $d iterates over all the diseases, $a over all articles concerning this disease, $r over the global list of medications related to it, while $rn iterates over the actual set of medications; $n is bound to the name attribute of the $d disease element. The second for block iterates over all the tuple elements and extracts the name of the medications (remember that the relational table we consider contains medication information). From the complete for block result tuples of bindings for all the variables present inside the block.
The where condition performs a selection on the binding tuples: only those for which $p corresponds to a paragraph containing the word "contagious" are kept, the others are pruned out. Also, the join condition is explicit in the where clause: we want only the bindings from both sources that refer to the same medications (same text content of $rn, which describes the medication in the diseases document, and $b, which corresponds to the medication name in the relational table).
In all, this query asks for all contagious illnesses, and for those illnesses whose recommended medication exists in the medication table, the disease XML element corresponding to $d and the medication tuple, corresponding to the $m variable, are output in a <result> element.
Here is the SQL query corresponding to the for-where clause
of the original Quilt query:
| select | e0.elID as $d, e1.elID as $a, e2.elID as $p, e3.elID as $r, a0.attrID as $n, e4.elID as $rn, e5.elID as $m, e6.elID as $b, |
| null as dText, null as mText | |
| from | Document d0, URI u0, Value v0, TransClosure tc0, Element e0, QName q0, Value v0, |
| Child c0, Element e1, QName q1, Value v1, Child c1, Element e2, QName q2, Value v2, | |
| Child c2, Element e3, QName q3, Value v3, Attribute a0, Value v4, Value v5, | |
| Child c3, Element e4, QName q4, Value v6, Child c4, Value v7, | |
| Document d1, URI u1, Value v8, TransClosure tc1, Element e5, QName q5, Value v9, | |
| Child c5, Element e6, QName q6, Value v10, Contains fn0 | |
| where | d0.docURIID=u0.uriID and u0.uriValID=v0.valID and v0.value="diseases.xml" |
| and d0.docRootID=tc0.parentID and tc0.childID=e0.elID and e0.elQNameID=q0.qNameID and | |
| q0.qnLocalID="disease" and e0.elID=c0.parentID and c0.childID=e1.elID and | |
| e1.elQNameID=q1.qNameID and q1.qnLocalID=v1.valID and v1.value="article" and | |
| e1.elID=c1.parentID and c1.childID=e2.elID and e2.elQNameID=q2.qNameID and q2.qnLocalID=v2.valID and | |
| v2.value="paragraph" and e0.elID=c2.parentID and c2.childID=e3.elID and e3.elQNameID=q3.qNameID and | |
| q3.qnLocalID=v3.valID and v3.value="related_medications" and a0.attrElID=e3.elID and a0.attrNameID=v4.valID and | |
| v4.value="name" and a0.attrValID=v5.valID and e3.elID=c3.parentID and c3.childID=e4.elID and | |
| e4.elQNameID=q4.qNameID and q4.qnLocalID=v6.valID and v6.value="medication" and | |
| d1.docURIID=u1.uriID and u1.uriValID=v8.valID and v8.value="medication.xml" and | |
| d1.docRootID=tc1.parentID and tc1.childID=e5.elID and e5.elQNameID=q5.qNameID and | |
| q5.qnLocalID=v9.valID and v9.value="tuple" and c5.parentID=e5.elID and c5.childID=e6.elID and | |
| e6.elQNameID=q6.qNameID and q6.qnLocalID=v10.valID and v10.value="brand_name" and | |
| fn0.in0=e2.elID and fn0.in1="contagious" and fn0.out=true and c4.parentID=e3.elID and c4.childValID=v7.valID and v7.value=v5.value |
In this query, the red tables and conditions correspond to elements, attributes, and parent-child relationships in the "diseases.xml" document; the blue part of the query comes from the query fragment describing elements of the "medication.xml" document. We left in black the function Contains, which does not belong to any data source; on the last line, the actual join predicate is shown in green.
How is this query constructed ? We use a generic schema (described in the demo paper) that reflects the structure of an XML document and provides natural support for translating XML path expressions. For example, there is a relation Document(docID, docRootID, docURIID), a relation Element(elID, elQNameID) etc. To represent filiation in the original XML documents, joins on foreign keys are used throughout the query. Note that text content (direct #PCDATA children of one node) is represented by Element-Child-Value joins, as it is the case with e3, c4, and v7.
Constructing the remaining two queries Let us denote the query
above as "select e0.elID as $d, e1.elID as
$a, e2.elID as $p, e3.elID as $r, a0.attrID as $n, e4.elID as $rn, e5.elID
as $m, e6.elID as $b from F0 where W0". We construct the two other queries
simply by adding the extra joins that are needed to connect the binding
tuples with the data items needed in the result:
| select | e0.elID as $d, e1.elID as $a, e2.elID as $p, e3.elID as $r, a0.attrIDas $n, e4.elID as $rn, e5.elID as $m, e6.elID as $b, fn1.out as dText, null as mText |
| from | F0, FullText fn1 |
| where | fn1.in1=e0.elID |
| select | e0.elID as $d, e1.elID as $a, e2.elID as $p, e3.elID as $r, a0.attrIDas $n, e4.elID as $rn, e5.elID as $m, e6.elID as $b, null as dText, fn2.out as mText |
| from | F0, FullText fn2 |
| where | fn2.in1=e5.elID |
Having established these translations, the Quilt query is transformed
in the union of these three SQL queries on the generic schema: the query
corresponding to the for-where part, and one query per (non-constant) data
item needed in the result.
| select | e1.elID as $m, v4.value as generic_name, v5.value as prescribed_for |
| from | Document d0, URI u0, Value v0, Child c0, Element e0, QName q0, Value v1, Child c1, Element e1, QName q1, Value v2, |
| Child c2, Element e2, QName q2, Value v3, Child c3, Value v4, Child c4, Value v5 | |
| where | d0.docURIID=u0.uriID and u0.uriValID=v0.valID and v0.value="medication.xml" and d0.docRootID=e0.elID and |
| e0.elQNameID=q0.qNameID and q0.qnLocalID=v1.valID and v1.value="table" and c0.parentID=e0.elID and | |
| c0.childID=e1.elID and e1.elQNameID=q1.qNameID and q1.qnLocalID=v2.valID and v2.value="tuple" and | |
| c1.parentID=e1.elID and c1.childID=e2.elID and e2.elQNameID=q2.qNameID and q2.qnLocalID=v3.valID and | |
| v3.value="generic_name" and c2.parentID=e1.elID and c2.childID=e3.elID and e3.elQNameID=q3.qNameID and | |
| q3.qnLocalID=v3.valID and v3.value="prescribed_for" and c3.parentID=e2.elID and c3.childValID=v4.valID and | |
| c4.parentID=e3.elID and c4.childValID=v5.valID |
x.getElementByName(z)
which returns all the descendents of element x having the tag z.
| select | tc0.parentID as x, e1.elID as y, v0.value as z |
| from | TransClosure tc0, Element e1, QName q0, Value v0 |
| where | tc0.childID=e1.elID and e1.elQNameID=q0.qNameID and q0.qnLocalID=v0.valID |
A similar view definition exists for all the DOM API calls, and the view definitions are used to rewrite the query resulted from the translation of the Quilt query.
| select | c0.parent as $d, t0.element as $a, t1.element as $p, t2.element as $r, t3.element as $rn, null as dText, FnText(med) as mText |
| from | Medication med, Document d0, ElementByName ebn0, Children c0, Tag t0, Children c1, Tag t1, Children c2, Tag t2, Attribute a0, |
| Children c3, Tag t3, Contains fn0, Children c4 | |
| where | d0.docName="diseases.xml" and d0.docRoot=ebn0.ancestor and ebn0.tag="disease" and ebn0.descendent=c0.parent and |
| c0.child=t0.element and t0.tag="article" and c0.parent=c1.parent and c1.child=t1.element and t1.tag="paragraph" and | |
| c0.parent=c2.parent and c2.child=t2.element and t2.tag="related_medications" and c1.parent=a0.attrElem and | |
| a0.attrName="name" and c3.parent=c2.child and c3.child=t3.element and t3.tag="medication" and | |
| fn0.in0=t1.element and fn0.in1="contagious" and fn0.out=true and c4.parent=t2.element and c4.childVal=med.brand_name |
Notice that the elements from the "medication.xml" document have disappeared,
and have been replaced with an occurence of the Medication table. Also,
the FnText() function replaces the generic FullText() function: if FullText()
operates at the level of the virtual schema on any XML document (and constructs
its full textual representation), MedText() has to be implemented directly
at the level of the DBMS storing the Medication table, since its implementation
has to be aware of the actual storage of the data.
For a given ($d, $p, $a, $rn, $r) combination, the dText field is contained in tuples from the second union term, and the mText field is contained in tuples from the third union term. To group together these items, and to be able to output them together in a single <result> element, we perform a sort on the ($d, $p, $a, $rn, $r, label) fields of all tuples in the union. After the sort, using the techniques described here, the information contained in the tuples is assembled together to produce the desided result.
Comments or questions should go to Ioana
Manolescu.