The Agora project

Publications:

  • "Answering XML Queries over Heterogeneous Data Sources", accepted for publication in the VLDB conference, 2001. This paper details the query processing framework of Agora. We focus on query translation from the new XML query language to be standardized by the World Wide Web Consortium (W3C) into SQL. (PostScript, PDF).
  • "Agora: Living with XML and Relational", software demonstration at VLDB 2000 (PostScript, PDF). A detailed version of the demo (as you can find it on the VLDB digital antology) is provided below.
  • "Pushing XML Queries inside Relational Databases", Technical Report no. 4211, INRIA, 2001 (PostScript, PDF). This report investigates the usage of Agora as a XML layer on top of relational DBMSs.

  • Data Integration: Agora is the first data integration system with a global XML schema adopting the "local-as-view" approach, in which the global schema is defined as a view over the local schemas.
    Data Sources: relational (from RDBMS to everything that can be wrapped as a table), and DOM-compliant (most importantly XML or anything wrapped as).
    Query Language: We had started with Quilt , whose syntax became the basis of the XML query language being standardized by the W3C; this standard language is called XQuery. We are adapting to the official XQuery syntax - really easy :-)
    Query Processing Algorithms: all relational. Yes, relational ! Check our technical report for more details.

    Links

     
    If you are interested in good surveys on data integration and a comparison of the "local-as-view" approach with alternative approaches,  there are two important surveys: one by Jeffrey Ullmann, and another one by Alon Levy.

    You can have a look at what the World Wide Web Consortium Working Group on XML Query Languages (aka W3C XML QWG) is doing. Normally, they're bubbling with activity. Most importantly, they are working on an algebra, and on a data model for XML documents, to be used together to interpret what an XQuery query means.

    Who is responsible for all this:

  • Ioana Manolescu
  • Daniela Florescu
  • Donald Kossmann
  • Dan Olteanu
  • Florian Xhumari
  • Lucian Precup

  •  
     



     
     

    Agora: Living with XML and Relational

    Ioana Manolescu, Daniela Florescu, Donald Kossmann, Florian Xhumari, Dan Olteanu

    Software Demonstration at VLDB 2000
     

    We demonstrate a data integration system capable of answering XML queries over both relational and XML data, by using relational query processing techniques.
     

    The query language for XML queries is Quilt; we query our relational data sources under an XML representation (the user can supply one, or we use a simple default representation).
     

    Query processing outline

    A Quilt query is taken through the following steps: The demo poster provides a visual representation of these steps.
     

    Demonstration scenario

    We illustrate the whole query processing chain on the following data collections:
     

    Quilt query

    The following query implements a join between two XML documents: the XML representation of the table, and the real XML document. For brevity, we show directly the simplified form.

    for $d in document("diseases.xml")//diseases,
         $a in $d/article,
         $p in $a/paragraph,
         $r in $d/related_medications/,
         $n in $a@name,
         $rn in $r/medication,

         $m in document("medication.xml")//tuple,
         $b in $m/brand_name
    where contains($p, "contagious") and content($rn)=content($b)
    return <result>
                    fullText($d),
                    fullText($m)
           </result>

    The first block of for makes the variables iterate over elements from the "diseases.xml" document. Thus, $d iterates over all the diseases, $a over all articles concerning this disease, $r over the global list of medications related to it, while $rn iterates over the actual set of medications; $n is bound to the name attribute of the $d disease element. The second for block iterates over all the tuple elements and extracts the name of the medications (remember that the relational table we consider contains medication information).  From the complete for block result tuples of bindings for all the variables present inside the block.

    The where condition performs  a selection on the binding tuples: only those for which $p corresponds to a paragraph containing the word "contagious" are kept, the others are pruned out. Also, the join condition is explicit in the where clause: we want only the bindings from both sources that refer to the same medications (same text content of $rn, which describes the medication in the diseases document, and $b, which corresponds to the medication name in the relational table).

    In all, this query asks for all contagious illnesses, and for those illnesses whose recommended medication exists in the medication table, the disease XML element corresponding to $d and the medication tuple, corresponding to the $m variable, are output in a <result> element.

    Translating the Quilt query to SQL on the generic schema

    General framework

    The translation of the for-where clause is quite easy, because it is conceptually close to SQL: variables iterate over collections, some selections on variable tuples are applied, and the result consists of bindings of tuples. Translating the return clause is more ellaborate, since we have to construct hierarchical XML structure from binding tuples. To do this, we use the "sorted outer union" approach proposed by a research group from IBM Almaden. The idea is the following. To translate a query whose return clause constructsXML structure, a single query is issued first, that returns the binding tuples produced by the for-where block; then, a separate query is issued to retrieve each piece of information that is needed in the result. The tagging module structures the information extracted by these different queries into the desired hierarchical result.
    In our case, the return clause needs two distinct pieces of data: the full textual representation of the disease element, and the textual representation of the medication element.  We issue therefore three SQL queries: one corresponding to the for-where block, and one for each data item needed in the result.

    Here is the SQL query corresponding to the for-where clause of the original Quilt query:
     
    select e0.elID as $d, e1.elID as $a, e2.elID as $p, e3.elID as $r, a0.attrID as $n, e4.elID as $rn, e5.elID as $m, e6.elID as $b,
    null as dText, null as mText
    from Document d0, URI u0, Value v0, TransClosure tc0, Element e0, QName q0, Value v0,
    Child c0, Element e1, QName q1, Value v1, Child c1, Element e2, QName q2, Value v2,
    Child c2, Element e3, QName q3, Value v3,  Attribute a0, Value v4, Value v5,
    Child c3, Element e4, QName q4, Value v6, Child c4, Value v7,
    Document d1, URI u1, Value v8, TransClosure tc1, Element e5, QName q5, Value v9,
    Child c5, Element e6, QName q6, Value v10, Contains fn0
    where d0.docURIID=u0.uriID and u0.uriValID=v0.valID and v0.value="diseases.xml" 
    and d0.docRootID=tc0.parentID and tc0.childID=e0.elID and e0.elQNameID=q0.qNameID and 
    q0.qnLocalID="disease" and e0.elID=c0.parentID and c0.childID=e1.elID and 
    e1.elQNameID=q1.qNameID and q1.qnLocalID=v1.valID and v1.value="article" and 
    e1.elID=c1.parentID and c1.childID=e2.elID and e2.elQNameID=q2.qNameID and q2.qnLocalID=v2.valID and
    v2.value="paragraph" and e0.elID=c2.parentID and c2.childID=e3.elID and e3.elQNameID=q3.qNameID and 
    q3.qnLocalID=v3.valID and v3.value="related_medications" and a0.attrElID=e3.elID and a0.attrNameID=v4.valID and
    v4.value="name" and a0.attrValID=v5.valID and e3.elID=c3.parentID and c3.childID=e4.elID and 
    e4.elQNameID=q4.qNameID and q4.qnLocalID=v6.valID and v6.value="medication" and
    d1.docURIID=u1.uriID and u1.uriValID=v8.valID and v8.value="medication.xml" and 
    d1.docRootID=tc1.parentID and tc1.childID=e5.elID and e5.elQNameID=q5.qNameID and 
    q5.qnLocalID=v9.valID and v9.value="tuple" and c5.parentID=e5.elID and c5.childID=e6.elID and
    e6.elQNameID=q6.qNameID and q6.qnLocalID=v10.valID and v10.value="brand_name" and
    fn0.in0=e2.elID and fn0.in1="contagious" and fn0.out=true and c4.parentID=e3.elID and c4.childValID=v7.valID and v7.value=v5.value

    In this query, the red tables and conditions correspond to elements, attributes, and parent-child relationships in the "diseases.xml" document; the blue part of the query comes from the query fragment describing elements of the "medication.xml" document. We left in black the function Contains, which does not belong to any data source; on the last line, the actual join predicate is shown in green.

    How is this query constructed ? We use a generic schema (described in the demo paper) that reflects the structure of an XML document and provides natural support for translating XML path expressions. For example, there is a relation Document(docID, docRootID, docURIID), a relation Element(elID, elQNameID) etc. To represent filiation in the original XML documents, joins on foreign keys are used throughout the query. Note that text content (direct #PCDATA children of one node) is represented by  Element-Child-Value joins, as it is the case with e3, c4, and v7.

    Constructing the remaining two queries Let us denote the query above as "select e0.elID as  $d, e1.elID as $a, e2.elID as $p, e3.elID as $r, a0.attrID as $n, e4.elID as $rn, e5.elID as $m, e6.elID as $b from F0 where W0". We construct the two other queries simply by adding the extra joins that are needed to connect the binding tuples with the data items needed in the result:
     
     
    select  e0.elID as  $d, e1.elID as $a, e2.elID as $p, e3.elID as $r, a0.attrIDas $n, e4.elID as $rn, e5.elID as $m, e6.elID as $b, fn1.out as dText, null as mText 
    from F0, FullText fn1
    where fn1.in1=e0.elID

     
     
    select  e0.elID as  $d, e1.elID as $a, e2.elID as $p, e3.elID as $r, a0.attrIDas $n, e4.elID as $rn, e5.elID as $m, e6.elID as $b, null as dText, fn2.out as mText
    from F0, FullText fn2
    where fn2.in1=e5.elID

    Having established these translations, the Quilt query is transformed in the union of these three SQL queries on the generic schema: the query corresponding to the for-where part, and one query per (non-constant) data item needed in the result.
     

    Rewriting the query on the generic schema to real data sources

    At this stage, we use standard relational query rewriting algorithms (for bag semantics). We therefore need to show how are the data sources defined as SQL views over the generic schema.

    View definition for the real relational table

    We show below the view definition for the Medication table presented above. For brevity, we show the view definition corresponding to the table having just two columns, "generic_name" and "prescribed_for":
     
    select e1.elID as $m, v4.value as generic_name, v5.value as prescribed_for
    from Document d0, URI u0, Value v0, Child c0, Element e0, QName q0, Value v1, Child c1, Element e1, QName q1, Value v2,
    Child c2, Element e2, QName q2, Value v3, Child c3, Value v4, Child c4, Value v5
    where d0.docURIID=u0.uriID and u0.uriValID=v0.valID and v0.value="medication.xml" and d0.docRootID=e0.elID and
    e0.elQNameID=q0.qNameID and q0.qnLocalID=v1.valID and v1.value="table" and c0.parentID=e0.elID and 
    c0.childID=e1.elID and e1.elQNameID=q1.qNameID and q1.qnLocalID=v2.valID and v2.value="tuple" and 
    c1.parentID=e1.elID and c1.childID=e2.elID and e2.elQNameID=q2.qNameID and q2.qnLocalID=v3.valID and 
    v3.value="generic_name" and c2.parentID=e1.elID and c2.childID=e3.elID and e3.elQNameID=q3.qNameID and 
    q3.qnLocalID=v3.valID and v3.value="prescribed_for" and c3.parentID=e2.elID and c3.childValID=v4.valID and
    c4.parentID=e3.elID and c4.childValID=v5.valID

    View definitions for the XML data sources

    An XML document is accessed via the DOM API; once parsed, it becomes an in-memory object on which DOM calls can be made to retrieve, say, the attributes of a given element, or all its descendents having a specific tag etc.  We model each of these calls as a table, in which certain attributes have to be supplied (by joins or selections) in order to obtain values for the other attributes.  For example, here is the view definition for the call

    x.getElementByName(z)

    which returns all the descendents of element x having the tag z.
     
    select tc0.parentID as x, e1.elID as y, v0.value as z
    from TransClosure tc0, Element e1, QName q0, Value v0
    where tc0.childID=e1.elID and e1.elQNameID=q0.qNameID and q0.qnLocalID=v0.valID

    A similar view definition exists for all the DOM API calls, and the view definitions are used to rewrite the query resulted from the translation of the Quilt query.

    Rewritten Quilt query

    To rewrite the union of SQL terms, each term is rewritten in isolation and the union of the rewritings is then taken.  For illustration, here is the rewriting of the last union term resulting from the translated query, as described above:
     
    select c0.parent as $d, t0.element as $a, t1.element as $p, t2.element as $r, t3.element as $rn, null as dText, FnText(med) as mText
    from Medication med, Document d0, ElementByName ebn0, Children c0, Tag t0, Children c1,  Tag t1, Children c2,  Tag t2, Attribute a0,
    Children c3,  Tag t3,  Contains fn0,  Children c4
    where d0.docName="diseases.xml" and d0.docRoot=ebn0.ancestor and ebn0.tag="disease" and ebn0.descendent=c0.parent and
    c0.child=t0.element and t0.tag="article" and c0.parent=c1.parent and c1.child=t1.element and t1.tag="paragraph" and
    c0.parent=c2.parent and c2.child=t2.element and t2.tag="related_medications" and  c1.parent=a0.attrElem and 
    a0.attrName="name" and c3.parent=c2.child and c3.child=t3.element and t3.tag="medication" and
    fn0.in0=t1.element and fn0.in1="contagious" and fn0.out=true and c4.parent=t2.element and c4.childVal=med.brand_name

    Notice that the elements from the "medication.xml" document have disappeared, and have been replaced with an occurence of the Medication table. Also, the FnText() function replaces the generic FullText() function: if FullText() operates at the level of the virtual schema on any XML document (and constructs its full textual representation), MedText() has to be implemented directly at the level of the DBMS storing the Medication table, since its implementation has to be aware of the actual storage of the data.
     

    Query Optimization and execution

    At this stage we apply an optimization algorithm based on dynamic programming, searching in the space of bushy trees. A particularity of our optimization algorithm is that it has to take into consideration access restrictions, like the one encountered for the ElementByName table: a scan of this table is not possible, rather, the ancestor and tag attributes have to be specified in order to get bindings for the descendents.
    The relational query execution follows the regular iterator model.
     

    Tagging the result

    The execution of the three relational queries (resulting from the rewriting phase) yields three streams of tuples. The first stream contains all tuple bindings resulting from the for-where clause. The second stream contains those binding tuples that could be extended with  FullText($m); in our case, this extension was possible for every binding of $d, so all binding tuples from the first union term are present in the second. In the general case, however, the second term introduces extra joins that may erase some binding combinations from the first term. Similarly, the third union term contains binding tuples extended with the textual representation of $m.
    These tuples coming from different streams contain different informations; to be able to distinguish among them, we introduce a numerical constant column, called "label", which has the value 0 for all tuples coming from the first term, 1 for the tuples coming from the next one etc.

    For a given ($d, $p, $a, $rn, $r) combination, the dText field is contained in tuples from the second union term, and the mText field is contained in tuples from the third union term. To group together these items, and to be able to output them together in a single <result> element, we perform a sort on the ($d, $p, $a, $rn, $r, label) fields of all tuples in the union. After the sort, using the techniques described here, the information contained in the tuples is assembled together to produce the desided result.


    Comments or questions should go to Ioana Manolescu.