The XQueC Demo


This page provides an overview of the demo of the XQueC system. This demo has been shown in september 2003  at the VLDB Conference, and in October 2003, at the French BDA Conference.

Related publications


In the current page, we provide a short overview of XQueC's principles, and some snaphots of the demo focus.


XQueC overview


XQueC is a system for storing and querying compressed XML data. In XQueC, we choose to favor XML querying as much as possible, thus our compression preserves individual access to each data item in the (compressed) XML document.
As a consequence, in XQueC we separate document structure from content, and we compress mainly the content (#PCDATA nodes, in basic XML terminology).

Value compression in XQueC

We group all values found under the same root-to-leaf path in an XML document in a value collection that we call a container. For each container, a particular compression algorithm is used: Huffman is one example, the Antoshenkov order-preserving compresssion algorithm is another example.
We compress all values in a container using the same compression dictionary (or compression map), but we store values on disk in a way that leaves them individually accessible.
As a (default) redundant storage structure, we index all value containers using B+ tree indices. Clearly, such indices are not required for query processing, but may speed it up.

Storage encoding in XQueC

We assign structural identifiers to all XML elements inside a document. Such identifiers consist of [preOrder, postOrder, depth] triples; they allow to efficiently reconstruct the structural relationships among two elements. 
Furthermore, we partition the element identifiers according to their path in the document. Thus, for each path in the document, we store a sequence of element identifiers found on that paths; the sequence follows the preOrder order, which is the natural document order.

Query processing in XQueC

Query processing in XQueC is similar to a general-purpose XML query engine, but has the particular feature of manipulating compressed XML data as much as possible, only decompressing it when this is strictly required (lazy decompression). The storage and compression model of XQueC makes this possible, as element identifiers on one hand, and values in containers on another hand, are each individually accessible, and can be manipulated in their compact format.


Demo outline


The demo illustrates the sequence of steps when loading and querying an XML document in XQueC.

Loading a document

The GUI provides a list of the XML documents already loaded in the database, as well as some commands regarding documents and queries.


[Initial XQueC screen]


The "Result" pane at the bottom contains query results when they become available. The "Console" pane provides debugging or information messages.

Loading a document

Assume we want to load a new XML document, following the XMark benchmark XML type description. XMark deals with on-line auctions: items are registered for sale in one out of the 5 continents, some auctions are still open, some are closed, and registered users can participate in auctions as buyers or sellers. "Load document" leads to choosing the XML file to load.

[Loading XMark.xml]


When the loading is finished, the document has been stored, and XQueC has constructed two types of summaries for it; we describe these summaries next.

Document-derived dataguide

First, XQueC builds a path summary, or simplified dataguide, of the document. This path summary is built from the document being read, and represents its structure. The "View dataguide" button in the XQueC GUI lauches GhostView on the PostScript file containing the document's dataguide.
In our case, the dataguide of an XMark XML document can be quite complex. Below is a zoom of a particular area of the XMark document dataguide; it is centered on the area including <person> elements, that is, the registered users of the site.


[Fragment of XMark dataguide]


In this dataguide drawing, the hollow nodes represent XML elements, the rounded coloured nodes are XML attributes, while the coloured box nodes represent text content inside an element. The color is dictated by the data type: strings are depicted in light blue, integers (like a person's age and zipcode) in yellow, and real numbers (like a person's income) are drawn in green. Note that the data type is deduced from the data, as in the XPRESS project; XQueC does not rely on type information to infer atomic value types, although this could be easily added.

On each such class of values, XQueC will apply a specific data compression method. String containers will be compressed differently from integer containers, but also among String containers, multiple compression choices are possible.  In particular, whenever a join on values from two containers is required by a query, it will be possible to evaluate this join on the compressed values iff the same  compression algorithm was used for both containers. In our case, a simple solution was taken: all containers are compressed using the same compression algorithm (which is Huffman), so that any value join may be evaluated in the compressed domain. Many other choices are possible.


Document-derived compression summary

Second, XQueC builds a compression summary of the document, showing what compression ratio was achieved.

[Compression statistics for XMark.xml]


The image above shows, in blue, the structure and content size of the original document, and in red, the compressed resulting size.
  • The first column is the sum of the remaining ones.
  • The second column depicts the size occupied by the document structure: in the original document, this is the total tag size. In the compressed document, the structure is encoded as the sum of all element identifiers. Due to a careful packaging of the [preOrder, postOrder, depth] numbers is as little bytes as possible, the structure encoding of XQueC is very efficient, and spares about 75% of the space occupancy.
  • The third columns represents the space savings achieved by compressing the strings found in the XMark documents. In our case, the compression ratio is of about 60%. This not so impressive result can be explained in several ways. First, XMark documents are generated randomly and have less repetitiveness than regular English text (for example, the "the" word is almost never used in XMark-generated text). Second, the fact that we compressed all containers together deteriorates the compression ratio, as it is likely that values within a particular container will have more mutual similarity than values across all containers. Thus, the compression ratio is affected, at the expense of being able to run *any* value-dependent query in the compressed domain (obviously, queries that do not pose predicates on values, but only navigate on the document structure, can be fully evaluated using structural identifiers, without any decompression, either).
  • The fourth and fifth column depict, respectively, the space occupancy of integer and float values, in the original and in the compressed document. Their weight, although non null, is very small in XMark documents; floats are slightly more frequent than integers, but far less frequent than strings.
  • The first column in the graph ("XML doc") sums up the compressed and uncompressed components explained above. Overall, we achieve a compression ratio of about 50%, which is quite good given that many persistent XML storage systems have a "bloating factor" of 2 to 10. Improving the string compression ratio, at the expense of lazy evaluation, could further improve the picture.


Query processing

We now proceed to discussing query processing in XQueC. The "Load query" button allows to load an XQuery query from an OS file:

[Loading XMark Q1]


The query can be then run using the "Execute Query" button, and the result appears in the Result pane.


[Result of Q1]

For more readability, here is XMark Q1:

for $b in document("XMark.xml")/site/people/person[id="person0"]
return $b/name;


This query asks for the name of the person having the identifier "person0". While not very exciting by itself, this simple query allows to explain how XQueC processes XML queries. To help the explanation, here is the query plan graphical representation that pops up when pushing the "View QEP" button:

[Query plan for Q1]

Let us read this QEP from the bottom up.

  1. At the bottom right, a ContainerAccess operator performs an access to the values in the "site/people/person/@id" container, that is, to the set of compressed attribute values found on that path. ContainerAccess does this by compressing the "person0" search key, and searching the container for records having that value, using the B+ tree index built on the container.
  2. At the bottom left, a PathAccess operator retrieves all [preOrder, postOrder, depth] identifiers of elements found on the path /site/people/person. These are joined with the result of the ContainerAccess operator, using the personID common fields. The SemiJoin at the bottom has succeeded in selecting the identifiers of the <person> elements having the @id attribute equal to "person0".
  3. The result of the semijoin is then materialized (violet round node), then read by two different tuple consumers (the hollow, round nodes).
  4. The leftmost one reads it as such. The rightmost one combines these tuples with those resulting from a Scan on the container with the names of all the persons. The combination is done through a structural join, following a variant of the StackTreeDescendent algorithm. (A Sort was necessary on the /site/people/person/name container, since the values were stored in data order dictated by the B+ tree, while document order was required by StackTreeDescendent).
  5. The result of the leftmost read (providing the person identifiers) and the result of the StackTreeDescendent join (pairing the persons with their compressed names) are provided as input to a SortedOuterUnion operator, that puts the tuples in a format adapted to the final XML result. Notice that the two inputs of the SortedOuterUnion are already sorted in document order, and the union is done in a pipelined manner, on the fly (no intermediary results required).
  6. Only at this point, the results are decompressed, meaning: the compressed person names are decompressed to their original format. All manipulations have been done so far using only structural identifiers, or compressed values; now is the time to decompress the names, in order to compute the final XML result.
  7. The XMLize operator uses the data pre-structured by the SortedOuterUnion, in order to construct XML elements.

The XMark query Q8, although much more complex, is treated in a similar manner.  Here is the query text:

for $p in document("XMark.xml")/site/people/person
let $a := for $t in document("XMark.xml")/site/closed_auctions/closed_auction
            where $t/buyer/@person = $p/@id
            return $t
return <item person="{$p/name/text()}"> {count ($a)} </item>;



In plain English, it asks for the number of auctions won by each person registered at the auction site (this number can be 0 in some cases, if the person never won an auction). The result returned by XQueC has the following aspect:




The query plan of Q8 follows below:

[Plan for XMark Q8]


  1. At the bottom, a MergeJoin pairs each person ID with the  won auctions associated to that person (ContScan of /site/closed_auctions/closed_auction/buyer/@person).
  2. Furthermore, all person identifiers are fetched by the PathAccess operator, and combined (through an OuterJoin ) to the persons having won some auctions.
  3. From this OuterJoin, some persons will be paired with all the auctions they won, and the other ones with a null value.
  4. The Aggregation operator counts the non-null auctions associated with each person (the count() of the XQuery query).
  5. This result (the person identifiers and the number of auctions they won) is materialized, then joined (on the rightmost branch) with the compressed name values; this pairs each person with his/her name.
  6. Finally, all branches are fed into a SortedOuterUnion operator as before, the compressed names are decompressed, and the final XML result is structured as required.

This document outlines the scenario demonstrated at VLDB 2003 and BDA 2003. XQueC stores and compresses XML documents in a manner that fully supports querying. Its main specific points are:
  • focus on data compression, and use of structural identifiers to encode document structure;
  • fine-grained compression, allowing individual access to each compressed value;
  • path-driven storage, compression, statistics gathering, and query processing.


Other sample query plan images:


Back to my homepage.