This page provides an overview of the demo of the
XQueC system. This demo has been shown in september 2003 at the
VLDB Conference,
and in October 2003, at the French
BDA Conference.
Related publications
- Andrei Arion, Angela Bonifati, Ioana Manolescu and Andrea Pugliese: "Path Summaries and Path Partitioning in Modern XML Databases", World Wide Web Journal, volume 11, no 1, february 2008.
- Andrei Arion, Angela Bonifati, Ioana Manolescu and Andrea Pugliese: "XQueC: A Query-Conscious Compressed XML Database", in ACM TOIT, vol. 7, no. 4, November 2007
- Ioana Manolescu, Andrei Arion, Angela Bonifati and Andrea Pugliese: "Path Sequence-Based XML Query Processing", presented at BDA 2004 (informal proceedings).
- Andrei Arion, Angela Bonifati, Gianni Costa, Sandra d'Aguanno, Ioana Manolescu, and Andrea Pugliese: "XQueC: Pushing queries to compressed XML data", in VLDB 2003 and BDA 2003.
- Andrei Arion, Angela Bonifati, Gianni Costa, Sandra d'Aguanno, Ioana Manolescu and Andrea Pugliese: "Efficient query evaluation over compressed XML data", in EDBT 2004.
In the current page, we provide a short overview of XQueC's principles, and some snaphots
of the demo focus.
XQueC overview
XQueC is a system for storing and querying compressed XML data. In XQueC,
we choose to favor XML querying as much as possible, thus our compression
preserves individual access to each data item in the (compressed) XML document.
As a consequence, in XQueC we separate document structure from content,
and we compress mainly the content (#PCDATA nodes, in basic XML terminology).
Value compression in XQueC
We group all values found under the same root-to-leaf path in an XML document
in a value collection that we call a
container. For each container, a particular
compression algorithm is used: Huffman is one example, the Antoshenkov order-preserving
compresssion algorithm is another example.
We compress all values in a container using the same compression dictionary
(or compression map), but we store values on disk in a way that leaves them
individually accessible.
As a (default) redundant storage structure, we index all value containers
using B+ tree indices. Clearly, such indices are not required for query
processing, but may speed it up.
Storage encoding in XQueC
We assign structural identifiers to all XML elements inside a document.
Such identifiers consist of
[preOrder, postOrder, depth] triples; they allow
to efficiently reconstruct the structural relationships among two elements.
Furthermore, we partition the element identifiers according to their path
in the document. Thus, for each path in the document, we store a sequence
of element identifiers found on that paths; the sequence follows the preOrder
order, which is the natural document order.
Query processing in XQueC
Query processing in XQueC is similar to a general-purpose XML query engine,
but has the particular feature of manipulating compressed XML data as much
as possible, only decompressing it when this is strictly required (lazy decompression).
The storage and compression model of XQueC makes this possible, as element
identifiers on one hand, and values in containers on another hand, are each
individually accessible, and can be manipulated in their compact format.
Demo outline
The demo illustrates the sequence of steps when loading and querying an
XML document in XQueC.
Loading a document
The GUI provides a list of the XML documents already loaded in the database,
as well as some commands regarding documents and queries.
The "Result" pane at the bottom contains query results when they become available.
The "Console" pane provides debugging or information messages.
Loading a document
Assume we want to load a new XML document, following the XMark benchmark
XML type description. XMark deals with on-line auctions: items are registered
for sale in one out of the 5 continents, some auctions are still open, some
are closed, and registered users can participate in auctions as buyers or
sellers. "Load document" leads to choosing the XML file to load.
When the loading is finished, the document has been
stored, and XQueC has constructed two types of summaries for it; we describe
these summaries next.
Document-derived dataguide
First, XQueC builds a path summary, or simplified
dataguide, of the document. This path summary is built from the
document being read, and represents its structure. The "View dataguide" button
in the XQueC GUI lauches GhostView on the PostScript file containing the
document's dataguide.
In our case, the dataguide of an XMark XML document can be quite complex.
Below is a zoom of a particular area of the XMark document dataguide; it
is centered on the area including <person> elements, that is, the registered
users of the site.
In this dataguide drawing, the hollow nodes represent XML elements, the rounded
coloured nodes are XML attributes, while the coloured box nodes represent
text content inside an element. The color is dictated by the data type:
strings are depicted in light blue,
integers (like a person's age and zipcode) in yellow, and
real numbers (like a person's income) are drawn in green. Note
that the data type is deduced from the data, as in the XPRESS project; XQueC
does not rely on type information to infer atomic value types, although this
could be easily added.
On each such class of values, XQueC will apply a specific data compression
method. String containers will be compressed differently from integer containers,
but also among String containers, multiple compression choices are possible.
In particular, whenever a join on values from two containers is required
by a query, it will be possible to evaluate this join on the compressed values
iff the same compression algorithm was used for both containers. In
our case, a simple solution was taken: all containers are compressed using
the same compression algorithm (which is Huffman), so that any value join
may be evaluated in the compressed domain. Many other choices are possible.
Document-derived compression summary
Second, XQueC builds a compression summary of the document, showing what
compression ratio was achieved.
The image above shows, in blue, the structure and content size of the original
document, and in red, the compressed resulting size.
- The first column is
the sum of the remaining ones.
- The second column depicts the size occupied by the document structure: in
the original document, this is the total tag size. In the compressed document,
the structure is encoded as the sum of all element identifiers. Due to a
careful packaging of the [preOrder, postOrder, depth] numbers is as little
bytes as possible, the structure encoding of XQueC is very efficient, and
spares about 75% of the space occupancy.
- The third columns represents the space savings achieved by compressing the
strings found in the XMark documents. In our case, the compression ratio
is of about 60%. This not so impressive result can be explained in several
ways. First, XMark documents are generated randomly and have less repetitiveness
than regular English text (for example, the "the" word is almost never used
in XMark-generated text). Second, the fact that we compressed all containers
together deteriorates the compression ratio, as it is likely that values
within a particular container will have more mutual similarity than values
across all containers. Thus, the compression ratio is affected, at the expense
of being able to run *any* value-dependent query in the compressed domain
(obviously, queries that do not pose predicates on values, but only navigate
on the document structure, can be fully evaluated using structural identifiers,
without any decompression, either).
- The fourth and fifth column depict, respectively, the space occupancy of
integer and float values, in the original and in the compressed document.
Their weight, although non null, is very small in XMark documents; floats
are slightly more frequent than integers, but far less frequent than strings.
- The first column in the graph ("XML doc") sums up the compressed and uncompressed
components explained above. Overall, we achieve a compression ratio of about
50%, which is quite good given that many persistent XML storage systems have
a "bloating factor" of 2 to 10. Improving the string compression ratio, at
the expense of lazy evaluation, could further improve the picture.
Query processing
We now proceed to discussing query processing in XQueC. The "Load query"
button allows to load an XQuery query from an OS file:
The query can be then run using the "Execute Query" button, and the result
appears in the Result pane.
For more readability, here is XMark Q1:
for $b in document("XMark.xml")/site/people/person[id="person0"]
return $b/name;
This query asks for the name of the person having the identifier "person0".
While not very exciting by itself, this simple query allows to explain how
XQueC processes XML queries. To help the explanation, here is the query plan
graphical representation that pops up when pushing the "View QEP" button:
Let us read this QEP from the bottom up.
- At the bottom right, a ContainerAccess
operator performs an access to the values in the "site/people/person/@id"
container, that is, to the set of compressed attribute values found on that
path. ContainerAccess does this by compressing the "person0" search key,
and searching the container for records having that value, using the B+ tree
index built on the container.
- At the bottom left, a PathAccess
operator retrieves all [preOrder, postOrder, depth] identifiers of elements
found on the path /site/people/person. These are joined with the result of
the ContainerAccess operator, using the personID common fields. The
SemiJoin at the bottom has succeeded in selecting the identifiers
of the <person> elements having the @id attribute equal to "person0".
- The result of the semijoin is then materialized (violet round node),
then read by two different tuple consumers
(the hollow, round nodes).
- The leftmost one reads it as such. The rightmost one combines these
tuples with those resulting from a Scan on the container with the names of
all the persons. The combination is done through a structural join, following
a variant of the StackTreeDescendent algorithm. (A Sort was necessary on
the /site/people/person/name container, since the values were stored in data
order dictated by the B+ tree, while document order was required by StackTreeDescendent).
- The result of the leftmost read (providing the person identifiers)
and the result of the StackTreeDescendent join (pairing the persons with
their compressed names) are provided as input to
a SortedOuterUnion operator, that puts the tuples in a format
adapted to the final XML result. Notice that the two inputs of the SortedOuterUnion
are already sorted in document order, and the union is done in a pipelined
manner, on the fly (no intermediary results required).
- Only at this point, the results are decompressed,
meaning: the compressed person names are decompressed to their original
format. All manipulations have been done so far using only structural identifiers,
or compressed values; now is the time to decompress the names, in order to
compute the final XML result.
- The XMLize operator uses the data
pre-structured by the SortedOuterUnion, in order to construct XML elements.
The XMark query Q8, although much more complex, is treated in a similar manner.
Here is the query text:
for $p in document("XMark.xml")/site/people/person
let $a := for $t in document("XMark.xml")/site/closed_auctions/closed_auction
where $t/buyer/@person = $p/@id
return $t
return <item person="{$p/name/text()}"> {count ($a)} </item>;
In plain English, it asks for the number of auctions won by each person registered
at the auction site (this number can be 0 in some cases, if the person never
won an auction). The result returned by XQueC has the following aspect:
The query plan of Q8 follows below:
- At the bottom, a MergeJoin pairs
each person ID with the won auctions associated to that person
(ContScan of /site/closed_auctions/closed_auction/buyer/@person).
- Furthermore, all person identifiers are fetched by the
PathAccess operator, and combined (through an
OuterJoin ) to the persons having won some auctions.
- From this OuterJoin, some persons
will be paired with all the auctions they won, and the other ones with a
null value.
- The Aggregation operator counts
the non-null auctions associated with each person (the count() of the XQuery
query).
- This result (the person identifiers and the number of auctions they
won) is materialized, then joined (on
the rightmost branch) with the compressed name values; this pairs each person
with his/her name.
- Finally, all branches are fed into a SortedOuterUnion
operator as before, the compressed names are
decompressed, and the final XML
result is structured as required.
This document outlines the scenario demonstrated at VLDB 2003 and BDA 2003.
XQueC stores and compresses XML documents in a manner that fully supports
querying. Its main specific points are:
- focus on data compression, and use of structural identifiers to encode
document structure;
- fine-grained compression, allowing individual access to each compressed
value;
- path-driven storage, compression, statistics gathering, and query processing.
Other sample query plan images:
Back to my homepage.