Raymond K. Wong, Franky Lam, William M. Shui
National ICT Australia
{raymond.wong, franky.lam, bill.shui}@nicta.com.au
Categories and Subject Descriptors: H.3.2 [Information Systems]: Information Storage; H.2.4.n [Textual Databases]: XML Databases
General Terms: Algorithms, Design, Performance
When looking for a compact storage scheme for XML, several issues need to be addressed. For example, the scheme has to support fast operations, especially when we are considering software applications that target people on the move. Moreover, if intensive compression methods are employed, they need to be optional, i.e., able to be switched on or off, because of the low computation power of some mobile devices. In summary, from our experience, the major issues are fast query and navigation operations, optional compression, and efficient updates.
In this paper, we propose a compact XML storage engine, called ISX (for Integrated Succinct XML system), to store XML in a more concise structure and address all of the above issues. Theoretically, ISX uses an amount of space near the information theoretic minimum on random trees. For a constant $\epsilon$, where $0 < \epsilon \le 1$, and a document with $n$ nodes, we need $(2 + \epsilon)n$ bits to represent the topology of the XML document. Node insertions can be handled in constant time on average but worst case $O(\log^2 n)$ time, and all node navigation operations take worst case $O(\log n)$ time but constant time on average.
The rest of this paper is organized as follows: Section 2 summarizes relevant work in the field. Section 3 presents the basics of ISX and its topology layer. The fast node navigation operators, the querying interfaces and the update mechanism are then described in detail in Section 4. Finally, Section 5 presents the experimental results and Section 6 concludes the paper.
Related work that shares the same motivations as this paper includes Maneth et al. [17], Tolani and Haritsa [22], Min et al. [19] and Buneman et al. [4]. Compared to XMill [16], XGrind [22] has a lower compression ratio but supports certain types of queries. XPRESS [19] uses reverse arithmetic encoding to encode tags using start/end regions. Both XGrind and XPRESS require top-down query evaluation, and do not support set-based query evaluation such as structural joins.
Buneman et al. [4] separate the tree structure from its data, and then use bisimulation to compress documents that share the same sub-trees; however, they can only support node navigation in linear time. With a similar idea but a different technique, Maneth et al. [17,5] also compress XML by calculating the minimal sharing graph equivalent to the minimal regular tree grammar. In order to provide tree navigation, a DOM proxy that maintains runtime traversal information is needed [5]. Since only the compression efficiency was reported in their paper, both the query and navigation performance of their proposed scheme are unclear.
Most XML storage schemes, such as [15,9,10,12], make use of interval and preorder/postorder labeling schemes to support constant time order lookup, but fail to address the issue of maintaining these labels during updates. Recently, Silberstein et al. [21] proposed a data structure to handle ordered XML which guarantees both update and lookup costs. Similarly, the L-Tree labeling scheme proposed by Chen et al. [6] addresses the same problem and has the same time and space complexity as [21]; however, it does not support persistent identifiers. The major difference between our proposal and these two works is that we try to minimize space usage while allowing efficient access, query and update of the database. In this paper, we show that our proposed topology representation costs linear space while [21] costs $O(n \log n)$ space.
The work most closely related to this paper regarding databases with efficient storage is from Zhang et al. [23]. Their succinct approach targeted secondary storage, and used a balanced parentheses encoding for each block of data. Unfortunately, their summary and partition schemes support rank and select operations in linear time only. Their approach also uses the Dewey encoding for node identifiers in their indexes. The drawbacks of the Dewey encoding are significant: updates to the labels require linear time, and the size of the labels is also linear in the size of the database in the worst case. Thus, the storage of the topology can require quadratic space in the worst case.
Finally, there are several related proposals published recently, e.g., [9,8]. The authors of [9] show that all XPath axes can be handled using a preorder/postorder labeling. Instead of maintaining these two labels (i.e., two integers) per node, our proposed scheme requires less than 3 bits per node to process all XPath axes, which is an attractive alternative for applications that are both space and performance conscious.
Ferragina et al. [8] first shred the XML tree into a table of two columns, then sort and compress the columns individually. Their approach does not offer an immediate capability to navigate or search the XML data unless an extra index is built. However, the extra index will degrade the overall storage size (i.e., the compression ratio). Furthermore, the times for disk access and decompression of local regional blocks were omitted from their experiments. As a result, the performance of actual applications may differ from what the experiments showed. As with most other related work, data updates are disregarded.
This section describes the storage layer of the ISX system. It consists of three layers, namely, the topology layer, the internal node layer, and the leaf node layer. In Figure 3, the topology layer stores the tree structure of the XML document, and facilitates fast navigational accesses, structural joins and updates. The internal node layer stores the XML elements, attributes, and signatures of the text data for fast text queries. Finally, the leaf node layer stores the actual text data. Text data can be compressed by various common compression techniques and is referenced by the topology layer.
Jacobson [11] showed that the lower bound space requirement for representing a binary tree is $\log_2 C_n \approx 2n - O(\log n)$ bits, where the Catalan number $C_n = \frac{1}{n+1}\binom{2n}{n}$ is the number of possible binary trees over $n$ nodes.
Our storage scheme is based on the balanced parentheses encoding from [14] to represent the topology of XML. Unlike [14], however, our topology layer (Figure 3) supports efficient node navigation and updates.
The balanced parentheses encoding used in tier 0 reflects the nesting of element nodes within any XML document and can be obtained by a preorder traversal of the tree: we output an open parenthesis when we encounter an opening tag and a close parenthesis when we encounter a closing tag. In Figure 3, the topology of a DBLP XML fragment shown in Figure 1 is represented in tier 0 using the balanced parentheses encoding. In our implementation, we use a single bit 0 to represent an open parenthesis and a single bit 1 to represent a close parenthesis.
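As an illustration, the tier 0 bit string can be produced during a single preorder traversal. The following C++ sketch shows the idea; the Node type is hypothetical (the real ISX loader parses XML directly with Expat rather than building an intermediate tree):

```cpp
#include <string>
#include <vector>

// Minimal illustrative node type, not the actual ISX representation.
struct Node {
    std::string label;
    std::vector<Node> children;
};

// Preorder traversal: append bit 0 for an opening tag and
// bit 1 for the matching closing tag.
void encodeTopology(const Node& n, std::vector<bool>& bits) {
    bits.push_back(false);            // '(' -> 0
    for (const Node& c : n.children)
        encodeTopology(c, bits);
    bits.push_back(true);             // ')' -> 1
}
```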
We avoid any pointer based approach to link a parenthesis to its label, as it would increase the space usage from $O(n)$ to a less desirable $O(n \log n)$ bits. As our representation of the topology also does not include a $\lceil \log_2 n \rceil$ bit persistent object identifier for each node in the document, we must find a way to link the open parenthesis of a node in tier 0 to the actual label itself. To address this, we adopt the approach from Munro's work [20], although they do not use a balanced parentheses encoding; instead, they control the topology size by using multiple layers of variable-sized pointers, and may require many levels of indirection. In addition, we make the element structure an exact mirror of the topology structure instead of mirroring the pointers. This allows us to find the appropriate label for a node by simply fetching the entry at the corresponding position of the element structure. As mentioned earlier, a pointer based approach would require $O(n \log n)$ bits of space, which is undesirable. The next issue is to handle the variable length of XML element labels. We adopt the approach taken in previous work [22,23], and maintain a symbol table, using a hash table to map the labels into a domain of fixed size. In the worst case, this does not reduce the space usage, as every node can have its own unique label. In practice, however, XML documents tend to have a very small number of unique labels. Therefore, we can assume that the number of unique labels used in the internal nodes (denoted $|S|$ below) is very small, and essentially constant. This approach allows us to have fixed size records in the internal node layer.
Note that each element in the XML document actually has two available entries in the array, corresponding to the opening and closing tags. We could thus make the size of each entry $\lceil \log_2 |S| \rceil / 2$ bits, and split the identifier for each element over its two entries. However, the two entries are not in general adjacent to each other, and hence splitting the identifier would slow down lookups, as we would need to find the closing tag corresponding to each opening tag, and it would also decrease cache locality. Hence, we prefer to use entries of $\lceil \log_2 |S| \rceil$ bits and leave the second entry set to zero; this also provides us with some slack in the event that new element labels are introduced by updates.
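To make the mirroring concrete, the following sketch (all names hypothetical, not the actual ISX code) shows a symbol table that interns labels as fixed-width codes, and a label lookup that simply indexes the element layer at the position of a node's open parenthesis:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical symbol table: maps each distinct element label to a
// fixed-width integer code, as described above.
struct SymbolTable {
    std::unordered_map<std::string, uint32_t> toCode;
    std::vector<std::string> toLabel;
    uint32_t intern(const std::string& label) {
        auto it = toCode.find(label);
        if (it != toCode.end()) return it->second;
        uint32_t code = static_cast<uint32_t>(toLabel.size());
        toCode.emplace(label, code);
        toLabel.push_back(label);
        return code;
    }
};

// The element layer mirrors the parentheses array one-to-one: entry i
// holds the label code of the node whose open parenthesis is at bit i
// (the entry under the close parenthesis is left as zero).
std::string labelOf(size_t openParenPos,
                    const std::vector<uint32_t>& elementLayer,
                    const SymbolTable& symbols) {
    return symbols.toLabel[elementLayer[openParenPos]];
}
```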
Since text nodes are also leaf nodes, they are represented as pairs of adjacent unused entries in the internal node layer. We thus choose to make use of this ``wasted'' space by storing a hash value of the text node, of size $2\lceil \log_2 |S| \rceil$ bits. This can be used in queries that test equality of text nodes, such as //*[year="2003"], by scanning the hash values before scanning the actual data, significantly reducing the lookup time. Since texts are treated independently from the topology and node layers, they can optionally be compressed by any compression scheme. Instead of employing more sophisticated compression techniques such as BWT [8] that are relatively slow on mobile devices, a standard Lempel-Ziv compression method (e.g., gzip) is used in this paper.
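As an illustration of the hash pre-filter, the sketch below compares a truncated hash of the query string against the stored hash before touching the (possibly compressed) text; std::hash is used purely for illustration, as the paper does not specify ISX's hash function:

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Truncate a string hash to the b bits (b < 32 assumed here) that fit
// into the two unused element-layer entries of a text node.
uint32_t textHash(const std::string& text, unsigned bits) {
    return static_cast<uint32_t>(std::hash<std::string>{}(text))
           & ((1u << bits) - 1);
}

// Cheap pre-filter for equality predicates such as [year="2003"]:
// only when the stored hash matches do we fetch and decompress the
// actual text from the leaf node layer.
bool mightEqual(uint32_t storedHash, const std::string& query,
                unsigned bits) {
    return storedHash == textHash(query, bits);
}
```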
Given an arbitrary node $x$ of a large XML document, a navigation operator should be able to traverse back and forth across the entire document via the various step axes of node $x$. Some frequently used step axes for an XML document tree are parent, first-child, next-sibling, previous-sibling, next-following and next-preceding. These step axes can then be used to provide programming interfaces, such as the DOM API, for external access to the XML database.
Node navigation operators are described by the pseudo-code in Algorithm 1, which shows a tight coupling between the ISX topology layer primitives and the navigation operators. Each navigation operator in Algorithm 1 is mapped to a sequence of calls to the topology layer primitives described in Algorithm 2.
Node navigation operators are highly dependent on topology layer primitives such as FORWARDEXCESS and BACKWARDEXCESS. In the worst case, node navigation operators could take linear time. However, we can significantly improve the performance of the topology layer primitives by adding auxiliary data structures (tier 1 and tier 2 blocks) on top of the tier 0 layer described in Section 3.1.
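For intuition, here is a simplified, in-memory rendition of a few of these operators over a plain tier 0 bit vector. It ignores the tiered summaries, so like the unoptimized primitives it runs in linear time in the worst case; the bodies are our own sketch rather than the paper's Algorithm 1, and assume a well-formed parentheses sequence:

```cpp
#include <cstddef>
#include <vector>

// 0 = open parenthesis, 1 = close parenthesis (tier 0 convention).
using Bits = std::vector<bool>;

// Position of the close parenthesis matching the open one at p:
// scan forward until the running excess returns to zero.
size_t findClose(const Bits& t, size_t p) {
    int excess = 0;
    for (size_t i = p; i < t.size(); ++i) {
        excess += t[i] ? -1 : +1;
        if (excess == 0) return i;
    }
    return t.size();                      // malformed input
}

// First child: the bit right after an open parenthesis is the child's
// open parenthesis, unless it is our own close parenthesis (a leaf).
bool firstChild(const Bits& t, size_t p, size_t& out) {
    if (!t[p + 1]) { out = p + 1; return true; }
    return false;                         // p is a leaf
}

// Next sibling: the bit right after our matching close parenthesis.
bool nextSibling(const Bits& t, size_t p, size_t& out) {
    size_t q = findClose(t, p) + 1;
    if (q < t.size() && !t[q]) { out = q; return true; }
    return false;                         // no following sibling
}
```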
Figure 4 presents the auxiliary tiers 1 and 2, where each tier contains contiguous arrays of tuples, with each tuple holding summary information of one block in the lower tier. Tier 0 in the figure corresponds to the balanced parentheses encoding of the topology of the XML document, which was described in Section 3. Each tier 1 block stores an array of tier 0 tuples $(T^0_1, \ldots, T^0_k)$, where $k$ is the maximum number of tuples allowed per tier 1 block. Each $T^0_i$ for $1 \le i \le k$ is defined as $(l_i, r_i, \overrightarrow{m}_i, \overrightarrow{M}_i, \overleftarrow{m}_i, \overleftarrow{M}_i, d_i)$, and the density of each tier 0 block can be calculated using the formula $(l_i + r_i)/B$, where $B$ is the block capacity in bits. For each tier 0 tuple:
- $l_i$ is the total number of left parentheses of the block;
- $r_i$ is the total number of right parentheses of the block;
- $\overrightarrow{m}_i$ is the minimum excess within the block, obtained by traversing the parentheses array forward from the beginning of the block;
- $\overrightarrow{M}_i$ is the maximum excess within the block, obtained by traversing the parentheses array forward from the beginning of the block;
- $\overleftarrow{m}_i$ is the minimum excess within the block, obtained by traversing the parentheses array backward from the last parenthesis of the block;
- $\overleftarrow{M}_i$ is the maximum excess within the block, obtained by traversing the parentheses array backward from the last parenthesis of the block; and
- $d_i$ is the total number of character data nodes.

In tier 2, each block stores an array of tier 1 tuples $(T^1_1, \ldots, T^1_{k'})$, where $k'$ is the maximum number of tuples allowed per tier 2 block. Each tuple $T^1_j$ for $1 \le j \le k'$ is then defined as $(L_j, R_j, \overrightarrow{m}_j, \overrightarrow{M}_j, \overleftarrow{m}_j, \overleftarrow{M}_j, D_j)$, where:
- $L_j$ is the sum of $l_i$ over all tier 0 tuples in the corresponding tier 1 block;
- $R_j$ is the sum of $r_i$ over all tier 0 tuples in the corresponding tier 1 block;
- $\overrightarrow{m}_j$ is the local forward minimum excess across all of its tier 1 block's tuples;
- $\overrightarrow{M}_j$ is the local forward maximum excess across all of its tier 1 block's tuples;
- $\overleftarrow{m}_j$ is the local backward minimum excess across all of its tier 1 block's tuples;
- $\overleftarrow{M}_j$ is the local backward maximum excess across all of its tier 1 block's tuples; and
- $D_j$ is the total number of character data nodes across all of its tier 1 block's tuples.
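In an implementation, the tuples might look like the following C++ sketch. The field names are ours (the original symbols were typeset in figures), and the chosen widths assume 4 KB tier 0 blocks of 32768 bits:

```cpp
#include <cstdint>

// Summary of one tier 0 block (one tuple in a tier 1 block).
// 16-bit counters suffice because a 4 KB block holds 32768 bits.
struct Tier0Tuple {
    uint16_t lefts;        // l: number of '(' in the block
    uint16_t rights;       // r: number of ')' in the block
    int16_t  fwdMinExcess; // min excess, scanning forward from block start
    int16_t  fwdMaxExcess; // max excess, scanning forward from block start
    int16_t  bwdMinExcess; // min excess, scanning backward from block end
    int16_t  bwdMaxExcess; // max excess, scanning backward from block end
    uint16_t textCount;    // d: number of character data nodes
};

// Summary of one tier 1 block (one tuple in a tier 2 block);
// counters are wider because they aggregate many tier 0 blocks.
struct Tier1Tuple {
    uint32_t lefts, rights;             // L, R: sums of l and r below
    int32_t  fwdMinExcess, fwdMaxExcess;
    int32_t  bwdMinExcess, bwdMaxExcess;
    uint32_t textCount;                 // D: sum of d below
};
```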
Although both tier 1 and tier 2 tuples look similar, the values of $\overrightarrow{m}$, $\overrightarrow{M}$, $\overleftarrow{m}$ and $\overleftarrow{M}$ in tier 2 are calculated differently from those in tier 1. For tier 2, the function TIER2LOCALEXCESS in Algorithm 3 is used to calculate the local minimum/maximum excess, and the calculation is not as trivial as that for tier 1 blocks.
Let $T^1_j$ be a tuple in tier 2 holding the summary information for the tier 0 tuples $T^0_1, \ldots, T^0_k$ of one tier 1 block. To calculate the local forward minimum excess $\overrightarrow{m}_j$, we know the local minimum excess from the first parenthesis covered by $T^0_1$ until the end of $T^0_1$ is equal to $\overrightarrow{m}_1$, so we initially assign this value to $\overrightarrow{m}_j$. We know the excess at the end of $T^0_1$ is $l_1 - r_1$, so the minimum of $\overrightarrow{m}_j$ and $(l_1 - r_1) + \overrightarrow{m}_2$ gives the forward minimum excess from the beginning parenthesis of $T^0_1$ to the end parenthesis of $T^0_2$. Similarly, taking the minimum with $(l_1 - r_1) + (l_2 - r_2) + \overrightarrow{m}_3$ gives the minimum excess between the beginning parenthesis of $T^0_1$ and the end parenthesis of $T^0_3$. Therefore, $\overrightarrow{m}_j$ can be calculated by scanning the tier 1 block's tuples, updating the running excess along the way. Both maximum and minimum forward excesses can be calculated at the same time. For backward excesses, the algorithm is identical, except for the direction of traversal of the tuples.
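The scan just described can be rendered as the following sketch, a simplified take on what Algorithm 3 computes (Tier0Tuple is the struct sketched earlier; a non-empty tuple array is assumed):

```cpp
#include <algorithm>
#include <vector>
// Tier0Tuple as defined in the earlier sketch.

// Compute the local forward min/max excess over a run of tier 0
// tuples, i.e., the values stored in the enclosing tier 2 tuple.
void tier2LocalExcess(const std::vector<Tier0Tuple>& ts,
                      int& minExcess, int& maxExcess) {
    int carry = 0;                         // excess before tuple i
    minExcess = ts[0].fwdMinExcess;
    maxExcess = ts[0].fwdMaxExcess;
    for (const Tier0Tuple& t : ts) {
        minExcess = std::min(minExcess, carry + t.fwdMinExcess);
        maxExcess = std::max(maxExcess, carry + t.fwdMaxExcess);
        carry += t.lefts - t.rights;       // excess at end of tuple
    }
}
```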
In the ISX system, the fixed block size for each tier is 4 kilobytes. Therefore, each tier 0 block can hold up to 32768 bits, and each tier 1 block can hold summaries for up to $k$ tier 0 blocks. Similarly, each tier 2 block can hold up to $k'$ tier 1 blocks, which is equivalent to $k \times k'$ tier 0 blocks. For a 32-bit word machine, there are only 2 tier 2 blocks, but in theory the number of tier 2 blocks still grows linearly with the document size. Therefore, the worst case for navigational accesses is $O(n)$, which is not much of an improvement. Fortunately, it is relatively simple to fix this limitation: instead of having a fixed number of tiers, we generalize the above structure in a straightforward fashion to use $O(\log n)$ tiers. This means that the top-most tier has $O(1)$ blocks, reducing the worst case navigational access time to $O(\log n)$.
FORWARDEXCESS and
BACKWARDEXCESS
return the position of the first parenthesis matching the given
excess within a given range
(in forward and backward direction
respectively).
Using the auxiliary structures (tiers 1 and 2), instead of performing just a linear scan of the tier 0 layer, we can use tier 1 to test whether the position of the parenthesis matching the given excess $e$ lies within the $i$-th tier 0 block, i.e., by checking whether $\overrightarrow{m}_i \le e - e_i \le \overrightarrow{M}_i$, where $e_i$ is the excess between the starting position and the beginning of the $i$-th tier 0 block (excluding the first bit). However, there are potentially $O(n/B)$ tier 1 tuples to scan. Hence, we use tier 2 to find the appropriate tier 1 block within which the matching parenthesis lies, thus reducing the cost to a near constant in practice.
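A sketch of this block-skipping search appears below (hypothetical names; the real FASTFORWARDEXCESS also consults tier 2, which is omitted here for brevity). Whole tier 0 blocks whose summaries show the target excess cannot occur inside them are skipped in constant time each:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
// Bits and Tier0Tuple as defined in the earlier sketches.

// Find the first position at or after 'from' where the running excess
// (relative to 'from') equals 'target'. 'sum' holds one Tier0Tuple per
// tier 0 block, and B is the block size in bits.
size_t fastForwardExcess(const Bits& t,
                         const std::vector<Tier0Tuple>& sum,
                         size_t B, size_t from, int target) {
    int excess = 0;
    size_t block = from / B;
    // Partial first block: scan bit by bit.
    for (size_t i = from; i < std::min((block + 1) * B, t.size()); ++i) {
        excess += t[i] ? -1 : +1;
        if (excess == target) return i;
    }
    for (++block; block < sum.size(); ++block) {
        const Tier0Tuple& s = sum[block];
        // Can the target excess be reached inside this block?
        if (excess + s.fwdMinExcess <= target &&
            target <= excess + s.fwdMaxExcess) {
            for (size_t i = block * B;
                 i < std::min((block + 1) * B, t.size()); ++i) {
                excess += t[i] ? -1 : +1;
                if (excess == target) return i;
            }
        } else {
            excess += s.lefts - s.rights;  // skip the whole block
        }
    }
    return t.size();                       // not found
}
```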
Using the above approach, we can replace the primitives NEXT, FORWARDEXCESS and BACKWARDEXCESS in Algorithm 2 with the improved primitives in Algorithm 4. Furthermore, since the depths of real-world XML documents are generally small (even the depth of the highly nested TreeBank dataset [18] is much less than 100), most matching parentheses lie within the same block, and occasionally are found in neighboring blocks. Therefore, when FASTFORWARDEXCESS is called from navigation operations, we rarely need to access additional blocks in either the auxiliary data structure or the topology bit array. In the worst case, when the matching parentheses lie within different blocks, we only need to read two tier 1 blocks and two tier 2 blocks on a 32-bit word size machine, which is very small in size.
In the ISX system, we also facilitate efficient update operators, such as node insertion. So far, we have treated the tier 0 balanced parentheses encoding as a contiguous array. This scheme is not suitable for frequent updates, as any insertion or deletion of data would require shifting of the entire bit array.
In this section, we present a modification to our storage scheme that changes the space usage from $2n$ bits to $(2 + \epsilon)n$ bits, where $0 < \epsilon \le 1$, so that we can efficiently accommodate frequent updates.
In our approach, we first divide the array into blocks of $B$ bits each, and store the blocks contiguously. Within each block, we leave some empty space, keeping the unused bits at the rightmost portion of each block. Now, we only need to shift $O(B)$ entries per insertion or deletion. We can control the cost of shifting by adjusting the block size.
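A minimal sketch of this gapped block layout follows, assuming fixed-capacity blocks with their free space kept on the right; the names and the full-block error handling are ours:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// One fixed-capacity tier 0 block: 'used' bits occupy positions
// [0, used), and the gap occupies the right-hand side of the block.
struct GappedBlock {
    std::vector<bool> bits;   // capacity B, e.g. 32768
    size_t used = 0;
    explicit GappedBlock(size_t B) : bits(B) {}

    // Insert one parenthesis bit at offset 'at' within the block.
    // Cost is O(B): at most one block's worth of bits is shifted,
    // never the whole array. A full block must first be handled by
    // the redistribution process of Section 4.4.1.
    void insert(size_t at, bool bit) {
        if (used == bits.size())
            throw std::runtime_error("block full: redistribute first");
        for (size_t i = used; i > at; --i) bits[i] = bits[i - 1];
        bits[at] = bit;
        ++used;
    }
};
```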
After the initial loading of an XML document, the empty space allocated to leaf nodes will eventually be used up as more data is inserted into the database. Therefore, we need to guarantee an even distribution of empty bits across the entire parentheses array, so that we can still maintain the above bound on the number of shifts needed for each data insertion. This can be achieved by deciding exactly when to redistribute empty space among the blocks and which blocks are to be involved in the redistribution process.
To better understand our approach, we first visualize these blocks as leaf nodes of a virtual balanced binary trie, with the position of a block in the array corresponding to the path to that block through the virtual binary trie. Figure 5 shows such a trie, where each block $i$ corresponds to the leaf node under the path given by the binary representation of $i$; for example, block 0 lies under the leftmost path and block 3 under the path 011. For each block, we define its density as the number of used bits in the block divided by the block capacity $B$.
Given the above definition of density for leaf nodes, the density of a virtual node is the average density of its descendant leaf nodes. We then control the empty space within all nodes in the virtual binary trie by setting a density threshold range within which the block densities must lie. For a virtual node at depth $d$ in a trie of height $h$, we enforce a density threshold range that varies linearly with $d/h$: the range is widest at the leaves and tightest at the root. For example, the density threshold range for virtual node $v$ in Figure 5 is determined by its depth of 2 and the trie height of 3.
Why do we use the formula above for controlling the density threshold? This is due to two factors: first, in order to guarantee good space utilization, the maximum density threshold of a leaf node should be 1, and the minimum density threshold of the root node should be $1/2$. Secondly, the density thresholds should satisfy the following invariant: the density threshold range of an ancestor node should be tighter than the range of its descendant nodes. This ensures that after a space redistribution for an ancestor node, the density thresholds of all its descendants are also immediately satisfied.
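As an illustration only, the interpolation can be coded as below. The root range [1/2, 3/4] and leaf range [1/4, 1] are assumed constants chosen to be consistent with the two factors above; they are not values taken from the paper:

```cpp
#include <utility>

// Density threshold [min, max] for a virtual trie node at depth d in a
// trie of height h (root depth 0, leaves depth h; 0 <= d <= h, h > 0).
// The ranges interpolate linearly from [1/2, 3/4] at the root to
// [1/4, 1] at the leaves, so every ancestor's range is tighter than
// its descendants', as the invariant above requires.
std::pair<double, double> densityThreshold(int d, int h) {
    double t = static_cast<double>(d) / h;
    double minDensity = 0.50 - 0.25 * t;
    double maxDensity = 0.75 + 0.25 * t;
    return {minDensity, maxDensity};
}
```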
In the worst case, we use 4 bits per node, since the root node can be only half full. Thus, on a $w$-bit word machine, we can store at most $2^{w-2}$ nodes. However, by adjusting the minimum root node density threshold from $1/2$ towards 1, it is possible to store more than $2^{w-2}$ nodes by choosing a smaller $\epsilon$. In practice, $\epsilon$ should be 1, in which case the $(2+\epsilon)n$ bits is in effect $3n$ bits; $\epsilon$ should only be set lower when the document is relatively static.
Notice that although we shift the parentheses within tier 0
during update, we never need to shift the tuples in tier 1
because the same tuple always corresponds to the same tier 0
block, regardless of its density. Therefore unlike tier 0, we do
not need to redistribute tuples within tier 1 (similarly for tier
2) during the update operation.
From Section 4.2, the auxiliary tiers may at first appear to increase the update cost, since moving a node requires updating every tier above it. However, this overhead can be eliminated by updating the upper tiers once per redistribution, instead of once per node. A simple proof then demonstrates that the overall update cost is unaffected, and remains $O(\log^2 n)$ in the worst case.
During the insertions and deletions in a tier 0 block, we simply update the appropriate tuples in the corresponding blocks in the higher tiers. Since the redistribution process we described in Section 4.4.1 can be seen as a sequence of insertions and deletions, the corresponding updates to the auxiliary tiers do not affect the worst case complexity for updates.
With 4 bits used per node including update slack, and using 32-bit words, we can store as many as $2^{30}$ nodes. In our implementation we also chose to use four kilobyte sized blocks. Based on these values, we now discuss the space cost of each component of our storage scheme. Of course, if larger documents need to be stored, we can increase the word size used in the data structure and adjust the bit lengths used in tiers 1 and 2.
Since we only need a maximum of two tier 2 blocks, even for a document with $2^{30}$ nodes, we can keep them in main memory. In fact, the entire tier 1 can also be kept in main memory, since it requires at most a few megabytes at this scale. In summary, the space required by the topology layer is roughly $(2+\epsilon)n$ bits plus the lower-order space for the tier 1 and tier 2 summaries, and the space required by the internal node layer plus the symbol table is roughly $(2+\epsilon)n \lceil \log_2 |S| \rceil$ bits plus the size of the symbol table itself.
We can use the above equations to estimate the space used by an XML file, using as our example a 100 MB copy of DBLP, which contains roughly 5 million nodes. If we assume there are no updates after the initial loading, we can set $\epsilon$ close to 0. According to the equations, we will then have used roughly $2n$ bits, i.e., about 1.25 MB, for the topology layer, plus the internal node layer and the symbol table. This, of course, disregards the space needed for the text data in the document.
Based on the block size $B$, we know the exact size of tuples and tiers in our topology layer. Therefore, given a bit position $p$, we can calculate which tier 0 block the bit belongs to and which tier 1 block contains the summary information for that tier 0 block. For a given $p$, Algorithm 6 lists all the calculations needed to find its resident tier 0 to tier 2 blocks and the index within those blocks to get the summary.
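The arithmetic itself is plain integer division and modulus, as the following sketch shows ($B$, $k$ and $k'$ are the block capacities introduced in Section 4.2; the struct and names are ours, not Algorithm 6 verbatim):

```cpp
#include <cstddef>

// Locating the summaries for a bit position p. B is the tier 0 block
// capacity in bits; k and kPrime are the numbers of tuples per tier 1
// and tier 2 block respectively.
struct BlockAddress {
    size_t tier0Block;   // which tier 0 block p falls in
    size_t bitInBlock;   // p's offset within that block
    size_t tier1Block;   // tier 1 block summarizing that tier 0 block
    size_t tier1Index;   // tuple index within the tier 1 block
    size_t tier2Block;   // tier 2 block summarizing that tier 1 block
    size_t tier2Index;   // tuple index within the tier 2 block
};

BlockAddress locate(size_t p, size_t B, size_t k, size_t kPrime) {
    BlockAddress a;
    a.tier0Block = p / B;
    a.bitInBlock = p % B;
    a.tier1Block = a.tier0Block / k;
    a.tier1Index = a.tier0Block % k;
    a.tier2Block = a.tier1Block / kPrime;
    a.tier2Index = a.tier1Block % kPrime;
    return a;
}
```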
The ISX system is implemented in C++ using the Expat XML parser. In this section, we compare the performance of ISX with other related implementations, namely, XMill [16], XGrind [22], NoK [23] and TIMBER [12]. Experiments were set up to measure various aspects of performance according to the feature matrix of these implementations shown in Table 1.
We used an Apple G5 2.0 GHz machine with 2.5GB RAM and a 160GB 7,200 RPM IDE hard drive. The memory buffer pool of ISX was fixed at 64MB for all the experiments. Three XML datasets were used, namely, DBLP [1], the Protein Sequence Database (PSD) [3], and TreeBank [18]. We found that the experimental results from PSD are very similar to those from DBLP due to their regular, shallow tree structure. Therefore, PSD results are omitted from some plots below for clarity. Large datasets (1GB and above) were generated by repeatedly duplicating and merging the source dataset; e.g., the 16GB DBLP document contains more than 770 million nodes.
Tables 2 and 3 show that XMill has the best compression ratio for both the DBLP and TreeBank datasets. Compared to XMill, which does not support any direct data navigation or queries, XGrind does allow simple path expressions. Consequently, it has a relatively less attractive compression ratio. In fact, XGrind failed to run on the large datasets in our experiments. Both XMill and XGrind have better space consumption as they are primarily designed for read-only data and do not support efficient updates. Furthermore, they only support access to the compressed data in linear time.
Tables 2 and 3 also show that ISX is relatively less sensitive to the structure of the data. The compression ratio of ISX for TreeBank is not as good as for DBLP because TreeBank has text content that is harder to compress (TreeBank text is more random than DBLP's). XMill's compression ratio on TreeBank is relatively much worse than on DBLP due to both the random text content and the more complex tree structure of the data.
The performance comparison of bulk loading using ISX, NoK, TIMBER, XGrind and XMill is shown in Figure 6. For the smaller datasets (up to 500MB of DBLP), Figure 6(a) shows that our ISX system significantly outperforms NoK and TIMBER in loading. It also highlights the scalability of ISX in loading large datasets.
Figure 6: Bulk loading performance. (a) ISX vs. TIMBER and NoK (up to 500 MB data); (b) ISX vs. XGrind and XMill (up to 16 GB data).
To further test the scalability of loading even larger XML documents, we compared the loading time of ISX and other well known systems such as XMill and XGrind on 1 to 16 GB DBLP documents. During the loading process, XGrind failed to load XML documents greater than 100MB. Although Figure 6(b) shows that the loading time for ISX is slower than XMill's, it still exhibits a similar trend (similar scalability). The gap between the two curves is explained by the fact that ISX does not compress the XML data as much as XMill does. This results in a larger storage layer than XMill's, which in turn requires a larger number of disk writes.
When considering the proposed structure as the storage scheme of a full-fledged database system, one must consider its query performance. Figure 7 (with details listed in Table 4) shows the query performance of ISX against the other schemes. Note that the query times in Figure 7 are in logarithmic scale. From this experiment, we found that ISX outperforms the other systems in either the ISX or the ISX Stream (using the TurboXPath approach [13]) mode. The performance of ISX is measured by using binary structural joins to perform the XPath queries, while ISX Stream executes the same queries by scanning the ISX topology layer linearly.
Figure 7: Query performance on DBLP queries Q1-Q6.
Figure 8: (a) Node navigation time (microseconds); (b) full document traversal time (seconds).
To test the performance and scalability of random node navigation, we pre-loaded our XML datasets, and for each database, we randomly picked a node and called the node traversal functions (e.g., FIRSTCHILD, NEXTSIBLING) multiple times. The average access times for these node traversal operations are plotted in Figure 8(a). The graph shows that as the database size gets bigger, the running time for these functions remains constant. This is not surprising, since in general most nodes are located close to their siblings, and hence are likely to be in the same block. For example, it generally takes a scan of only a few bits on average to access either the first child node or the next sibling node. Some operations are faster than others, due to their different implementation complexities (listed in Algorithm 1) and the characteristics of the encoding itself. For instance, as Figure 8(a) shows, FIRSTCHILD performs slightly faster than NEXTSIBLING, because the first child is always adjacent to its parent node, whereas the next sibling might be several nodes away.
With fast traversal operations, ISX can traverse XML data in the proposed compact encoding significantly faster than other XML compression techniques such as XMill, as shown in Figure 8(b). We argue that this feature is important for examining the content of large XML databases or archives.
The worst case for Algorithm 5 happens when nodes are inserted at the beginning of a completely packed database, i.e., with no gaps between blocks. The insertion experiment was set up to measure this average worst case performance by inserting nodes at the beginning of the database. For each experiment, we did multiple runs (resetting the database after each run). The average insertion times (per node) are shown in Figure 9. In Figure 9, we see an initial spike in the execution time for the worst case insertion. This corresponds to the initial packed state of the database, in which case the very first node insertion requires the redistribution of the entire leaf node layer. Clearly, in practice this is extremely unlikely to happen, but the remainder of the graph demonstrates that even this contrived situation has little effect on the overall performance. The graph also shows that the cost of subsequent insertions increases at a rate of approximately $O(\log^2 n)$. In fact, all subsequent insertions up to 100,000 took no more than 0.5 milliseconds each.
Updating the values of nodes does not incur extra processing time beyond the retrieval time for locating the nodes to be updated. In the case of deletion, the reverse sequence of steps for node insertion is performed (freed space is left as gaps to be filled by subsequent insertions).
A compact and efficient XML repository is critical for a wide range of applications such as mobile XML repositories running on devices with severe resource constraints. For a heavily loaded system, a compact storage scheme could be used as an index storage that can be manipulated entirely in memory and hence substantially improve the overall performance. In this paper, we proposed a scalable and yet efficient, compact storage scheme for XML data.
Our data structure is shown to be exceptionally concise, without sacrificing query and update performance. While having the benefit of a small data footprint, experiments have shown that the proposed structure still outperforms other XML database systems and scales significantly better for large datasets. In particular, all navigational primitives run in near constant time. Furthermore, as shown in the experiments, our proposed structure allows direct document traversals and queries that are significantly faster and more scalable than previous compression techniques.