Words 38239

Pages 153


SCHOOL OF COMPUTER SCIENCE & ENGINEERING

ISSUES IN PHYSICAL XML DATABASE DESIGN

Damien Fisher (3065680) Bachelor of Computer Science (Honours)

Supervisor: Dr. Raymond Wong

Submission Date: October 29, 2003

Abstract

Recent years have seen a surge in the popularity of XML, a markup language for representing semistructured data. Some of this popularity can be attributed to the success that the semi-structured data model has had in environments where the relational data model has been insufficiently expressive. It is thus natural to consider native XML databases, which are designed from the ground up to support XML data. Developing a native XML database introduces many challenges, some of which we consider here.

The first major problem is that XML data is ordered, whereas relational databases operate on set-based data. We examine the ordering problem in great detail in this thesis, and show that while it imposes an unavoidable performance penalty on the database, this penalty can be reduced to an acceptable level in practice. We do this by making use of type information, which is often present in XML data, and by improving existing results in the literature.

XML data is frequently queried using XPath, a de facto standard query language. The importance of XPath is increasing, due to its use as a primitive in the more powerful emerging query language, XQuery. It is widely believed that one of the most promising approaches to evaluating XPath queries is through the use of structural joins. Previous work in the literature has found theoretically optimal algorithms for such joins. We provide practical improvements to these algorithms, which result in significant speed-ups.

Accurate selectivity estimation is a critical problem in any database system, due to its importance in choosing optimal query plans. While good estimation techniques have been developed recently for static XML databases, there has been no work on selectivity estimation in the presence of updates. We provide the first dynamic selectivity estimation technique for XML databases, by generalizing existing techniques for the static case, and utilizing recent data stream results. We also highlight the difficulty of the problem by giving a correspondence to a well-known data mining problem, the frequent itemsets problem.


Acknowledgements

AFFIANCED, pp. Fitted with an ankle-ring for the ball-and-chain.
— Ambrose Bierce, The Devil's Dictionary (1881–1906)

I would like to thank Dr Raymond Wong, my supervisor, for his constant support and assistance for the duration of my thesis. He has been a continual font of ideas, and I have learnt a lot about all aspects of working both as part of a research team and as part of the wider research community.

Two other members of the XML research group, Franky Lam and William Shui, have also made invaluable contributions to this thesis. I spent far too many late nights working with them for my own good, and the result is that much of this work (Chapters 2 and 3) was done in conjunction with them. On a practical note, I am particularly grateful for Franky's skill with xfig — many of the diagrams in this thesis were drawn by Franky, or derived from one of Franky's drawings.

Professor John Cannon at the University of Sydney has been a wonderful full-time employer these past two years. My thanks to him (and my co-workers at the Computational Algebra Group) for providing a stimulating work environment, and for his seemingly infinite patience and flexibility while I waded through part-time honours.

I would also like to thank my family, currently spread across three countries, for their support from afar. I certainly would never have gotten to my honours year without them. Last, but certainly not least, I would like to thank my fiancée Chenoa, as without her love, patience, support, and meals, this thesis would have been impossible — being affianced to her is certainly not as bad as Ambrose Bierce makes it out to be.


Related Publications

1. Damien K. Fisher, Franky Lam, William M. Shui, Raymond K. Wong. Efficient Ordering for XML Data. In Proceedings of the Twelfth International Conference on Information and Knowledge Management (CIKM), New Orleans, November 3-8, 2003.

2. Damien K. Fisher, Franky Lam, William M. Shui, Raymond K. Wong. Fast Ordering for Changing XML Data. Submitted to the 9th International Conference on Extending Database Technology (EDBT), Heraklion, Crete, March 14-18, 2004.

3. Damien K. Fisher, William M. Shui, Franky Lam, Raymond K. Wong. Effective Clustering Schemes for XML Databases. Submitted to the 9th International Conference on Database Systems for Advanced Applications (DASFAA), Jeju Island, Korea, March 17-19, 2004.¹

4. Franky Lam, William M. Shui, Damien K. Fisher, Raymond K. Wong. Skipping Strategies for Efficient Structural Joins. Submitted to the 9th International Conference on Database Systems for Advanced Applications (DASFAA), Jeju Island, Korea, March 17-19, 2004.

¹ As the results of this paper are only preliminary, they do not appear in this thesis.


Contents

1 Introduction

2 Maintaining Order
  2.1 Introduction
  2.2 Related Work
    2.2.1 The Order Maintenance Problem
    2.2.2 Ancestor-Descendant Relationships
  2.3 Formal Definitions
    2.3.1 Data Model
    2.3.2 A Naive Sorting Algorithm
  2.4 A Randomized Algorithm
  2.5 Improving Bender's Algorithm
    2.5.1 Overview of Bender's Algorithm
    2.5.2 An Improvement
  2.6 Utilizing Schema Information
    2.6.1 Problem Statement
    2.6.2 Converting a DTD to a CFG
    2.6.3 Linearizing a CFG
    2.6.4 Using a Linearization
    2.6.5 Annotating a Linearization
    2.6.6 Coalescing
    2.6.7 Using an Annotated Linearization
    2.6.8 Issues with Other Schema Languages
  2.7 Experimental Results
    2.7.1 Experiment Set 1: Evaluation of Randomized Algorithm
    2.7.2 Experiment Set 2: Evaluation of all Algorithms
  2.8 Applications
    2.8.1 Ancestor-Descendant Relationships
    2.8.2 Query Optimization
  2.9 Conclusions

3 Efficient Structural Joins
  3.1 Introduction
  3.2 Related Work
    3.2.1 Ancestor-Descendant Labeling Schemes
    3.2.2 Structural Joins
  3.3 Architectural Assumptions
  3.4 Skip Joins
    3.4.1 Skip Joins for Ancestor-Descendant Joins
    3.4.2 Skip Joins for Ancestor Structural Joins
    3.4.3 Skip Join for Descendant Structural Join
    3.4.4 Skipping Strategies
    3.4.5 Skipping For Streaming Data
  3.5 Experimental Results
    3.5.1 Experimental Setup
    3.5.2 Results and Observations
    3.5.3 Summary
  3.6 Label Maintenance for Changing Data
    3.6.1 Theoretical Analysis
    3.6.2 Experimental Justification
  3.7 Conclusions

4 Dynamic Selectivity Estimation
  4.1 Introduction
  4.2 Related Work
    4.2.1 Selectivity Estimation for Relational Databases
    4.2.2 Selectivity Estimation for XML Databases
  4.3 The Difficulty of Selectivity Estimation for Path Expressions
  4.4 Relevant Results from Data Stream Research
    4.4.1 The Distinct Values Problem
    4.4.2 Finding High Frequency Items
  4.5 Selectivity Estimation for Simple Path Expressions
    4.5.1 Problem Definition
    4.5.2 Pruned Path Trees
    4.5.3 Markov Tables
    4.5.4 Dynamic Markov Table Techniques
  4.6 Experimental Results
    4.6.1 Experimental Setup
    4.6.2 Choosing a Query Workload
    4.6.3 Measuring the Synopsis Size
    4.6.4 Experiment 1: Comparing against Static Markov Tables
    4.6.5 Experiment 2: Evaluating Insertion Performance
  4.7 Conclusions

5 Conclusions

A Dynamic Selectivity Experimental Graphs


Chapter 1

Introduction

The beginning is the chiefest part of any work. — Plato (circa 427–347 B.C.)

XML [12] has achieved widespread popularity as a data representation language in a remarkably short period of time, due to its ability to model heterogeneous data in a homogeneous fashion. Along with this rising popularity comes an increasing need to store and query large XML repositories efficiently. This has proved to be an extremely challenging task which has provoked several years of active database research, with XML currently a major research area within the field of databases.

While XML has many peculiarities and subtle issues arising from the design-by-committee approach of the W3C, at heart it is an extremely simple data markup language. Ignoring the many uninteresting subtleties present in the specification, we can describe XML as a markup language for the representation of ordered, unranked, finite, node-labeled trees of data. Figure 1.1(a) gives a sample XML document, with the corresponding logical tree given in Figure 1.1(b). This small XML document represents a portion of a database of books, and highlights several of the reasons XML has proved popular as a data representation language:

• XML is semi-structured: as can be seen from the figure, there is no fixed tuple structure for a record in the database of books. For instance, in the first record, there is a date element, which does not occur in the second record. In a relational database, the standard approach to this variability is the use of optional attributes, implemented through the use of null values. The presence of null values makes query optimization and formulation considerably more difficult. Similarly, while both records have a genre element, and both records refer to science fiction books, the actual text of the elements is different ("Science Fiction" versus "Sci-Fi"). In a traditional relational database, this would complicate the storage and querying of this attribute, because it would most likely have to be stored as a string, instead of, for instance, an enumeration. This heterogeneity also complicates the process of querying the database in a consistent fashion.

• XML is ordered: in the record for Garden of Rama, two authors are listed. In practice, it is frequently the case that the order of the authors is significant, with the first author generally being the principal author. Semi-structured data [1], on the other hand, is unordered, and hence ordering constraints would have to be manually specified by the user, e.g., through the use of a numeric attribute specifying the relative order. Similarly, traditional relational databases operate on unordered sets of tuples, which is a considerably easier task. While ordering could be maintained manually in earlier database systems, the issue was not addressed by the research community until only recently, in the context of XML databases.

• XML is hierarchical: the figure demonstrates a simple two-level XML document which could be mapped into a relational table with little difficulty. However, XML has no restrictions on depth, and trees with three or more levels arise frequently in practice. For example, instead of a simple flat structure as given in the figure, the data could be grouped by publisher or genre. While this could be mapped into a group of relational tables, it is frequently the case in practice that decomposing XML documents into multiple tables leads to a huge number of tables, which can make query processing extremely inefficient.
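The data model just described can be made concrete with a small sketch (our own illustration, not code from the thesis): an ordered, unranked, node-labeled tree, with document order given by a preorder traversal. The child ordering of the books example below is assumed for illustration.

```python
class Node:
    """A node in an ordered, unranked, node-labeled tree."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []  # the order of children is significant

def document_order(node):
    """Return labels in document order, i.e. a preorder traversal."""
    labels = [node.label]
    for child in node.children:
        labels.extend(document_order(child))
    return labels

# The books database of Figure 1.1 (child order assumed for illustration).
books = Node("books", [
    Node("book", [Node("author"), Node("title"), Node("genre"), Node("date")]),
    Node("book", [Node("author"), Node("author"), Node("title"), Node("genre")]),
])

print(document_order(books))
```

Note that an unordered, set-based representation would lose precisely the distinction this traversal preserves: which of the two author children of the second record comes first.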

[Figure 1.1 appears here: (a) an XML document, a books database with two book records, Dune (author Frank Herbert, genre "Science Fiction", date January 2003) and Garden of Rama (authors Arthur C. Clarke and Gentry Lee, genre "Sci-Fi"); (b) the corresponding node-labeled tree rooted at books.]

Figure 1.1: A sample XML document

One of the reasons relational databases have proved to be so effective is that they generally ignore issues such as those outlined above: their simplicity is both their greatest strength and their greatest weakness. Recent years have given rise to a demand for database systems which can manipulate less structured data. A particularly common application is that of data integration, where data from various heterogeneous sources must be queried using a uniform query interface. There will always be a significant place for relational databases in data management systems, as their success in handling a wide range of problems is undeniable. However, data management is at the point where the issues listed above, previously ignored in database systems, must now receive attention. Thus, there has been a tremendous amount of research in recent years dedicated to these issues.

This thesis addresses several issues which arise when querying XML data. Our focus is at the physical, not logical, level. More precisely, in this thesis we attack problems relating to the storage and querying of XML data at a low level, as opposed to higher level issues such as query optimization. We address three separate problems in this domain:

• In Chapter 2, we address the issue of order in dynamic XML databases, which is surprisingly only now being addressed in the literature. Standard XML query languages such as XQuery [93] require query results to be returned sorted into document order, but many standard index structures, such as the B-tree, do not maintain this ordering. We investigate how to implement an efficient ordering operator which can cope with insertions and deletions. We also relate the problem to an important type of query upon XML data, the ancestor-descendant query. Due to the intrinsically ordered nature of XML data, our work should apply to both native and relational XML systems.
• Following on from our work on order, in Chapter 3 we extend previous work to develop an efficient join operator, the structural join, for answering ancestor-descendant queries. The structural join is a crucial operator in evaluating path queries over massive XML data sets, and has been shown to be far more efficient than standard relational techniques; hence, the results of this chapter also have wide applicability. Taking the work of Chapter 2 into account, we also investigate the relative efficiency of two popular schemes for maintaining ancestor-descendant information during insertions and deletions.

• The final chapter, Chapter 4, develops techniques for estimating the selectivity of queries upon a dynamic XML database. Previous work in this area only applies to static XML databases, which is a considerably easier, though still difficult, problem. We first establish the difficulty of the problem by providing a correspondence between it and a well-known data mining problem, the frequent itemsets problem. We then generalize existing selectivity techniques to handle dynamic data, and examine their accuracy empirically. The techniques developed have wide application, as selectivity estimation is a fundamental tool for any query optimization framework. As relational optimizers which make use of a structural join will have a need for XML-specific selectivity estimation, the work of this chapter has applicability to both native and relational XML systems.


Chapter 2

Maintaining Order

Good order is the foundation of all things. — Edmund Burke (1728–1797)

2.1

Introduction

In recent years XML [12] has emerged as the standard for information representation and exchange on the Internet. Since XML data is self-describing, XML is one of the most promising means to define semistructured data [1]. Although XML and semistructured data are similar, there are some well known differences [33, 52], and the most significant of these concerns data ordering [55]. Researchers have already addressed the issue of order at the data model and query language level [33, 52] when adapting their work on semistructured data to XML. Although emerging standard XML query languages (e.g., XPath 2.0 [92] and XQuery [93]) require the output of queries to be in document order by default, most research work, from optimizing XML queries [78] to publishing data in XML [41], has ignored the issue of efficiently maintaining results in document order.

To produce results in document order without an efficient sort operator, each query operator involved in query processing will have to preserve order. This limits the kinds of indices that can be used, and hence the number of ways in which a query can be evaluated. Furthermore, methods from query optimization for unordered, semistructured data, e.g., [78], cannot be reused to handle ordered data efficiently.

For example, consider the XML database shown in Figure 2.1 (in the figure, each node's identifier is of the form ⟨oid: value⟩, where oid is unique to each node). The document order on the tree corresponds to the ordering constructed via a preorder traversal. Suppose the user wishes to find all hard drives manufactured by "ABC", and which cost less than $200; in this case, they might issue the following XPath query:

//Harddisks/Item[Price < 200][Brand="ABC"]

Ignoring order for the moment, we may employ indices and optimization techniques such as those proposed in the Lore database system [77, 78].
For example, we assume that a T-index (a hash index on the text values) and a V-index (a B+-tree index on numeric values) have been created. Then optimal query plans may include the following two possibilities:

1. hash("ABC")/parent::Brand/parent::Item[parent::Harddisks][Price < 200]

2. bptree( ...

Figure 2.2: An excerpt from the DBLP DTD.

When we want to model the empty string, we will follow the normal convention, and use the special symbol ε, which we implicitly add to the set of terminals of the language. We will ignore this technicality throughout our presentation, and simply use ε where convenient.

We now give the straightforward process of converting a DTD to a CFG which is, for our purposes, equivalent to the DTD. For simplicity, we only consider elements, ignoring other node types such as attributes and processing instructions. Similarly, because we are uninterested in data values, we ignore occurrences of #PCDATA in the DTD. With these simplifications in mind, we define a DTD as follows:

Definition 2.8 (DTD) Let 𝒯 be the countably infinite set of element types. Then a DTD is a finite set {⟨t, r⟩ | t ∈ T}, where T is a finite subset of 𝒯, and each r is a regular expression over T.

The only subtle issue remaining with this definition is the #ANY keyword, which can represent any combination of elements. To handle this, we adjoin to T above a special element type α, which represents all element types not appearing in the DTD, i.e., all element types in 𝒯 − T. We replace any occurrence of #ANY with the regular expression (α | t1 | ... | tn)*, where {t1, ..., tn} = T. Similarly, we add a rule ⟨α, (α | t1 | ... | tn)*⟩ to the DTD, so that α is a well-defined element type. It is clear that this definition preserves the semantics of #ANY. Herein, we assume that, if necessary, the rule for α has been added to the DTD, and α has been added to its set of types T.
We note that, at this point, our DTD only refers to a finite set of elements, because we have merged the remaining (infinite) number of element types together into a single, special element "type", α. Thus, we have in essence made the alphabet for our CFG finite, and hence ensured that the linearization graph we construct is finite.

From this set of rules, it is easy to construct the desired CFG. The set of terminals Σ becomes the set of types T in the DTD. For each rule ⟨t, r⟩ in the DTD, we define a rule Rt → t Sr, where Sr is a variable defining the regular expression r. The variable Sr is obtained from r by first converting r from a regular expression over T to a regular expression r′ over RV = {Rt | t ∈ T}, using the obvious isomorphism. Using well-known techniques, we then convert r′ to a context free grammar ⟨Vr, RV, Rr, Sr⟩. Finally, we obtain our complete CFG as:

⟨ ⋃t∈T {Rt} ∪ ⋃r Vr ,  T ,  ⋃t∈T {Rt → t Sr} ∪ ⋃r Rr ,  S ⟩

The only remaining problem is to define the start symbol S in the definition above. Unfortunately, DTDs do not have a "start element" we can map to S. However, in practice, we can always find an element which appears as the root element in every document designed to conform to the DTD, and if this element is t, then we set S = Rt. Alternatively, we could follow the DTD semantics more precisely and define a start symbol with additional rules {S → Rt | ∀t ∈ T}, so that all DTD types are permitted to be the root type; as it turns out, the choice makes no difference to the construction. We can summarize the above in the following theorem, whose proof follows directly from the construction:

Theorem 2.3 Given a DTD, we can obtain a CFG, such that if a document D is accepted by the DTD then the document, when considered as a list of nodes in document order, is accepted by the CFG. Conversely, for any string accepted by the CFG, there exists a document accepted by the DTD which, when considered as a list of nodes in document order, is equivalent to the string.

Thus, we have found a construction algorithm which solves the first part of the linearization problem for DTDs. As an example of the translation process in action, consider the fragment of the DBLP DTD given in Figure 2.2. The translation of this DTD fragment into a set of regular expressions is given in Figure 2.3, and using our translation process plus a few simplifications, we obtain the CFG of Figure 2.4. We will use this as a running example throughout the remainder of this chapter.

dblp = (article | inproceedings | ...)*
article = (author | editor | ...)*
inproceedings = (author | editor | ...)*
author = ε
editor = ε

Figure 2.3: The DTD fragment of Figure 2.2, converted into a set of regular expressions.

R1 → dblp R2
R2 → ε | R2 R3 | R2 R4
R3 → article R5
R4 → inproceedings R5
R5 → ε | R5 author | R5 editor

Figure 2.4: The DTD fragment of Figure 2.2, converted into a CFG.
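For content models of the simple shape used in this fragment (each element's children drawn from a starred disjunction (t1 | ... | tn)*, or empty), the translation to a CFG can be sketched as follows. This is our own illustrative code, not the thesis's implementation; the dictionary encoding of the DTD and the R_t/S_t variable naming are assumptions.

```python
def dtd_to_cfg(dtd):
    """dtd maps each element type to the list of element types allowed in its
    (t1 | ... | tn)* content model.  Returns productions as a dict mapping a
    variable to a list of alternatives, each a list of symbols ([] is epsilon)."""
    rules = {}
    for t, children in dtd.items():
        rt, st = "R_" + t, "S_" + t
        rules[rt] = [[t, st]]  # R_t -> t S_t: the tag, then the content
        # S_t -> eps | S_t R_c1 | ... | S_t R_cn, mirroring Figure 2.4.
        rules[st] = [[]] + [[st, "R_" + c] for c in children]
    return rules

# The DBLP fragment of Figure 2.3, encoded as a dict.
dblp = {
    "dblp": ["article", "inproceedings"],
    "article": ["author", "editor"],
    "inproceedings": ["author", "editor"],
    "author": [],
    "editor": [],
}

cfg = dtd_to_cfg(dblp)
```

Up to variable names, `cfg["R_dblp"]` and `cfg["S_dblp"]` correspond to the rules R1 and R2 of Figure 2.4. A full implementation would also have to handle arbitrary regular expressions and the α type introduced for #ANY.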

2.6.3

Linearizing a CFG

We now consider the second part of the linearization problem for DTDs: given a CFG, how do we construct its linearization graph? Fortunately, it is easy to give a recursive construction algorithm for this task. Throughout this section, we will fix a CFG ⟨V, Σ, R, S⟩ for which we wish to find the linearization graph.

Our first step in this task is to define a few useful functions which we will make use of in our construction algorithm. The first function we define is ACCEPTS-EMPTY, which, given a CFG, determines whether it accepts the empty string or not. The pseudocode for this function can be found in Algorithm 2.6. Essentially, a simple dynamic programming approach is used, which determines whether each rule accepts the empty string. We note that this function, and the others we define next, only work correctly on CFGs which accept at least one finite string. As CFGs which do not accept any finite strings are pathological, and not of the kind we are interested in (CFGs constructed from DTDs are guaranteed to accept at least one finite string), we will ignore these cases in this section. The proof of correctness is easy:

Lemma 2.5 Algorithm 2.6 returns true if and only if a CFG accepts the empty string.

Proof: Firstly, Algorithm 2.6 must always terminate, because it visits each variable at most once, and the number of variables is finite. Now suppose that Algorithm 2.6 returns true. In this case, if we trace out its execution path, we get a valid parse tree for the empty string, and hence the grammar must accept the empty string. Conversely, suppose the grammar accepts the empty string, so that there is a parse tree matching it. We can be guaranteed to find a parse tree such that in every root-to-leaf path, each variable occurs only once. If this were not the case, then for each such path V1 → V2 → ... → V → ... → V → ... → Vn, we can delete the portion of the tree corresponding to the part of the path between the two non-unique occurrences of V, and still have a valid parse tree for the empty string. Hence we are left with a parse tree which corresponds to a traversal of the variables in the grammar, where every variable is visited at most once in each path. From this parse tree, we can construct an execution path through the algorithm.

Using the function ACCEPTS-EMPTY, we now define a function, FIRST, which, given a CFG, finds the set of terminals with which all strings accepted by the CFG must begin. The pseudocode is given in Algorithm 2.7; as the correctness proof for this algorithm is very similar to that of ACCEPTS-EMPTY, it is omitted. Finally, we define a function LAST, which finds the set of terminals with which all strings accepted by the CFG must end. As this algorithm is virtually identical to that of Algorithm 2.7 (instead of the ascending iteration in line 5, a descending iteration is performed), it is omitted.

While we have only defined FIRST and LAST to return the sets of terminals corresponding to a particular variable (in particular, the start variable), we can extend these functions to arbitrary strings over V ∪ Σ as follows: if we have such a string s, then add a rule to the grammar T → s, where T is some fresh symbol

Algorithm 2.6: Determines whether a CFG accepts the empty string.

ACCEPTS-EMPTY(⟨V, Σ, R, S⟩)
1  EMPTY[v] ← unknown ∀v ∈ V   (EMPTY is a global variable)
2  ACCEPTS-EMPTY-RECURSE(⟨V, Σ, R, S⟩, S, ∅)
3  return EMPTY[S]

ACCEPTS-EMPTY-RECURSE(⟨V, Σ, R, S⟩, curr, seen)
1  if EMPTY[curr] ≠ unknown ∨ curr ∈ seen then
2      return
3  seen ← seen ∪ {curr}
4  for each rule curr → a1 ... an ∈ R do
5      if n = 1 ∧ a1 = ε then
6          EMPTY[curr] ← true
7          return
8      for i ∈ {1, ..., n} do
9          if ai ∈ V then
10             ACCEPTS-EMPTY-RECURSE(⟨V, Σ, R, S⟩, ai, seen)
11             if EMPTY[ai] ≠ true then
12                 break      (this rule cannot derive ε; continue with the next rule at line 4)
13         else
14             break          (ai is a terminal, so this rule cannot derive ε)
15     EMPTY[curr] ← true     (reached only when the loop at line 8 runs to completion)
16     return
17 EMPTY[curr] ← false

not currently in V. Then, run FIRST and LAST on the context free grammar ⟨V ∪ {T}, Σ, R ∪ {T → s}, T⟩, and use these sets as the result for this string. Thus, given a string of terminals and variables, FIRST and LAST return the sets of terminals with which strings accepted by the rule defined by that string must begin or end.

We are now in a position to give the construction algorithm for obtaining the linearization graph from a CFG. In essence, we recursively build the edge set for the graph from the edge sets for the linearization of subexpressions in the CFG. The entire process is given in Algorithm 2.8. Note that we only need to construct the edge set incrementally, as we can leave the vertex set of the graph to be the entire alphabet Σ throughout. We summarize the correctness of this construction in the following theorem:

Theorem 2.4 Algorithm 2.8 constructs the linearization graph for a CFG.

The proof of this theorem follows directly from the fact that running the extraction function E on the language defined by the grammar is equivalent to the process which happens in line 15 of the algorithm.
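As a sanity check on these two functions, here is a compact re-implementation (our own sketch; it uses a fixpoint iteration rather than the recursive formulation of Algorithms 2.6 and 2.7) of the ACCEPTS-EMPTY and FIRST computations, run on the grammar of Figure 2.4.

```python
# Grammar of Figure 2.4: keys are variables, everything else is a terminal;
# each alternative is a list of symbols, [] denoting epsilon.
GRAMMAR = {
    "R1": [["dblp", "R2"]],
    "R2": [[], ["R2", "R3"], ["R2", "R4"]],
    "R3": [["article", "R5"]],
    "R4": [["inproceedings", "R5"]],
    "R5": [[], ["R5", "author"], ["R5", "editor"]],
}

def nullable(grammar):
    """Set of variables that can derive the empty string (cf. ACCEPTS-EMPTY)."""
    null, changed = set(), True
    while changed:
        changed = False
        for v, alts in grammar.items():
            if v not in null and any(all(s in null for s in a) for a in alts):
                null.add(v)
                changed = True
    return null

def first_sets(grammar):
    """FIRST[v]: terminals that can begin a string derived from v (cf. Algorithm 2.7)."""
    null = nullable(grammar)
    first = {v: set() for v in grammar}
    changed = True
    while changed:
        changed = False
        for v, alts in grammar.items():
            for alt in alts:
                for s in alt:
                    new = first[s] if s in grammar else {s}
                    if not new <= first[v]:
                        first[v] |= new
                        changed = True
                    if s not in null:
                        break  # later symbols cannot begin a string derived from v
    return first
```

On this grammar, R2 and R5 (but not the start variable R1) are nullable, and FIRST[R1] = {dblp}, matching what one reads off Figure 2.4 by hand.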

2.6.4

Using a Linearization

Once we have a linearization, it is a simple matter to extract document ordering information from it. We construct the graph of strongly connected components G′ for the linearization G, using standard techniques [26]. It is clear that G′ is a directed acyclic graph. The following lemma easily follows from the definition of the linearization graph (Definition 2.6):

Lemma 2.6 Let G = ⟨V, E⟩ be the linearization of a DTD, and G′ = ⟨V′, E′⟩ be the graph of strongly connected components of G. Then for any document D validated by the DTD, for any nodes x, y ∈ D, if [t(y)] is reachable from [t(x)] in G′, then x < y in document order, where t(x) is the type of x and [v] ∈ V′ is the strongly connected component of vertex v ∈ V in G.

Thus, ordering information can be easily computed at database creation time by traversing G′, and storing the relative ordering between the strongly connected components in an array of size O(|V′|²).

Algorithm 2.7: Determines with which terminals a string accepted by a CFG must begin.

FIRST(⟨V, Σ, R, S⟩)
1  ACCEPTS-EMPTY(⟨V, Σ, R, S⟩)   (we make use of the EMPTY global variable)
2  FIRST[v] ← ∅ ∀v ∈ V
3  FIRST-RECURSE(⟨V, Σ, R, S⟩, S, ∅)
4  return FIRST[S]

FIRST-RECURSE(⟨V, Σ, R, S⟩, curr, seen)
1  if curr ∈ seen then
2      return
3  seen ← seen ∪ {curr}
4  for each rule curr → a1 ... an do
5      for i ∈ {1, ..., n} in ascending order do
6          if ai ∈ V then
7              FIRST-RECURSE(⟨V, Σ, R, S⟩, ai, seen)
8              FIRST[curr] ← FIRST[curr] ∪ FIRST[ai]
9              if EMPTY[ai] ≠ true then
10                 break
11         else
12             FIRST[curr] ← FIRST[curr] ∪ {ai}
13             break

More sophisticated techniques can be used if |V′| is large, although in practice it generally is not. We do not know the relative order of nodes lying within the same strongly connected component. In this case, we can apply a document ordering algorithm such as Bender's algorithm; to see why this works, it is sufficient to note that each strongly connected component is a contiguous block in document order, because the document ordering on nodes carries through to an ordering on the strongly connected components (by Lemma 2.6). Note that we can apply a different instance of the algorithm to each strongly connected component, which can both reduce the size of the tag and decrease the cost of updates.

As an extension, if we have selectivity information indicating the approximate size of each strongly connected component, then we can use this information to choose the optimal algorithm for each component. For instance, for very large components it may be suitable to choose the O(1) variant of Bender's algorithm, while for smaller components we can choose an algorithm with less space overhead.

2.6.5 Annotating a Linearization

While the scheme described so far has been very general, it has some limitations in practice. We have found that many schemas used in the real world are significantly looser than they might ideally be, and that this inexactness propagates itself into the analysis above. For instance, consider the linearization in Figure 2.5. In this case, it is clear that there are only two strongly connected components, one consisting of the single node type dblp. Clearly, this is not particularly useful information because, as dblp elements are the root elements in practice, we have only been able to deduce the relative order of a single node.

However, we can still use type information in a heuristic manner which gives good practical performance in many cases. In Figure 2.5, we have annotated the linearization with dotted edges, which represent that, in our particular database system, from any node we can access its parent node. In fact, we also have an accessor which allows any parent node to access its first child. This accessor does not appear in Figure 2.5, because the DTD does not give us enough information to determine what the first child is. For instance, consider a node of type article. From the DTD, we are unable to infer whether this node will even have a first child. On the other hand, we always know that any node (except dblp) will have a parent. We will now formalize the notion of an acceptable accessor:

Definition 2.9 If G = ⟨V, E⟩ is the linearization graph, then an accessor is a map φ : V1 → V2,

Algorithm 2.8: Given a CFG, return the edge set for the corresponding linearization. LINEARIZE( V, Σ, R, S ) 1 return LINEARIZE-RECURSE( V, Σ, R, S , S, ∅) LINEARIZE-RECURSE( V, Σ, R, S , a1 a2 . . . an , seen) 1 if n = 1 ∧ a1 ∈ V then 2 if a1 ∈ seen then 3 return 4 seen ← seen ∪ {a1 } 5 E←∅ 6 for each rule curr → a′ . . . a′ ′ do 1 n 7 E ← E ∪ LINEARIZE-RECURSE( V, Σ, R, S , a′ . . . a′ ′ ) 1 n 8 return E 9 elif n ≤ 1 then 10 return ∅ 11 else 12 m ← ⌊n⌋ 2 13 E1 ← LINEARIZE-RECURSE( V, Σ, R, S , a1 . . . am ) 14 E2 ← LINEARIZE-RECURSE( V, Σ, R, S , am+1 . . . an ) 15 return E1 ∪ E2 ∪ { x, y | ∀x ∈ LAST(a1 . . . am ), ∀y ∈ FIRST(am+1 . . . an )}

where V1, V2 ⊆ V, such that ∀x ∈ V1, φ(x) can be retrieved from x.

Not all accessors are particularly interesting in terms of document ordering. The interesting accessors are those that allow us to infer some information about the relative ordering between nodes. For instance, the parent accessor is useful because, if two nodes have different parents, then clearly their relative order is related to the relative order of their parents. Thus, we are interested in order preserving accessors:

Definition 2.10 An order preserving accessor is an accessor φ : V1 → V2 such that ∀x, y ∈ V1, if x < y then φ(x) ≤ φ(y).

The accessors available to us differ depending upon the underlying implementation. As an example, we define the parent accessor for a particular element type t. Let V1 be the set of all element types that appear alongside t in regular expressions in the DTD (that is, V1 is the set of all possible sibling types of t), and V2 be the element types whose defining regular expressions contain an element type in V1. Then in our implementation there is a map φ : V1 → V2 which yields the parent of any v ∈ V1. This accessor is illustrated in Figure 2.5 — the regions surrounded by dotted lines are the domain and codomain of this accessor, with the dotted edges representing the movement through the accessor.

While we are interested in order preserving accessors, we are not interested in all of them. For instance, suppose it is possible to access the root node of the database. Then it would be trivial to define an accessor from the set of all element types V which accessed the root. This would not be a terribly useful accessor in terms of document ordering, however, because ∀x, y ∈ V, the images under the map would be equal, and hence traversing the accessor would never yield additional information about the relative order of x and y. In this case, the cost of using the accessor outweighs the benefits.
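Definition 2.10 can be checked mechanically on concrete data. The following sketch is illustrative only (the node dictionaries and the `parent` map are stand-ins, not our system's actual structures): it verifies that a parent accessor is order preserving on a tiny tree, using preorder position as the document order.

```python
def is_order_preserving(nodes, phi, pre):
    """Check Definition 2.10 on a concrete set of nodes: whenever x < y in
    document order (given here by preorder position `pre`), we must have
    pre(phi(x)) <= pre(phi(y))."""
    xs = sorted(nodes, key=pre)
    images = [pre(phi(x)) for x in xs]
    return all(a <= b for a, b in zip(images, images[1:]))

# A tiny tree, nodes as dicts; the parent accessor is order preserving.
root = {'pre': 0, 'parent': None}
a = {'pre': 1, 'parent': root}
b = {'pre': 4, 'parent': root}
leaves = [{'pre': 2, 'parent': a}, {'pre': 3, 'parent': a},
          {'pre': 5, 'parent': b}]
parent = lambda n: n['parent']
```

Here `is_order_preserving(leaves, parent, lambda n: n['pre'])` holds, whereas a map that sent an earlier leaf to a later parent would fail the check.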
Therefore, we introduce a threshold parameter, controlled by the database creator, which specifies the maximum acceptable cost of an accessor. This cost could be measured in one of several ways, and is mainly dependent upon the underlying implementation. For the rest of this chapter, we will use the cost function that was best suited to our implementation — the probability of incurring a disk access by using an accessor. Under this cost measure, an accessor which frequently accesses objects lying in the same page scores more highly than an accessor which accesses objects in other pages. For a given set of accessors, we assume we have the approximate probability of a disk access incurred by following each accessor. This information could come from one of several sources:

• The clustering subsystem of the database;

Figure 2.5: Linearization of the DTD fragment of Figure 2.2 (a graph over the node types dblp, article, inproceedings, author, and editor).

• The database creator; or
• An analysis of a small representative sample of the data to be stored in the database.

We do not believe that in practice it is difficult to produce these cost estimates. For instance, if DBLP is stored in document order, then it is fairly obvious that each record (where a record is a complete subtree rooted at article, inproceedings, etc.) will generally lie on a single page, and hence for intra-record accesses the likelihood of a disk access is quite low.

2.6.6 Coalescing

Once we have a set of accessors, along with their corresponding costs, we coalesce portions of the linearization. Informally, we merge nodes which have no accessors between them, and which lie in the same strongly connected component. The motivation is that, as we are unable to infer any information about the relative ordering of nodes in the database corresponding to these types, there is no point in maintaining them as separate nodes in the linearization.

Given a set of accessors A, we coalesce using a simple greedy heuristic. Firstly, we remove any accessor φ : V1 → V2 ∈ A such that V1 or V2 is not strictly contained in a strongly connected component of the linearization. We then repeat the following steps until A is empty:

1. Remove the accessor φ : V1 → V2 ∈ A which has the smallest cost, and annotate the graph with φ.
2. Merge the nodes in V1 into a single node, and similarly merge the nodes in V2 into another node in the graph.
3. Remove any φ′ : V1′ → V2′ ∈ A where either V1′ or V2′ has a non-empty intersection with either V1 or V2.

The process of running this procedure on the linearization of Figure 2.5 is shown in Figure 2.6. In this example, we assigned a high cost to the accessor from article and inproceedings to dblp, because this almost always incurred a disk read. Similarly, a low cost was assigned to the accessor from author and editor to article and inproceedings, because these frequently lay on the same disk page, as we clustered the nodes in the database into document order.
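The greedy heuristic can be sketched as follows. Everything in this sketch is an illustrative assumption: the cost values, the set encoding, and in particular the reading of "strictly contained" as "every type in the set lies in a single SCC".

```python
def coalesce(accessors, scc_of):
    """Greedy coalescing heuristic (a sketch). `accessors` is a list of
    (cost, V1, V2) triples with V1, V2 frozensets of element types;
    `scc_of` maps a type to its strongly connected component."""
    usable = [
        (cost, v1, v2)
        for cost, v1, v2 in accessors
        if len({scc_of[t] for t in v1}) == 1
        and len({scc_of[t] for t in v2}) == 1
    ]
    usable.sort(key=lambda t: t[0])      # cheapest accessor first
    annotations, merged_nodes, used = [], [], set()
    for cost, v1, v2 in usable:
        if used & (v1 | v2):             # touches an already-merged node: drop
            continue
        annotations.append((cost, v1, v2))
        merged_nodes += [v1, v2]         # each set becomes one graph node
        used |= v1 | v2
    return annotations, merged_nodes

# The DBLP example: dblp is its own SCC; the record types share another.
scc = {'dblp': 0, 'article': 1, 'inproceedings': 1, 'author': 1, 'editor': 1}
cheap = (0.05, frozenset({'author', 'editor'}),
         frozenset({'article', 'inproceedings'}))
dear = (0.95, frozenset({'article', 'inproceedings'}), frozenset({'dblp'}))
```

Running `coalesce([dear, cheap], scc)` keeps only the cheap author/editor accessor and drops the expensive accessor to dblp, mirroring the outcome shown in Figure 2.6.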

2.6.7 Using an Annotated Linearization

Using an annotated linearization is similar to using a linearization. We use the techniques described in previous sections when comparing nodes belonging to two different strongly connected components.

Figure 2.6: Linearization of the DTD fragment of Figure 2.2 after coalescing (nodes dblp, article|inproceedings, and author|editor).

When the nodes lie in the same strongly connected component, however, we now check whether they lie in the same node in the linearization (which may be the result of merging several types). If they do, and this node has an available accessor, we first traverse the accessor and determine the relative ordering under the image of the accessor. Only in the event that this does not resolve the relative order of the two nodes do we compare them using their own tag values.

This scheme works because we have split portions of the strongly connected components in such a way that we can again apply a different instance of the document ordering algorithm. For instance, in Figure 2.6, we can apply a separate instance of the document ordering algorithm to each node in the graph. In fact, for the author|editor node, we apply a separate instance of the ordering algorithm to each set of nodes S such that ∀x, y ∈ S, φ(x) = φ(y). It is easy to see that such a set of nodes must be contiguous (in document order), from the definition of an order preserving accessor. More precisely, for any node in the graph with an outgoing accessor (the heuristic above ensures there is at most one outgoing accessor), we apply a separate instance of an ordering algorithm to each set of nodes S such that ∀x, y ∈ S, φ(x) = φ(y). Thus, we have refined the sets upon which we maintain ordering information, at the cost of a more complicated query mechanism.
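The comparison procedure for two nodes in the same merged linearization node can be sketched as below. The node dictionaries, `parent`, and `tag` are illustrative stand-ins for the real accessor and for the labels assigned by each per-group ordering-algorithm instance.

```python
def compare_same_node(x, y, phi, tag):
    """Compare the document order of two database nodes whose types were
    merged into the same linearization node (a sketch). `phi` is the node's
    single outgoing order-preserving accessor (or None); `tag` reads a node's
    label within its own ordering-algorithm instance."""
    if x is y:
        return 0
    if phi is not None:
        ix, iy = phi(x), phi(y)
        if ix is not iy:                 # images differ: their order decides
            return -1 if tag(ix) < tag(iy) else 1
    # same image (or no accessor): x and y share an ordering instance
    return -1 if tag(x) < tag(y) else 1

# author/editor nodes compared through the parent accessor:
p1, p2 = {'tag': 1}, {'tag': 2}
u = {'tag': 7, 'parent': p1}             # tags restart per parent group
v = {'tag': 1, 'parent': p2}
w = {'tag': 9, 'parent': p1}
parent = lambda n: n['parent']
tag = lambda n: n['tag']
```

Note that `u` precedes `v` even though its local tag is larger, because the accessor resolves the comparison at the parent level; only siblings fall back to their own tags.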

2.6.8 Issues with Other Schema Languages

While we have only discussed the linearization of DTDs, it is not difficult to conceive of similar strategies for dealing with other popular schema languages, such as XML Schema. However, there are a few issues that are worth raising:

• It is possible, and perhaps desirable, that a schema language might offer the user the ability to specify that certain parts of an XML document are set oriented, instead of list oriented, and hence that ordering information should not be maintained for these portions. For example, in the DBLP database, while ordering might be important within a record, it is unlikely to be important outside the record. It would be easy to accommodate such a feature within our framework, but it should be noted that sometimes it may be easier to maintain ordering information in unordered regions anyway. For instance, suppose we have a linked list of elements O1 → U1 → U2 → O2, where ordering information must be maintained for O1 and O2, but not for U1 and U2. Unless there is a way to navigate directly from O1 to O2, it will probably be faster to maintain ordering information for U1 and U2 too, since they must be traversed when moving between O1 and O2.

• Some schema languages, notably XML Schema, allow the use of local types; that is, one element might have distinctly different types, depending on its local context. This feature can be easily incorporated into our overall framework by changing the vertex set of the linearization graph from the set of all elements to the set of all unique local types.

Figure 2.7: Results for database of 100 records (time in seconds for the O(1), O(log n), and Random algorithms at 0×, 10×, and 100× #Reads).

2.7 Experimental Results

We performed two sets of experiments, each measuring diﬀerent aspects of the algorithms presented above. In our ﬁrst set of experimental results, we evaluated the quality of the average case for the randomized algorithm above. In the second set, we evaluated the worst case of all the algorithms. All experiments were run on a dual processor 750 MHz Pentium III machine with 512 MB RAM and a 30 GB, 10,000 rpm SCSI hard drive.

2.7.1 Experiment Set 1: Evaluation of Randomized Algorithm

We performed our first set of experiments using the DBLP database. We tested both Bender algorithms (the O(1) and O(log n) variants) and the randomized algorithm of Section 2.4. For each algorithm, we inserted 100, 1000, and 10,000 DBLP records into a new database. The insertions were done in two stages. The first half of the insertions were appended to the end of the database, and hence simulated a bulk load. The second half of the insertions were done at random locations in the database; that is, if we consider the document as a linked list in document order, the insertions happened at random locations throughout the list — this stage simulated further updates upon a pre-initialized database.

While the inserts were distributed over the database, at the physical level the database records were still inserted at the end of the database file. This resulted in a database which was not clustered in document order, which meant that traversing the database in document order could incur many disk accesses. We hypothesize that while many document-centric XML databases will be clustered in document order, data-centric XML databases will not be, as they will most likely be clustered through the use of indices such as B-trees on the values of particular elements. Hence, our tests were structured to simulate these kinds of environments, in which the document ordering problem is more difficult.

At the end of each set of insertions, there were n records in the database, where n ∈ {100, 1000, 10,000}. We then additionally performed 10n and 100n reads upon the database. Each read operation chose two random nodes from the database and compared their document order. The nodes were not chosen uniformly, as this does not accurately reflect real-world database access patterns. Instead, in order to

Figure 2.8: Results for database of 1000 records (time in seconds for the O(1), O(log n), and Random algorithms at 0×, 10×, and 100× #Reads).

emulate the effect of "hot-spots" commonly found in real-world database applications, we adopted a normal distribution with mean n/2 and variance n/10.

Figures 2.7 through 2.9 show the results from our experiments. There are several interesting things to note. Firstly, the O(1) algorithm of Bender is noticeably slower than the O(log n) algorithm, even in the case where no reads were performed. Investigation revealed that this was because the O(1) variant's maintenance of the two level ordering structure did not match the access patterns present in the experiment: during relabelings the ordering structure is accessed in document order, but 50% of the insertions occurred at random positions in the document, so the insertion performance of the algorithm was seriously affected. The relative performance gap also becomes more noticeable as the number of reads increases, due to the extra level of indirection imposed in the comparison function by the O(1) algorithm.

Secondly, the performance of our randomized algorithm is clearly encouraging, as it is ahead of the other algorithms by a comfortable margin in all tests. The performance gap between the other algorithms and the randomized algorithm also increases as the number of records increases, which indicates that our algorithm performs better on large databases than the others.
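The hot-spot read distribution above can be sketched as follows. The function name and the clamping to the valid range are our own illustrative choices; note that Python's `random.gauss` takes a standard deviation, hence the square root of the variance.

```python
import random

def hotspot_read(n, rng):
    """Pick a node position for a read, emulating a database hot-spot:
    a normal distribution with mean n/2 and variance n/10 (standard
    deviation sqrt(n/10)), clamped to the valid range [0, n)."""
    v = int(rng.gauss(n / 2, (n / 10) ** 0.5))
    return min(max(v, 0), n - 1)
```

Each read operation would then draw two positions, e.g. `rng = random.Random(42); x, y = hotspot_read(1000, rng), hotspot_read(1000, rng)`, and compare the corresponding nodes' document order.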

2.7.2 Experiment Set 2: Evaluation of all Algorithms

Experimental Setup

We performed several experiments to determine the effectiveness of the various algorithms covered in this chapter. We used the DBLP database as our data set. Because there are, at the time of this writing, very few native XML databases available, we performed our experiments on a custom-written, simple, disk-bound database supporting neither transactions nor concurrency. While the database kernel code was not optimized, we believe that our results give good indications of performance in other native XML databases.

Our experiments were designed to investigate the worst case behavior of each of the document maintenance algorithms implemented. We chose to focus on the worst case for several reasons. Firstly, our previous work found an algorithm which has very good average case performance (for a particular definition of average), but very poor worst case performance. As our new work has good worst case performance, we emphasize this in our experiments. More practically, however, at the time of writing

Figure 2.9: Results for database of 10,000 records (time in seconds for the O(1), O(log n), and Random algorithms at 0×, 10×, and 100× #Reads).

it is very difficult to determine a suitable "average" case for usage of document ordering indices, simply because XML is still an immature technology.

For each experiment, we investigated the performance of four different algorithms:

1. The randomized algorithm of Section 2.4;
2. The O(1) and O(log n) variants of Bender et al [11];
3. The randomized variant of Bender's algorithm, described in Section 2.5. For this algorithm, we used values of c from the set {5, 10, 50, 100, 200, 1000}; and
4. The O(log n) variant of Bender's algorithm, but augmented with the linearization graph of the DBLP DTD, using the methods of Section 2.6.

Experiment 1: Performance on a Bulk Insert

In this experiment, we evaluated the performance of the algorithms under a uniform query distribution. The experiment began with an empty database, which was then gradually initialized with the DBLP database. After every insertion, on average r reads were performed, where r was a fixed parameter taken from the set {0.01, 0.10, 1.00, 10.0}. Each read operation picked two nodes at random from the underlying database, using a uniform probability distribution, and compared their document order.

In all of our experiments, we measured the total time of the combined read and write operations, the number of read and write operations, and the number of relabelings. However, due to space considerations, and the fact that the other results were fairly predictable, we only include the graphs for total time.

This experiment was designed to test what is a worst case scenario for all the algorithms except for our randomized algorithm. We included this experiment because the extremely heavy paging incurred by the uniform query distribution demonstrates some of the practical problems with Bender's O(1) algorithm. As can be seen from the results in Figure 2.10, the randomized algorithm is easily the best performer.
Figure 2.10: Experiment 1 Results

The reason for this excellent performance is that on every insertion the randomized algorithm simply added one to the identifier of the last node in the database. While we could have added a special case to the other algorithms to handle insertions at the end of the database in a similar fashion, we instead opted to treat them no differently from insertions anywhere else in the database, to indicate the worst case performance of the algorithms. We included the randomized algorithm as a useful baseline against which to compare the other algorithms, because it represents the fastest possible implementation for this experiment.

We note that, as the ratio of reads increases, the performance of both Bender's O(1) algorithm and the schema based algorithm degrades relative to the other algorithms. We attribute this in both cases to the extra indirection involved in reading from the index. Also, because of the extremely heavy paging, even the small paging overhead incurred by an algorithm such as the schema based algorithm, which only infrequently loads an additional page due to a read from the index, has a massive effect on performance. Thus, although this experiment is slightly contrived, it does demonstrate that in some circumstances the indirection involved becomes unacceptable, given that values of r in real life will often be 100 or 1000. We note that one advantage of the schema based algorithm is that we can remove the indirection if necessary, whereas with the O(1) algorithm we cannot.

Experiment 2: Performance on a Non-Uniform Query Distribution

This experiment was identical to the first experiment, except that the reads were sampled from a normal distribution with mean |D|/2 and variance |D|/10, and we took r ∈ {0.01, 0.10, 1.00, 100.0}. The idea was to reduce the heavy paging of the first experiment, and instead simulate a database "hot-spot", a phenomenon which occurs in practice.

As can be seen from the results of Figure 2.11, this experiment took substantially less time to complete than the first experiment. It can be seen that, apart from the randomized algorithm (which again performed the minimal possible work), the schema based algorithm is clearly the best algorithm.
Indeed, it is impressive that it came so close to the randomized algorithm in performance.

Figure 2.11: Experiment 2 Results

Experiment 3: Worst-Case Performance for the Randomized Algorithm

The previous two experiments showed that the randomized algorithm had very good performance. We demonstrate in this experiment that, in some cases, it has very bad performance, far worse than the other algorithms. In spite of this, we believe the randomized algorithm is worth considering, as in many situations it is almost unbeatable. This experiment was identical to the first experiment, except that instead of inserting DBLP records at the end of the database, we inserted them at the beginning. As the experiment would take an unreasonable amount of time to run on the full DBLP, and because the worst case performance of the randomized algorithm is so pronounced in this case, we only ran the experiment on a small subset of DBLP.

As can be seen from the results in Figure 2.12, all the other algorithms easily beat the randomized algorithm's performance. Hence, in situations where worst case bounds must be guaranteed, the randomized algorithm is not a good choice.

Experiment 4: Verifying the Theory

Our final experiment was performed to verify the theory of Section 2.5, in particular, the expected runtime cost of the randomized variant. For this experiment, we simply loaded DBLP into a database, maintaining ordering information using Algorithm 2.5, for values of c from the set {1, 5, 10, 50, 100, 200, 1000}. The total running time of each of these algorithms, as a proportion of the c = 1 running time, is shown in Figure 2.13. The line of best fit was determined using a standard logarithmic regression. As can be seen, the expected logarithmic dependence on c is exhibited very closely.

2.8 Applications

2.8.1 Ancestor-Descendant Relationships

As mentioned in Section 2.1, our method can be applied to efficiently determine ancestor-descendant relationships. The key insight is due to Dietz [36], who noted that the ancestor query problem can be answered using the following fact: for two given nodes x and y of a tree T, x is an ancestor of y if and only if x occurs before y in the preorder traversal of T and after y in the postorder traversal of T. We note that while we have framed our discussion in terms of document order (that is, preorder traversal), our results apply equally well to the postorder traversal. Therefore, by maintaining two indices, one for preorder traversal and one for postorder traversal, each of which allows ordering queries to be executed quickly, we can determine ancestor-descendant relationships efficiently.
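Dietz's observation is easy to state in code. The following is a toy sketch: the nested-tuple tree encoding and the rank assignment are illustrative, standing in for the two ordering indices described above.

```python
def number(tree):
    """Assign (preorder, postorder) ranks to every node of a nested-tuple
    tree, where a node is (label, [children]); returns {label: (pre, post)}."""
    ranks, pre, post = {}, [0], [0]

    def walk(node):
        label, children = node
        p = pre[0]
        pre[0] += 1                      # preorder rank on the way down
        for c in children:
            walk(c)
        ranks[label] = (p, post[0])      # postorder rank on the way up
        post[0] += 1

    walk(tree)
    return ranks

def is_ancestor(a, b, ranks):
    """Dietz: a is an ancestor of b iff a precedes b in preorder
    and follows b in postorder."""
    (pa, qa), (pb, qb) = ranks[a], ranks[b]
    return pa < pb and qa > qb
```

For the tree r(a(c), b), the test correctly reports that r and a are ancestors of c, while a and b are unrelated.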

Figure 2.12: Experiment 3 Results

It is well-known that ancestor-descendant relationships can be used to evaluate many path expressions, using region algebras. In fact, some native XML databases, such as TIMBER [63], use the above technique by storing numerical identifiers giving the relative preorder and postorder position for each node. However, to the best of our knowledge, this is the first work to address the efficient maintenance of these identifiers in the context of dynamic XML databases.

As an example of how structural and range queries can be answered efficiently, consider the XPath query:

//Item[.//Price > 200]

This query can be answered by the following plan (where we adopt the definitions of Section 2.1). Taking:

S1 ← hash("Item"),
S2 ← hash("Price"), and
S3 ← bptree(>, 200),

we then find M where:

(∀n ∈ M)(M ⊆ S1)(∃x ∈ S2)(∃y ∈ S3)(n post y)

In the above, post is an ordering comparison in postorder traversal. We can then sort M into document order using the results in this chapter.

2.8.2 Query Optimization

A future application of our work is in the area of XML query optimization. As it is now possible to efficiently sort sets of nodes from an XML database into document order, it is also possible to use storage mechanisms in the database which cluster the data along common access paths. For instance, it may be possible to use B-trees to allow fast access to numerical data, as we can now sort the result back into document order quickly.

Figure 2.13: Experiment 4 Results

In order to take advantage of this, query optimizers will have to be augmented with an accurate cost estimation function for the ordering operator. This will allow the optimizer to choose optimal locations in the query plan at which to sort intermediate results into document order.

2.9 Conclusions

The contributions of this chapter are three-fold. Firstly, we have presented a simple randomized algorithm which performs well in many cases, but has poor worst case performance. Secondly, we have presented an improvement to the work of Bender et al, which yields a parameterized family of algorithms giving fine control over the cost of updates to the document ordering index versus the cost of querying this index. We believe this work is of particular relevance to XML databases, and anticipate a wide range of application domains which will utilize such databases in the future, with each domain having different requirements. Our final contribution is a general scheme which utilizes type information to improve the speed of document ordering indices. This work is especially significant due to its wide applicability, as it is not tied to any particular ordering algorithm. We have found that in practice many large XML repositories have a DTD or some other schema for constraint purposes, and hence we expect that this work will have great practical impact.

There are still many interesting topics for future research in this area. It is surprising that until now dynamic order maintenance has received little attention, as we believe it is absolutely crucial to the success of XML-based data repositories. We intend to extend work such as that of Lerner and Shasha [69] to native XML database systems, in order to produce a theoretically sound query optimization framework in the presence of ordering. Once XML query optimization is a better understood problem, it may also be possible to automate the selection of the parameter c described in Section 2.5. We also intend to

investigate whether there are other labeling schemes which can allow order to be maintained eﬃciently under updates.


Chapter 3

Eﬃcient Structural Joins

I know nothing about it; I am my own ancestor. — Andoche Junot (When asked as to his ancestry.)

3.1 Introduction

In recent years XML [12] has emerged as the standard for information representation and exchange on the Internet. However, finding efficient methods of managing and querying large XML documents is still problematic and poses many interesting challenges to the database research community. XML documents can essentially be modeled as ordered trees, where nodes in the ordered tree represent the individual elements, attributes and other components of an XML document. As discussed in Chapter 2, performing a preorder traversal yields the document order of the XML document.

Recently proposed XML query languages such as XPath [91] and XQuery [93], which have been widely adopted by both the research and commercial communities for querying XML documents, heavily rely upon regular path expressions for querying XML data. For example, consider the MEDLINE [79] database as an example XML document. The regular path expression //DateCreated//Month returns all Month elements that are contained by a DateCreated element. Without any indices upon the document, the typical approach to evaluating //DateCreated//Month requires a full scan of the entire XML database, which can be very costly if the document is large. Instead of this naive approach, recent work has attacked the XML query optimization problem by addressing two subproblems:

• The Ancestor-Descendant Problem: How do we determine the ancestor-descendant relationship between two nodes in an XML database efficiently? How do we maintain index structures supporting such queries under updates (both insertions and deletions)?

• The Structural Join Problem: Given two sets, one of potential ancestors and one of potential descendants, how do we efficiently determine all ancestor-descendant pairs from these sets?

As an example of how these problems arise, consider the path expression //DateCreated//Month described above.
Suppose that we can retrieve D, the set of all DateCreated elements in the database, and M, the set of all Month elements in the database. This could be achieved, for example, by storing the sets in an inverted list index, indexed by element name. If we have a fast way of determining ancestor-descendant relationships, then ∀d ∈ D and ∀m ∈ M, we can determine whether d is an ancestor of m (and hence determine whether m lies in the result of the path expression). Thus, we can answer the query using O(|D||M|) tests, and hence this method will be faster than the naive traversal if O(|D||M|) = o(n), where n is the number of nodes in the database — this is quite often the case in practice. Structural join algorithms attempt to reduce the number of comparisons even further.

There has been a significant amount of work on ancestor-descendant schemes [66, 68, 70, 90, 98, 99]. However, to date most of this research has focused only on providing an efficient test, and has neglected the important issue of finding an ancestor-descendant scheme which both supports efficient queries and can be quickly updated under insertions and deletions. In this chapter, we apply the work of Chapter 2

to the ancestor-descendant problem, in order to find schemes satisfying both of the constraints listed above.

Structural joins (also known as containment joins) are now considered a core operation in processing and optimizing XML queries. Various techniques have been proposed recently for efficiently finding the structural relationships between a list of potential ancestors and a list of potential descendants [7, 14, 23, 57]. Many of these proposals rely on some kind of database index structure, such as a B+ tree. Processing structural joins using these index structures increases both the resource requirements, such as memory consumption, and the maintenance overhead of the database. Furthermore, most of the schemes rely on numbering schemes which, up until now, have incurred significant relabeling costs during data updates. Instead of using external index structures to improve the state of the art structural join [7], as previous work has done, this chapter proposes simple, efficient strategies for skipping unmatched nodes during the structural join process.

This chapter makes several key contributions to both of the problems described above:

• We propose an improvement to the current state of the art stack based structural joins [7], which utilizes various strategies to skip nodes which do not occur in the output of the join. In contrast to previous work, our proposed extension does not require any external indices such as B-trees, and hence imposes less overhead on the underlying database system. As our proposed method does not employ any indices, and its entire operational cost is linearly proportional to the size of the query output, it can be extended to support streaming XML data.

• We present extensive experimental results on the performance of our proposed algorithms, using both real-world and synthetic XML databases. We show experimentally that our approach outperforms the traditional stack-based structural join algorithms of [7] by several orders of magnitude in many cases.

• We discuss how updates can affect the various ancestor-descendant schemes used in structural join processing, and show how the work of Chapter 2 can reduce the negative side-effects of updates upon XML databases for the two most popular methods in the literature for maintaining ancestor-descendant information: preorder/postorder identifiers and start/end identifiers.

The rest of this chapter is organized as follows. Section 3.2 gives the problem definitions and discusses relevant related work. Section 3.3 presents the assumptions about the database system upon which this chapter is based. We present our improvements to existing structural join algorithms in Section 3.4. In Section 3.5, we give an experimental analysis of our algorithms, and compare our results with some existing schemes for structural joins. In Section 3.6, we discuss the insertion performance of ancestor-descendant labeling schemes that are commonly used for structural join processing. Finally, Section 3.7 concludes the chapter.

3.2 Related Work

3.2.1 Ancestor-Descendant Labeling Schemes

As described in the introduction to this chapter, the ancestor-descendant problem has become one of the most important problems in XML databases. We now state the problem more precisely:

Definition 3.1 (The Ancestor-Descendant Problem) The ancestor-descendant problem is to efficiently determine whether x is an ancestor of y, for any two nodes x and y in the database.

At present, all ancestor-descendant schemes (and the only ones we will consider here) are labeling schemes, which assign to each node in the database some label, and define a comparison operator between labels that yields the ancestor-descendant test. For example, in [68], a document tree is viewed as a complete k-ary tree, where k is the maximum number of children of any node of the tree. "Virtual nodes" are used to make the tree complete. The identifier of each node is assigned according to the level-order tree traversal of the complete tree. It is then a simple matter to find the ancestors and children of a node using just the identifier. The problem with this approach is that as the arity and height of the complete tree increase, the identifiers may become huge. Also, if the constant k changes due to the insertion of new nodes, then all the identifiers must be recalculated from scratch. This makes the approach unrealistic for large, dynamic XML documents [70].
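To make the arithmetic behind this scheme concrete, the following sketch (our own illustration of the idea, not code from [68]) numbers nodes by their level-order position in the completed k-ary tree, with the root numbered 1. Ancestors can then be found by integer division on the identifier alone:

```python
def parent(i, k):
    """Parent of node i in a complete k-ary tree numbered in level order (root = 1)."""
    return (i - 2) // k + 1 if i > 1 else None

def children(i, k):
    """Identifiers of the k children of node i in the completed tree."""
    return [k * (i - 1) + 2 + j for j in range(k)]

def is_ancestor(x, y, k):
    """True iff x is a proper ancestor of y, computed from the identifiers alone."""
    y = parent(y, k)
    while y is not None and y > x:
        y = parent(y, k)
    return y == x
```

Note that the test climbs the (virtual) tree, so it takes time proportional to the height, and that any change to k invalidates every identifier, which is exactly the update problem described above.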

In [66], a labeling scheme is used such that the label of a node's ancestor is a prefix of the node's label. The idea is similar to the Dewey encoding [16], which can be used to check parent-child relationships easily. This method takes variable space to store identifiers, and the time to determine the ancestor-descendant relationship is no longer constant, but linear in the length of the identifier. The lack of a bound on the identifier makes it difficult to guarantee that such an index will be practically useful on large databases. Nevertheless, similar schemes have been considered recently in the literature, with equally bad worst-case bounds. For instance, Deschler et al [32] devised a Dewey-like scheme which utilized strings instead of numbers as entries. Insertions into the database could then be accommodated without having to relabel any surrounding nodes. We note that the worst case of this algorithm is consistent with the work of Cohen et al [25], which showed that any ancestor-descendant labeling scheme which does not relabel must in the worst case use identifiers linear in the size of the database. One of the most popular techniques in the literature for encoding ancestor-descendant information is the start/end technique [70, 98, 99]:

Definition 3.2 (The Start/End Labeling Scheme) The start/end labeling scheme labels each node x in the database with a pair of numbers ⟨Start_x, End_x⟩, such that:

• For all nodes x, Start_x < End_x, and

• For all nodes x and y, x is an ancestor of y if and only if Start_x < Start_y and End_y < End_x.

In this scheme, each node is assigned two integer values, denoting a range. The ancestor-descendant relationship is then determined by checking whether the range of one node is contained in the range of the other. These ranges can be easily constructed by a preorder traversal of the XML document.
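A minimal sketch of Definition 3.2 (our own illustration, with node names assumed unique): a single counter, read on entry to and exit from each node during one depth-first traversal, yields valid start/end ranges, and the ancestor test is range containment.

```python
from itertools import count

def assign_start_end(node, labels, counter):
    """node = (name, [children]); one DFS pass assigns a (start, end) range per name."""
    name, kids = node
    start = next(counter)              # counter read on entry...
    for kid in kids:
        assign_start_end(kid, labels, counter)
    labels[name] = (start, next(counter))   # ...and again on exit

def is_ancestor(x, y, labels):
    """x is a proper ancestor of y iff y's range nests strictly inside x's."""
    return labels[x][0] < labels[y][0] and labels[y][1] < labels[x][1]
```

For the tree r(a(b), c), the traversal yields r = (1, 8), a = (2, 5), b = (3, 4), c = (6, 7), and containment of ranges mirrors containment of subtrees.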
A similar technique is the preorder/postorder technique of Dietz [36], discussed in Section 2.8 of the previous chapter:

Definition 3.3 (The Preorder/Postorder Labeling Scheme) The preorder/postorder labeling scheme labels each node x with a pair of numbers ⟨Pre_x, Post_x⟩, such that for all nodes x and y, Pre_x < Pre_y if and only if x comes before y in the preorder traversal of the tree, and Post_x < Post_y if and only if x comes before y in the postorder traversal of the tree. Then, x is an ancestor of y if and only if Pre_x < Pre_y and Post_x > Post_y.

We briefly note that with both the start/end and preorder/postorder labeling schemes, if we also store with each node its depth in the database, we can answer not only ancestor-descendant queries, but also parent-child queries. For this reason, along with the fact that the space requirements on the labels are guaranteed to be reasonably small, these schemes are the most commonly used in native XML databases. Updating the labels of the start/end scheme due to insertions has been considered before in the literature, in the TIMBER database [63] and by Amagasa et al [9]. In both cases, updates are handled by using real numbers instead of integers as tag values. This accommodates a finite number of updates, by naturally leaving gaps between successive values. However, if we use log n bits to represent the floating point number, then we can only handle at most n updates, in the worst case, before running out of space. Obviously, this is exactly the same bound as for the integers, which in practice is quite low (e.g., 32 for 32-bit machines), and hence frequent updates in one position in the database will still result in poor update performance. An alternative method of updating start/end schemes is to use the techniques of Chapter 2. In this chapter, we will frame our structural join algorithms in terms of the preorder/postorder scheme.
In Section 3.6, we will investigate the update performance of both the start/end and preorder/postorder schemes. It should be emphasised that our structural join algorithms can work with ancestor-descendant labeling schemes other than the preorder/postorder scheme.
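As an illustration of Definition 3.3 (our own sketch, again with node names assumed unique), two counters driven by a single traversal produce the (Pre, Post) pairs, and the ancestor test reduces to two comparisons:

```python
from itertools import count

def assign_pre_post(node, labels, pre, post):
    """node = (name, [children]); pre and post are itertools.count counters."""
    name, kids = node
    labels[name] = entry = [next(pre), None]   # preorder number on entry
    for kid in kids:
        assign_pre_post(kid, labels, pre, post)
    entry[1] = next(post)                      # postorder number on exit

def is_ancestor(x, y, labels):
    """x is a proper ancestor of y iff Pre_x < Pre_y and Post_x > Post_y."""
    return labels[x][0] < labels[y][0] and labels[x][1] > labels[y][1]
```

For the tree r(a(b), c) this gives r = (1, 4), a = (2, 2), b = (3, 1), c = (4, 3); these are the labels assumed by the structural join algorithms later in the chapter.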

3.2.2 Structural Joins

Recently, several new algorithms, known as structural join algorithms, have been proposed for finding sets of document nodes that satisfy the ancestor-descendant relationship with another set of document nodes. Various approaches have been proposed, both using traditional relational database systems [44, 99] and using native XML database systems, such as Lore [78] and TIMBER [63]. The structural join can be defined as follows:

Algorithm 3.1: Slightly modified stack-tree based structural join as proposed in [7] (STJ-D). (NB: All algorithms in this chapter are simplified for ease of presentation: boundary cases are omitted and all boolean operations return false if a particular element does not exist or an index is out of range.)

STACK-TREE-DESC(A, D)
 1  a ← 0, d ← 0, R ← ∅, stack ← ∅
 2  while (d < |D| ∧ a < |A|) ∨ |stack| > 0 do
 3      if FOLLOWING(TOP(stack), A[a]) ∧ FOLLOWING(TOP(stack), D[d]) then
 4          POP(stack)
 5      elif PREORDER(A[a]) < PREORDER(D[d]) then
 6          PUSH(A[a], stack)
 7          a ← a + 1
 8      else
 9          APPEND((s, D[d]), R), ∀s ∈ stack
10          d ← d + 1

FOLLOWING(n, f)
// Returns true if and only if f follows n in document order, and f is not a descendant of n.
 1  return PREORDER(f) > PREORDER(n) ∧ POSTORDER(f) > POSTORDER(n)

ANCESTOR(d, a)
// Returns true if and only if a is an ancestor of d.
 1  return PREORDER(d) > PREORDER(a) ∧ POSTORDER(d) < POSTORDER(a)
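A runnable sketch of STACK-TREE-DESC (our own translation): nodes are represented only by their (preorder, postorder) label pairs, and an infinite sentinel stands in for the boundary cases the pseudocode omits.

```python
import math

def stj_d(A, D):
    """Stack-Tree-Desc (STJ-D): A and D are lists of (preorder, postorder)
    label pairs, each sorted by preorder (i.e. document order)."""
    INF = (math.inf, math.inf)        # sentinel for an exhausted list: "follows everything"
    def following(n, f):
        # f follows n in document order and is not a descendant of n
        return f[0] > n[0] and f[1] > n[1]
    a = d = 0
    out, stack = [], []
    while (d < len(D) and a < len(A)) or stack:
        cur_a = A[a] if a < len(A) else INF
        cur_d = D[d] if d < len(D) else INF
        if stack and following(stack[-1], cur_a) and following(stack[-1], cur_d):
            stack.pop()               # top of stack can match nothing further
        elif cur_a[0] < cur_d[0]:
            stack.append(cur_a)       # cur_a may be an ancestor of later descendants
            a += 1
        else:
            # cur_d is a descendant of every node currently on the stack
            out.extend((s, cur_d) for s in stack)
            d += 1
    return out
```

For the tree r(a1(b1, a2(b2)), b3), with A the a-elements and D the b-elements, the join emits (a1, b1), (a1, b2), (a2, b2), sorted by descendant as Theorem 3.1 states.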

Definition 3.4 (The Structural Join) Suppose we have two sets of nodes N1 and N2 from an XML document, and a predicate IS-ANCESTOR(n1, n2) which returns true if and only if n1 is an ancestor of n2 in the document. Then the structural join of N1 and N2 is:

N1 ⋈ N2 = { ⟨n1, n2⟩ | n1 ∈ N1, n2 ∈ N2, IS-ANCESTOR(n1, n2) }

(The term structural join was coined in [7]; other equivalent terms have been used in the literature, most commonly the term containment join [99].) The structural join is an extremely useful operator in the evaluation of XML queries. For instance, consider the evaluation of the path expression //a//b[.//c]. If A, B, and C represent the sets of all a, b, and c elements respectively, then some ways of evaluating this using structural joins include:

• π_d(A ⋈ π_a(B ⋈ C))

• π_a(π_d(A ⋈ B) ⋈ C)

• π_a(π_d(A ⋈ B) ⋈ π_d(A ⋈ C))

In the above, π_a and π_d are projection operators which extract the ancestor and descendant, respectively, from each binary tuple output by the join. Which of the above plans is optimal depends upon the selectivity of the query and its sub-queries — the problem of choosing an appropriate plan has already been studied in the literature by Wu et al [97]. Halverson et al [58] showed that structural join operations should be considered in conjunction with other access operators, such as navigational primitives; hence structural joins are not the complete solution to XML query processing, but are still an important component. Zhang et al [99] were some of the first to note the importance of the structural join, which they implemented using a merge join algorithm, the multi-predicate merge join. They found that their implementation of this join operator sped query evaluation up by an order of magnitude over more traditional relational techniques. However, as shown by Al-Khalifa et al [7], their join technique can cause a lot of unnecessary I/O during evaluation.
Another early merge-join based algorithm was presented in [70], which also causes excessive I/O. The current state-of-the-art structural join on XML data is described in [7]. It takes as input two lists of elements, both sorted with respect to document order, representing the list of ancestors (which

we will denote as the AList) and the list of descendants (which we will denote as the DList). We give the algorithm in Algorithm 3.1. The basic idea of the algorithm is to perform a merge of the two lists to produce the output, by iterating through them in document order. While it is iterating through the two lists, it determines the ancestor-descendant relationship between the current top item in a stack, which is maintained during the iteration, and the next node in the merge. The stack is the list of all ancestors of the current ancestor being considered — clearly, if any descendant matches the current ancestor node, then it must also match all the nodes in the stack. The algorithm, which we will refer to here as the STJ-D algorithm, is the Stack-Tree-Desc algorithm of [7], which returns the results sorted in document order with respect to the descendants. The authors also present in [7] a variant which sorts with respect to ancestors; as this variant is slower, we will not use it as a basis of comparison with our techniques. We summarize the correctness and theoretical optimality of this algorithm in the following theorem:

Theorem 3.1 (Theorem 3.1 of [7]) Algorithm 3.1 finds all ancestor-descendant pairs from the input lists, and returns them sorted in document order with respect to the descendants, in total time O(|AList| + |DList| + |OutputList|).

More recent work has extended this approach for improved speed. For example, the XML Region Tree (XR-Tree) approach to storing XML documents [57] uses a variant of a B+-tree with modified index key entries and lists to maintain the ancestor-descendant relationship between nodes. It then uses the stack-tree based join algorithm to carry out structural joins. The amortized I/O cost for inserting and deleting nodes in an XR-Tree is O(log_F N + C_DP), where N is the number of elements indexed, F is the fanout of the XR-Tree, and C_DP is the cost of one displacement of a stabbed element (see [57] for a fuller description of C_DP).
Although this approach can support structural joins of XML data by using the stack-tree based join, determining the ancestor-descendant relationship between two nodes is no longer a constant-time operation. Therefore, for large ancestor and descendant node sets it may not be as efficient as the original STJ-D algorithm. Also, any large set of random updates requires frequent updates of large parts of the XR-Tree. Therefore, maintaining the index for a large, changing XML database can be costly. In [14], the stack-tree join algorithm was extended to match more general selection patterns on an XML data tree. The work done in [7, 14, 23, 58] is closely related to this chapter. For instance, the B+ algorithm proposed in [23] speeds up the stack-tree based join algorithm by using a combination of prebuilt indices such as a B+-tree and an R-Tree. It utilizes B+-tree indices built on the element start-tag positions, and hence uses B+-tree range queries to skip descendants that do not match ancestors in the structural join. However, these approaches are not effective in skipping ancestor nodes, as stated in [57].

3.3 Architectural Assumptions

One of the advantages of our approach is that we make very few assumptions about the underlying database architecture. In fact, for our structural join algorithms to work, we need only posit a few rules for the database:

1. We assume that we have a solution to the ancestor-descendant labeling problem. In this section, all of our algorithms will be framed in terms of the preorder/postorder labeling scheme discussed above. We will defer discussion of why we chose this ancestor-descendant labeling scheme until Section 3.6. In any case, any other ancestor-descendant labeling scheme could take its place if necessary.

2. We assume that the input sets N1 and N2 to a structural join N1 ⋈ N2 are sorted in document order.

3. We assume that we can access the input sets N1 and N2 to a structural join N1 ⋈ N2 as an array (that is, we assume random access to the input sets). In fact, this condition is a little strong — we merely need a mechanism to skip forward or backward through the list, where the skip size is approximately of the size that is requested. More precisely, we do not strictly need to skip forward n elements when requesting a skip of size n, although in this chapter we will assume for simplicity that this is the case.

We note that the first two rules given above are a requirement for all current structural join algorithms, and that the only specific extra assumption we have made is the third. However, this assumption is more

reasonable than the assumptions made by other schemes (such as [57]), and hence our scheme is second only to Al-Khalifa et al's [7] in generality. We believe that the third condition is not an onerous one, and that most XML database systems can satisfy it, primarily for the following reasons:

• The input sets to the structural join algorithm, even if they contain millions of nodes, are in practice only a few megabytes large. This means that structural joins are generally carried out in memory, and hence it would be easy to store the input sets in an array.

• In some new systems, such as the Niagara system [58], structural joins are performed directly over more complicated index structures such as B-trees. Even in these cases, our methods can be applied, by performing the skips where possible. For instance, in many B-tree implementations it would still be possible to skip forward and backward within a data page, and it may even be possible to perform some limited forms of inter-page skipping. Although we do not give details in this chapter, our techniques can be easily adapted to these situations.

• Many query evaluation engines use the iterator principle, where results from subexpressions are evaluated one result at a time, with the results being gradually fed into higher layers. While our skipping strategies are less applicable in this situation than in others, it should be noted that skipping can still be performed in some cases, although the exact details are highly dependent on the set of query operators implemented. Nevertheless, we believe our methods can still be used in these situations. In fact, as in Halverson et al [58], we believe that the evaluation cost of the skipping strategies should be taken into account during query plan selection, as plans where skipping can be used may be substantially faster than plans which cannot use skipping due to the choice of operators.

3.4 Skip Joins

This section presents our algorithms to skip unmatched nodes for structural joins. It also describes various strategies that can be used for skipping, which lead to different performance characteristics. We first classify all structural joins into three classes — each class of structural joins has different applications for query optimization, and the classes are sufficiently different that they should be optimized separately. While previous work has considered variants of structural joins that sort in different ways, this is the first work to specialize the structural join operator for cases where only the ancestor or descendant component of the result is necessary.

1. Ancestor-Descendant Join (AD-Join): the first type of structural join, which has previously been considered in the literature, returns the set of ancestor-descendant node pairs. For example, the query //a[.//b=.//c] might be evaluated by first evaluating //a//b and //a//c using an AD-Join, and then returning the appropriate ancestors by merging the results of these two joins based upon equality of descendants. Hence, in each of the two joins, we would use the set R_AD = { ⟨a, d⟩ | a ∈ A, d ∈ D such that a is an ancestor of d }.

2. Ancestor Join (A-Join): this type of structural join filters a set of ancestor nodes by selecting only those nodes that have a descendant within the set of potential descendants. For example, for the query //a[.//b], an A-Join should return the node set R_A = { a ∈ A | ∃d ∈ D such that a is an ancestor of d }.

3. Descendant Join (D-Join): the third type of structural join filters a set of descendant nodes by selecting only those nodes that have an ancestor within the set of potential ancestors. For example, for the query //a//b, a D-Join should return the node set R_D = { d ∈ D | ∃a ∈ A such that a is an ancestor of d }.
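The three result sets can be pinned down by quadratic reference implementations (our own illustration, useful only as a specification against which the optimized joins can be tested):

```python
def ad_join(A, D, is_anc):
    """R_AD: all ancestor-descendant pairs."""
    return [(a, d) for a in A for d in D if is_anc(a, d)]

def a_join(A, D, is_anc):
    """R_A: ancestors with at least one descendant in D."""
    return [a for a in A if any(is_anc(a, d) for d in D)]

def d_join(A, D, is_anc):
    """R_D: descendants with at least one ancestor in A."""
    return [d for d in D if any(is_anc(a, d) for a in A)]

# With preorder/postorder labels, the predicate is just two comparisons:
is_anc = lambda a, d: a[0] < d[0] and a[1] > d[1]
```

Note how R_A and R_D are never larger than their input sets, whereas R_AD can be quadratic in them; this asymmetry is what the specialized skips below exploit.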
The B+ algorithm proposed in [23] suggested that the state-of-the-art STJ-D algorithm proposed in [7] has the disadvantage of having to scan through the entire ancestor list for the join operation, and hence in some cases unnecessarily scans through ancestor nodes that do not contain any nodes in the descendant list. A similar phenomenon can occur during the scanning of the descendant list. Their solution to this was to use a B+-tree, which requires a prebuilt index system for the database. In this section, we introduce a modification to the STJ-D algorithm which utilizes a skipping mechanism to skip

ancestor and descendant nodes that do not match, and hence may be safely ignored during the structural join.

[Figure 3.1: Possible skipping strategies during a structural join. The figure shows an AList and a DList in document order, with arrows marking a-skip, ma-skip, d-skip, and md-skip opportunities.]

Figure 3.1 represents a particular instance of a structural join. The circled regions denote the ancestor-descendant relationship between nodes. For example, {d0, d1, d2} is the set of descendants of a0. Depending on the type of query, we can utilize different skipping mechanisms to optimize the join. If we had a priori knowledge about the distribution of the nodes in the input sets, then the best outcome we could achieve in each case would be as follows:

1. AD-Join: For an AD-Join, in the best case we would find the result set without ever considering the nodes which are not included in this set. If we were scanning through the two sets, then the a-skip (ancestor skip) and d-skip (descendant skip) arrows in the figure indicate the skips through the list that should be executed.

2. A-Join: For an A-Join, we can compute the result set by skipping even more nodes than in the AD-Join case. For example, a0 matches the nodes d0, d1, and d2, but for an A-Join we do not need to count each of these occurrences. Once we have matched a0 against d0, the nodes d1 and d2 contribute no further useful information, and hence they can be skipped. In the figure, we indicate the nodes that should be skipped (in addition to the nodes skipped during an AD-Join) by the md-skip (matched descendant skip) arrows.

3. D-Join: For a D-Join, the situation is exactly the same as in the A-Join, except that we can skip additional ancestors instead of descendants. For instance, once we have matched a6 against the set of descendants {d6, d7, . . . , d11}, we need not consider the nodes a7 and a8, which have no additional descendants (this situation arises when a7 and a8 are descendants of a6). We have denoted the nodes to be skipped in this case using the ma-skip (matched ancestor skip) arrows.
The set of nodes which can be skipped varies drastically depending upon the input sets to the structural join, and it is clear that we do not have a priori knowledge of what nodes to skip. Our goal is instead to approximate the "perfect" skips we have illustrated above with a series of imperfect jumps. If, after we have made a skip, we find we have gone too far, then we can skip back; if, on the other hand, we find we have not gone far enough, then we can continue skipping. We will now discuss three extensions of the STJ-D algorithm (one for each of the above cases), which incorporate a generic skipping mechanism. As we have described above, there are four kinds of skips that, at various stages, we will need to perform. We give the precise definitions of these skipping functions in Algorithm 3.2. The reader may wish to verify that these definitions make sense in the context of our discussion of Figure 3.1. The definitions in Algorithm 3.2 are purely theoretical — if one were to implement the skipping functions according to those definitions, it is unlikely one would gain anything over the standard structural join. Instead, we will first define our variants of the STJ-D algorithm

Algorithm 3.2: Skipping functions used in skip joins

A-SKIP(a, d, A)
// Skip from a to the first a′ ∈ A following a such that a′ is either an ancestor of d, or follows an ancestor of d.
1  return min_PREORDER(a′) { a′ ∈ A | PREORDER(a′) > PREORDER(a), ANCESTOR(d, a′) ∨ FOLLOWING(d, a′) }

D-SKIP(d, a, D)
// Skip from d to the first d′ ∈ D following d such that d′ is either a descendant of a or follows a descendant of a.
1  return min_PREORDER(d′) { d′ ∈ D | PREORDER(d′) > PREORDER(d), ANCESTOR(d′, a) ∨ FOLLOWING(a, d′) }

MA-SKIP(a, s, A)
// Return the first a′ ∈ A following a which is not a descendant of s.
1  return min_PREORDER(a′) { a′ ∈ A | FOLLOWING(s, a′) }

MD-SKIP(d, a, D)
// Return the first d′ ∈ D following d which is not a descendant of a.
1  return min_PREORDER(d′) { d′ ∈ D | PREORDER(d′) > PREORDER(d), ¬ANCESTOR(d′, a) }

[Figure 3.2: Skipping scenarios for an AD-Join. (a) Skipping ancestors; (b) skipping descendants.]

assuming we have an efficient implementation of the skipping functions, and we will turn to the problem of finding these efficient implementations in Section 3.4.4.

3.4.1 Skip Joins for Ancestor-Descendant Joins

Our first extension, which performs a structural join and returns ancestor-descendant node pairs, is the most straightforward of the three. The pseudocode is given in Algorithm 3.3 — as can be seen, only lines 7 and 13, where the cursor is advanced in the input streams, have been modified to instead skip ahead to the next matching node. As in the original STJ-D algorithm, we maintain a stack of nested ancestors, so that if one of them matches a descendant, all of them match it. We continue to maintain the stack in this variant of a skip join, because we must return all matching ancestor-descendant pairs. The two situations in which we can skip are illustrated in Figure 3.2. First, consider the situation of Figure 3.2(a). With the original STJ-D algorithm, each of the ancestor nodes under the dotted line would first be pushed onto the stack at line 7 of Algorithm 3.1, and popped back off the stack at line 5 of Algorithm 3.1 in the very next iteration. We can minimize this excessive pushing and popping by instead skipping ahead to the first matching ancestor of the current descendant node. The second situation, in Figure 3.2(b), occurs when we advance to a descendant whilst we have an

Algorithm 3.3: A skip join which returns ancestor-descendant node pairs (AD-Join).

SKIP-JOIN-AD(A, D)
 1  a ← 0, d ← 0, R ← ∅, stack ← ∅
 2  while (d < |D| ∧ a < |A|) ∨ |stack| > 0 do
 3      if FOLLOWING(TOP(stack), A[a]) ∧ FOLLOWING(TOP(stack), D[d]) then
 4          POP(stack)
 5      elif PREORDER(A[a]) < PREORDER(D[d]) then
 6          PUSH(A[a], stack)
 7          a ← A-SKIP(a, D[d], A)
 8      else
 9          APPEND((s, D[d]), R), ∀s ∈ stack
10          if |stack| > 0 then
11              d ← d + 1
12          else
13              d ← D-SKIP(d, A[a], D)
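A runnable sketch of SKIP-JOIN-AD (our own translation), using naive linear-scan versions of A-SKIP and D-SKIP as stand-ins for the strategies of Section 3.4.4; the point here is the join logic and its output, not the skipping performance:

```python
import math

INF = (math.inf, math.inf)   # sentinel for an exhausted list

def ancestor(d, a):
    """a is an ancestor of d (labels are (preorder, postorder) pairs)."""
    return d[0] > a[0] and d[1] < a[1]

def following(n, f):
    """f follows n in document order and is not a descendant of n."""
    return f[0] > n[0] and f[1] > n[1]

def a_skip(a, d, A):
    """Naive A-SKIP: next index > a in A whose node is an ancestor of d or follows d."""
    for i in range(a + 1, len(A)):
        if ancestor(d, A[i]) or following(d, A[i]):
            return i
    return len(A)

def d_skip(d, a, D):
    """Naive D-SKIP: next index > d in D whose node is a descendant of a or follows a."""
    for i in range(d + 1, len(D)):
        if ancestor(D[i], a) or following(a, D[i]):
            return i
    return len(D)

def skip_join_ad(A, D):
    """SKIP-JOIN-AD: all ancestor-descendant pairs, sorted by descendant."""
    a = d = 0
    out, stack = [], []
    while (d < len(D) and a < len(A)) or stack:
        cur_a = A[a] if a < len(A) else INF
        cur_d = D[d] if d < len(D) else INF
        if stack and following(stack[-1], cur_a) and following(stack[-1], cur_d):
            stack.pop()                      # top of stack can produce no more output
        elif cur_a[0] < cur_d[0]:
            stack.append(cur_a)
            a = a_skip(a, cur_d, A)          # line 7: jump over unmatched ancestors
        else:
            out.extend((s, cur_d) for s in stack)
            if stack:
                d += 1
            else:
                d = d_skip(d, cur_a, D)      # line 13: jump over unmatched descendants
    return out
```

On any input, the output matches STJ-D; the difference is only which cursor positions are visited along the way.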

Algorithm 3.4: A skip join which returns only matched ancestor nodes (A-Join).

SKIP-JOIN-A(A, D)
 1  a ← 0, d ← 0, R ← ∅, stack ← ∅
 2  while (d < |D| ∧ a < |A|) ∨ |stack| > 0 do
 3      if FOLLOWING(TOP(stack), A[a]) ∧ FOLLOWING(TOP(stack), D[d]) then
 4          POP(stack)
 5      elif PREORDER(A[a]) < PREORDER(D[d]) then
 6          PUSH(A[a], stack)
 7          a ← A-SKIP(a, D[d], A)
 8      else
 9          APPEND(s, R), ∀s ∈ stack
10          d ← D-SKIP(d, A[a], D)
11          if |stack| > 0 then
12              d ← MD-SKIP(d, TOP(stack), D)
13          else
14              d ← d + 1

empty ancestor stack. The empty stack indicates that we have no potential ancestors for this descendant, and hence we can skip ahead until we ﬁnd a descendant of the current ancestor. It should be pointed out that although it is possible to skip nodes in D even when the stack is not empty, the performance gain rarely covers the penalty of the overhead. This is because, in practice, real world XML trees have very shallow depth, and hence the number of skippable nodes within nested regions is generally small.

3.4.2 Skip Joins for Ancestor Structural Joins

Many XML queries require the efficient filtering of ancestor nodes. For example, the query //a//b[.//c] returns a set of b nodes which all have an ancestor a and a descendant c. To process this query using a structural join algorithm, we must first join //a//b, then //b//c, and finally merge the two joins together. If we use the STJ-D algorithm, then each structural join returns ancestor-descendant pairs. However, in the join for //a//b, we can throw away the ancestor part of each tuple; similarly, in the join for //b//c, we can throw away the descendant part. While this seems trivial, the point at which the ancestors or descendants are thrown away is critical. For instance, in the case where a single b node joins with multiple c nodes, the size of the join result is proportional to the number of c nodes. However, if we throw away the multiple descendants while we are doing the join, then the result size drops to just one. The benefits of this technique compound as structural joins are composed.

Algorithm 3.5: A skip join which returns only matched descendant nodes (D-Join).

SKIP-JOIN-D(A, D)
 1  a ← 0, d ← 0, R ← ∅, s ← φ
 2  while (d < |D| ∧ a < |A|) ∨ s ≠ φ do
 3      if FOLLOWING(s, A[a]) ∧ FOLLOWING(s, D[d]) then
 4          s ← φ
 5      elif PREORDER(A[a]) < PREORDER(D[d]) then
 6          if s ≠ φ then
 7              a ← MA-SKIP(a, s, A)
 8          else
 9              s ← A[a]
10              a ← A-SKIP(a, D[d], A)
11      else
12          APPEND(D[d], R)
13          if s ≠ φ then
14              d ← d + 1
15          else
16              d ← D-SKIP(d, A[a], D)

[Figure 3.3: Skipping strategies for A-SKIP. (a) Binary skipping strategy; (b) exponential skipping strategy.]

In this section, we will consider the improvements to Algorithm 3.3 that can be made in the case where we only need the ancestors, and not the descendants. The pseudocode for the modified algorithm is given in Algorithm 3.4. The modification occurs in the case when the stack is not empty, and we have just matched a descendant node: since we do not care about further descendants which match the current ancestor node, we can perform a matched descendant skip to find the next non-matching descendant. In situations where there are a large number of matching descendants (which can happen in many practical situations), this can result in a huge reduction in result set size.

3.4.3 Skip Join for Descendant Structural Join

The modified version of Algorithm 3.3 which returns only descendants is given in Algorithm 3.5. In this case, we can completely rid ourselves of the stack of ancestor nodes, as keeping only the topmost ancestor yields the same result set. Hence, as soon as we find a matching ancestor-descendant pair, we can immediately perform a matched ancestor skip to move past all other ancestors which match the current descendant node. We expect this type of structural join to perform well regardless of the size of the ancestor and descendant lists, due to the fact that the maintenance of the stack is no longer necessary.

3.4.4 Skipping Strategies

Until now, we have deferred discussion of the actual implementation of the skipping functions listed in Algorithm 3.2. As discussed previously, without a priori knowledge of the structure of the input sets, it is not possible for us to implement these skipping functions exactly in all situations. Hence, we must instead rely on approaches which converge on the correct node gradually — if the convergence is fast

Algorithm 3.6: Binary skipping of ancestor nodes

A-BINARY-SKIP(minA, maxA, d, A)
 1  while minA ≤ maxA do
 2      x̄ ← ⌊(minA + maxA)/2⌋
 3      if POSTORDER(A[x̄]) > POSTORDER(d) then
 4          if POSTORDER(A[x̄ − 1]) < POSTORDER(d) then
 9              return x̄
10          else
11              maxA ← x̄ − 1
12      else
13          if PREORDER(A[x̄]) > PREORDER(d) then
14              maxA ← x̄ − 1
15          else
16              minA ← x̄ + 1
17  return |A|

enough, then the extra effort required to skip was worth it. In this section, we will consider two different skipping strategies: a simple binary search, and an exponential technique similar to the "slow start" found in TCP/IP. In the interests of brevity, we will restrict our discussion to the A-SKIP function, but the ideas are easily adapted to the other three skipping functions.

Binary Skipping

Here we propose a skipping strategy which uses a simple binary search, as described in Algorithm 3.6. As illustrated in Figure 3.3(a), the function uses a standard binary search (with the upper bound being the end of the set) to traverse the ancestor list in order to find a node satisfying the required structural condition. Clearly, if the list has n nodes, then the binary skipping strategy takes on average O(log n) skips to find the appropriate node (note, however, that the number of skips decreases as the cursor nears the end of the list). We believe this approach is efficient for processing queries such as //Month[text()="03"]: in this case, even if there exists a large set of Month elements, there may be only a small subset of them that matches the predicate. This yields a large ancestor-to-descendant node ratio and, in most cases, large sections of the ancestor node list need not be visited at all. The primary disadvantage of this skipping strategy is that the first few accesses are spread far apart, and hence there is little locality of reference. This could cause disk page and cache misses on large input sets.

Exponential Skipping

As the average number of skips for the binary skipping approach is O(log n), the worst case for the binary skipping strategy occurs when most of the nodes in the ancestor node list are matched. In this case, every call to the skipping function requires approximately log n skips to find the next node, when the best strategy would be to simply advance the cursor by one node.
To alleviate this problem, we propose an exponential skipping strategy, similar to the “slow start” of TCP/IP. The exponential skipping strategy ﬁrst skips through the list using exponentially increasing gaps, e.g. {1, 2, 4, 8, 16, . . .}. When it overshoots the target node, we then switch to binary search with the boundaries of the search set to the start and end position of the last exponential skip made. The pseudocode is described in Algorithm 3.7.
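Both strategies can be sketched in Python (an illustrative sketch, not the thesis's C implementation; nodes are represented as (preorder, postorder) pairs, the ancestor list is assumed sorted in document order, and the guard for position 0 is an addition not present in the pseudocode):

```python
# Illustrative sketch of the two skipping strategies (not the thesis's C
# implementation).  Each node is a (preorder, postorder) pair and the
# ancestor list A is sorted in document (preorder) order.

def a_binary_skip(lo, hi, d, A):
    """Binary-search A[lo..hi] for the first position whose postorder
    identifier exceeds postorder(d); returns len(A) if there is none.
    (The lo == 0 guard is an addition not present in Algorithm 3.6.)"""
    d_pre, d_post = d
    while lo <= hi:
        mid = (lo + hi) // 2
        if A[mid][1] > d_post:
            if mid == 0 or A[mid - 1][1] < d_post:
                return mid            # first node past d in postorder
            hi = mid - 1
        elif A[mid][0] > d_pre:
            hi = mid - 1
        else:
            lo = mid + 1
    return len(A)

def a_exponential_skip(a, d, A):
    """Gallop forward from position a + 1 with doubling gaps ("slow
    start"), then finish with a binary search inside the last gap."""
    lo, hi, step = a + 1, len(A) - 1, 1
    x = a + 1
    while x < len(A):
        if A[x][1] < d[1]:            # still before d in postorder
            lo = x
            x += step
            step *= 2
        else:
            hi = x                    # overshot: bracket the target
            break
    return a_binary_skip(lo, hi, d, A)
```

For example, with A = [(2, 5), (6, 9), (10, 13), (14, 17)] and d = (11, 12), both functions return index 2, the first ancestor-list position whose postorder identifier exceeds that of d.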

3.4.5  Skipping For Streaming Data

Chien et al [23] note that no prebuilt index can outperform sequential scan based structural join algorithms for streaming data, because indices cannot send any feedback to the producer modules. As we have shown, our algorithms do not use any prebuilt indices for node skipping. Therefore, we can easily adapt our techniques to suit the on-the-fly join strategies needed for streaming data, based on the assumption that the incoming streams are in document order. For processing streaming data, we set a fixed buffer size for the stream input, and a page size proportional to the buffer size, thus simulating the same environment we would normally have for skip joins. While the current position is under the high boundary, we simply load more pages from the buffer pool, until the position passes the buffer size. Then, we set the current position to the buffer size and perform a sanity check on whether we have skipped past the desired ancestor or descendant node. If not, we flush the buffer and load more data from the stream.

Algorithm 3.7: Exponential skipping of ancestor nodes

A-EXPONENTIAL-SKIP(a, d, A)
1   minA ← a + 1, maxA ← |A| − 1, δ ← 1, x̄ ← minA
2   while x̄ < |A| do
3       if POSTORDER(A[x̄]) < POSTORDER(d) then
4           minA ← x̄
5           x̄ ← x̄ + δ
6           δ ← 2δ
7       else
8           maxA ← x̄
9           break
10  return A-BINARY-SKIP(minA, maxA, d, A)

Name      Number of Elements   Size (MB)   Depth
DBLP      3,803,281            160         6
MEDLINE   2,768,743            130         7
XMark     2,921,323            204         12

Table 3.1: Properties of the experimental data sets

3.5  Experimental Results

In this section, we present our experimental results on the performance of our structural join algorithms on both real-world and synthetic XML data sets. We compare the performance of all join algorithms proposed in this chapter with the original Stack-Tree-Desc algorithm described in [7].

3.5.1  Experimental Setup

The experiments were carried out on a machine with dual Intel Itanium 800MHz processors, 1 GB of RAM, and a 40 GB SCSI hard-drive. The machine ran the Debian GNU/Linux operating system with the 2.4.20 SMP kernel. The data sets for our experiments were the real-world data sets DBLP [30] and MEDLINE [79], and a data set randomly generated by XMark [88]. The structure and size of these data sets are detailed in Table 3.1. Table 3.2 lists the join algorithms we compare in our experiments and the abbreviations with which we refer to them in the results of this section. We implemented the join algorithms using the XPath processor from the SODA XML database engine (available from http://www.sodatech.com). In order to increase the accuracy of the experiments, we disabled all database indexing, and we implemented our join algorithms and the STJ-D algorithm using exactly the same code base in C. For each of the experiments, we first scanned the database to filter out all elements which do not match the required element name before the structural join. Both the ancestor and the descendant lists, along with their ordering information, were stored in memory, and no swapping to disk was performed during the experiments. Finally, we measured only the time spent on the structural join algorithm itself.

We defined a set of XPath expressions that capture a wide variety of real-life access patterns; they are listed in Table 3.3, along with the number of nodes that satisfy each expression. Our experiments joined pairs of the result sets listed in this table using a structural join (of course, we only joined together result sets which came from the same underlying data set). For example, we used the expression A1//D1 (as defined in the table) to compute the result of the path expression //dblp//title[.="The Asilomar Report on Database Research."].

Abbreviation   Algorithm
STJ-D-Join     Stack-Tree-Join-Desc [7]
AD-Joine       Skip-Join-AD with exponential skipping strategy
A-Joine        Skip-Join-A with exponential skipping strategy
D-Joine        Skip-Join-D with exponential skipping strategy
AD-Joinb       Skip-Join-AD with binary skipping strategy
A-Joinb        Skip-Join-A with binary skipping strategy
D-Joinb        Skip-Join-D with binary skipping strategy

Table 3.2: Abbreviations for algorithms

Set   Database   Query                                                     Set Size
A1    DBLP       //dblp                                                    1
A2    DBLP       //article                                                 128,533
A3    DBLP       //inproceedings                                           240,685
A4    DBLP       /*/*/*                                                    3,424,646
A5    MEDLINE    //MedlineCitation                                         30,000
A6    XMark      //listitem                                                106,508
A7    XMark      //keyword                                                 122,924
A8    XMark      //bold                                                    125,958
D1    DBLP       //title[.="The Asilomar Report on Database Research."]    1
D2    DBLP       //author[.="Jeffrey D. Ullman"]                           227
D3    DBLP       //author                                                  820,037
D4    DBLP       /*/*                                                      375,225
D5    DBLP       //sup                                                     1,155
D6    DBLP       /*/*/*/*/sup                                              50
D7    MEDLINE    //Year                                                    92,624
D8    MEDLINE    //Year[.="2000"]                                          5,426
D9    XMark      //listitem                                                106,508
D10   XMark      //keyword                                                 122,924
D11   XMark      //bold                                                    125,958

Table 3.3: Node sets used in the experiments

3.5.2  Results and Observations

The full experimental results are presented in Table 3.4. As described above, each query consisted of evaluating a structural join of one of the ancestor sets and one of the descendant sets from Table 3.3. Columns |A| and |D| give the sizes of the two input sets being joined, and column |R| gives the size of the output of the join operation. As discussed in this chapter, there are three different types of structural joins for XML data: an AD-Join returns ancestor-descendant node pairs, an A-Join returns matching ancestor nodes only, and a D-Join returns matching descendant nodes only. We evaluated the performance of each of these joins, using both binary and exponential skipping strategies. For example, in query Q1, column A-Joine gives the time it took to execute the query //article[.//title[.="The Asilomar Report on Database Research."]] using the exponential skipping strategy. Column D-Joine gives the time it took to execute //article//title[.="The Asilomar Report on Database Research."]. Column AD-Joine gives the time of a join for the same two input sets, but in this case returning both ancestors and descendants; this would be useful when composing structural joins for more complicated queries, such as those found in XQuery.
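The three output types can be made concrete with a naive reference implementation (an illustrative sketch using a nested loop rather than the stack-based algorithms of this chapter; nodes are (preorder, postorder) pairs):

```python
# Naive reference implementation of the three structural join outputs.
# Each node is a (preorder, postorder) pair; a is an ancestor of d iff
# pre(a) < pre(d) and post(a) > post(d).

def is_ancestor(a, d):
    return a[0] < d[0] and a[1] > d[1]

def ad_join(A, D):   # all matching (ancestor, descendant) pairs
    return [(a, d) for a in A for d in D if is_ancestor(a, d)]

def a_join(A, D):    # ancestors with at least one matching descendant
    return [a for a in A if any(is_ancestor(a, d) for d in D)]

def d_join(A, D):    # descendants with at least one matching ancestor
    return [d for d in D if any(is_ancestor(a, d) for a in A)]
```

For instance, with A = [(1, 10), (2, 5)] and D = [(3, 4), (7, 9)], the AD-Join yields three pairs, while the A-Join and D-Join each return two nodes; the skip join algorithms compute the same outputs without examining every pair.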

Q     A    D     |A|         |D|       |R|       STJ-D      AD-Joine   A-Joine     D-Joine     AD-Joinb   A-Joinb     D-Joinb
Q1    A2   D1    128,533     1         1         66,518     131        139         145         117        122         118
Q2    A3   D2    240,685     227       116       119,747    1,197      1,224       1,186       2,359      2,434       2,391
Q3    A3   D3    240,685     820,037   557,868   450,257    470,990    357,005     313,501     815,160    1,528,636   1,485,116
Q4    A1   D4    1           375,225   375,225   197,807    200,628    224         169,168     205,983    235         167,333
Q5    A4   D5    3,424,646   1,155     1,155     1,754,825  14,374     14,501      13,984      28,833     29,855      29,504
Q6    A4   D6    3,424,646   50        50        1,742,093  2,796      2,943       2,806       3,237      3,246       3,231
Q12   D1   A2    1           128,533   0         44,349     331        351         348         309        380         381
Q7    A5   D7    30,000      92,624    92,624    72,243     75,152     76,683      48,895      114,355    152,286     147,618
Q8    A5   D8    30,000      5,426     5,426     21,567     8,137      7,776       6,823       20,762     26,058      25,253
Q9    A6   D9    106,508     106,508   39,244    77,193     85,852     84,167      63,065      180,529    187,655     109,155
Q10   A8   D10   125,958     122,924   6,636     117,640    103,772    102,221     98,883      337,705    342,628     219,683
Q11   A7   D11   122,924     125,958   7,485     116,578    103,638    101,965     99,464      342,435    338,975     218,537

Table 3.4: Runtime (µs) for different structural joins on different queries

There are many interesting points that can be observed about the results of Table 3.4:

• There is a close correlation between the performance of a skip join algorithm and the ratio of the result set size to the sizes of the input sets upon which skipping is performed. For AD-Joins, the performance is correlated to the ratio |R|/max{|A|, |D|}; for A-Joins, to |R|/|A|; and for D-Joins, to |R|/|D|. The lower these ratios are, the greater the speed-up over the traditional STJ-D algorithm. Indeed, in queries Q1, Q2, Q5, Q6 and Q8, we see improvements of up to two orders of magnitude! This is not surprising: if the size of the output is much smaller than the size of the input sets, then a high proportion of the input sets is not included in the output, and hence there are many skippable nodes. We believe that in practice there are realistic queries which result in almost any value for these ratios, and hence our skip join algorithms are useful in many practical cases; at the same time, we expect STJ-D to also be useful, especially when the ratios are high.

• In queries where skipping does not perform significantly better than the STJ-D algorithm (queries Q3, Q4, Q7, Q9, Q10, and Q11), we find that the skipping algorithms are consistently slightly slower than the STJ-D algorithm. There is no speed-up in these cases because there are fewer opportunities for skipping: the result set is roughly the same size as the input sets. In the case of Q3 and Q4, the AD skip join is outperformed by the STJ-D join by a small margin of approximately 4%. The Q3 query evaluates //inproceedings//author on DBLP; in DBLP, both inproceedings and author are very frequent, and within every inproceedings element there is always at least one author element. Therefore, skipping through ancestor or descendant lists is useless for this query. As a result, the skipping algorithms reduce to the behavior of STJ-D, and the extra time taken is due to the unnecessary skips. Q4 evaluates //dblp//*, which has an extremely small ancestor list of only one element, and an extremely large descendant list (the entire database).

Again, in this scenario, skipping is not useful and the extra operations become an overhead that STJ-D does not have. However, as can be seen from the results, the overhead is only 4%, which is still quite acceptable given the gains on other queries. We also note that both the A-Join and D-Join are actually faster than STJ-D here, mainly because the results are only single nodes, and hence there is reduced usage of the stack (depending on the number of input ancestor nodes). Similar results hold for Q7. The queries Q9, Q10 and Q11 are performed on random data sets created by XMark. The data set is highly nested, with the practically rare and unnatural property that two distinct element names interleave with each other multiple times on a single path (e.g. //keyword//bold//keyword//bold). Note that the ratio |D|/|R| on Q9 is roughly two, which means that, on average, mismatched descendant nodes and matched nodes interleave each other. This pattern makes skipping difficult, and hence the algorithms yield similar performance to that of an STJ-D join.

• It is clear that the manipulation of the stack required by AD-Joins and A-Joins adds significant overhead to the structural joins. For example, in Table 3.4, in queries where all four join algorithms have comparable performance, the D-Join consistently outperforms all other types of skip joins, sometimes by a significant margin. This highlights the advantage of considering separate types of joins depending upon what part of the result is needed.

• In all queries except Q1, the binary skipping strategy is noticeably slower than the exponential skipping strategy. This is due to the O(log n) average case of the binary skipping strategy, from which the exponential skipping strategy does not suffer in practice. In query Q1, which has a result size of only 1, the superior performance of the binary skipping strategy can be attributed to the fact that it allows faster skipping through the sets. We do not expect this case to arise often in practice, and hence suggest that the exponential skipping strategy be used in all cases.

3.5.3  Summary

To summarise our experimental results, our proposed skip join algorithms performed very well for Q1, Q2, Q5, Q6, Q8 and Q12, where the result sets are small and there are large differences in size between the ancestor and descendant lists. Compared to the STJ-D algorithm, we were able to achieve a performance improvement of nearly three orders of magnitude for these queries. In general, the times for an A-Join or D-Join are faster than for the STJ-D algorithm, and in most cases faster than for an AD-Join. Therefore, we recommend that the query optimizer utilize A-Joins and D-Joins where possible. However, the AD-Join's performance is still comparable to the STJ-D join algorithm for most queries where large result sets are returned, and in the case of Q1, Q2, Q5, Q6, Q8 and Q12, an AD-Join was still able to outperform an STJ-D join by orders of magnitude. The experimental results also show that the exponential skipping strategy outperforms the binary skipping strategy, and therefore we adopt the exponential skipping strategy as the default for future implementations.

3.6  Label Maintenance for Changing Data

All structural join algorithms discussed in this chapter rely on a fast method of determining ancestor-descendant relationships. We have not yet discussed ways of providing these methods whilst efficiently handling updates. The work of Chapter 2 examined the issue of efficiently handling document ordering in the presence of updates, and briefly showed how this work could be applied to the ancestor-descendant problem. Of all the ancestor-descendant labeling schemes we discussed in Section 3.2.1, there are two strategies in particular, which we discuss here, that can handle updates efficiently. In the first strategy, we maintain both preorder and postorder identifiers for each node in the database. As described in Dietz's work [36], the ancestor-descendant relationship between two nodes can then be determined by comparing these: a node x is an ancestor of a node y if and only if pre(x) < pre(y) and post(x) > post(y). In the second strategy, each node is assigned a start and end identifier, such that node x is an ancestor of node y if and only if start(x) < start(y) and end(x) > end(y). While we described other methods in the related work, they all suffer from either excessively long labels or poor update performance. In contrast, these two methods have both good update performance (as shown in Chapter 2) and require only 2 log n bits per node (where n is the number of nodes in the database).
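The two strategies can be illustrated with a minimal sketch (illustrative only; the labels here are dense, whereas a dynamic database would leave gaps between adjacent labels to absorb updates):

```python
# Assign pre/post and start/end labels to a tree given as an adjacency map.
# (Illustrative sketch; identifiers here are dense, with no gaps for updates.)

def label(tree, root):
    pre, post, start, end = {}, {}, {}, {}
    counters = {"pre": 0, "post": 0, "tick": 0}
    def visit(v):
        counters["pre"] += 1; pre[v] = counters["pre"]
        counters["tick"] += 1; start[v] = counters["tick"]
        for c in tree.get(v, []):
            visit(c)
        counters["post"] += 1; post[v] = counters["post"]
        counters["tick"] += 1; end[v] = counters["tick"]
    visit(root)
    return pre, post, start, end

def anc_prepost(x, y, pre, post):    # strategy 1: two lists of size n
    return pre[x] < pre[y] and post[x] > post[y]

def anc_startend(x, y, start, end):  # strategy 2: one list of size 2n
    return start[x] < start[y] and end[x] > end[y]
```

On any tree the two predicates agree on every pair of nodes; the difference between the schemes lies purely in how the underlying label lists are maintained under updates.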


Figure 3.4: A complete tree (k = 3, d = 3). The numbers to the left of each vertex are the preorder identifiers; to the right are the postorder identifiers.

In this section, we will contrast these two ancestor-descendant labeling schemes in terms of update cost. We will perform this comparison using both an informal theoretical argument and experimental results.

3.6.1  Theoretical Analysis

Updates in either of these two strategies can be reduced to the order maintenance problem we studied in Chapter 2. For the preorder/postorder approach, we maintain two distinct lists of size n, where n is the number of nodes in the database. For the start/end approach, we maintain one distinct list of size 2n. We showed in the last chapter that the constant time solutions to the order maintenance problem can be outperformed by the O(log n) solutions in heavy read situations. Hence, it is likely that the logarithmic solution will be used in practice. As the O(log n) solution is likely to be used, it appears that the start/end approach is at a disadvantage to the preorder/postorder approach, due to the fact that it maintains a larger list. That is, the cost of maintaining the start/end list will be 2C log 2n for some constant C, whereas for the preorder/postorder lists it will be 2C log n. However, one possible disadvantage of the preorder/postorder approach is that it is not clear whether the updates in the preorder list will overlap with the updates in the postorder list. To be more precise, it is not clear whether the set of pages loaded in for the preorder relabeling will overlap with the set of pages loaded in for the postorder relabeling. We now provide an informal justification as to why performing preorder and postorder traversals at the same time will generally have less than twice the swapping penalty of performing one of the traversals. We will give our argument in terms of complete, uniform trees of depth d, where each non-leaf node has exactly k children, and where each level is completely filled with nodes. Such trees, where k is much larger than d, model real-world XML documents quite closely. Figure 3.4 gives an example of such a tree, including the two numeric identifiers attached to each node that are of interest to us: pre(x), the preorder identifier, and post(x), the postorder identifier.
We now give some properties of such trees that are useful in our justification.

Lemma 3.1  pre(x) − post(x) is invariant for all x on the same level of the tree.

Proof: Let x be any node on the level (except the last node of the level), and let y be its successor. In preorder traversal, all the nodes of x's subtree lie between x and y. In postorder traversal, all the nodes of y's subtree lie between x and y. Because the tree is complete, these two quantities are equal; call this common increment α. Hence, pre(y) = pre(x) + α and post(y) = post(x) + α, so pre(y) − post(y) = pre(x) − post(x).

Lemma 3.2  For the mth node x on the ith level, we have

    pre(x) = i + (m − 1) · (k^(d−i+1) − 1)/(k − 1).

Proof: The first node in each level must be the ith node in preorder traversal. Moreover, we saw in the proof of Lemma 3.1 that the increment α for pre(x) and post(x) between successive nodes on the same level is fixed. Hence, we have pre(x) = i + (m − 1)α, where x is the mth node on level i. Clearly, α is equal to the number of nodes in a subtree of depth d − i + 1, that is, α = (k^(d−i+1) − 1)/(k − 1). Hence, pre(x) = i + (m − 1) · (k^(d−i+1) − 1)/(k − 1).

Lemma 3.3  For the mth node x on the ith level, we have

    post(x) = m · (k^(d−i+1) − 1)/(k − 1).

k         d    E(|pre − post|)
2         2    1.33
2         10   12.26
10        10   16.20
100       2    1.98
1000      5    7.99
100,000   2    2.00

Table 3.5: Values of E(|pre − post|)

Proof: The first node on the level comes after all the nodes in its subtree, so its postorder identifier is (k^(d−i+1) − 1)/(k − 1). Hence:

    post(x) = (k^(d−i+1) − 1)/(k − 1) + (m − 1)α
            = (k^(d−i+1) − 1)/(k − 1) + (m − 1) · (k^(d−i+1) − 1)/(k − 1)
            = m · (k^(d−i+1) − 1)/(k − 1).

Lemma 3.4  For the ith level, pre − post = i − (k^(d−i+1) − 1)/(k − 1).

Proof: Let x be any node on the ith level, say the mth. Then:

    pre − post = pre(x) − post(x)                                        by Lemma 3.1
               = i + (m − 1) · (k^(d−i+1) − 1)/(k − 1) − m · (k^(d−i+1) − 1)/(k − 1)
                                                                         by Lemmas 3.2 and 3.3
               = i − (k^(d−i+1) − 1)/(k − 1).

We now consider the average absolute distance between the preorder and postorder identifiers. Since level i contains k^(i−1) of the (k^d − 1)/(k − 1) nodes in the tree:

    E(|pre − post|) = Σ_{i=1}^{d} [k^(i−1)(k − 1)/(k^d − 1)] · |i − (k^(d−i+1) − 1)/(k − 1)|

We tabulate the values of this quantity for various values of k and d in Table 3.5. Of greatest interest in the table are those values for which k is large and d is very small, because these correspond to the topology of typical real-world XML documents, such as DBLP. As can be seen from the table, in these cases the difference is very small. We are now in a position to give an informal justification of the original claim, that simultaneously updating the preorder and postorder lists will not be twice as expensive as updating one of them. Let us consider the situation where we are inserting a node into an XML document, and this insert requires relabeling npre surrounding nodes in the preorder list, and npost surrounding nodes in the postorder list. While npre and npost may differ, it is a property of many ordering algorithms, such as those of Bender et al [11], that the expected number of relabelings is bounded in the amortized worst case, with the bound depending on the variant employed. Hence, we can expect that on average npre ≈ npost, and we assume below that they are equal and refer to the quantity simply as n. We can then restate our claim as follows: if we access a contiguous sub-list of n nodes in the preorder traversal, how many of these also lie "near" the contiguous sub-list of n nodes in the postorder traversal, starting from the same node? We define "near" as being close in terms of the preorder traversal, since that is often how nodes are clustered in an XML database. If the proportion that are close together is high, then we would expect that doing the traversals together would involve strictly less than twice the

number of disk accesses that would be required if the proportion is low (because the nearer two nodes are to each other, the greater the chance that they lie on the same page). We now argue that the proportion is, on average, quite high. We have shown above that for complete trees with fanout k and depth d, the quantity E(|pre − post|) is generally small in the cases we are interested in. In fact, although it is omitted from this chapter, a similar result also holds for complete trees where the fanout is fixed on each level, e.g., the first level has fanout k1, the second level has fanout k2, etc. These kinds of trees very closely model real-world XML repositories such as DBLP and MEDLINE. Hence, if a node lies in position pre(x) in the preorder traversal of the database, we expect the node to have a position somewhere near pre(x) in the postorder traversal. Thus, we conclude that it is likely that a large proportion of the nodes accessed during a preorder traversal will also be accessed during a postorder traversal. It is also intuitively clear that as the number of nodes n being relabeled increases, the proportion increases as well. It is important to emphasize at this point that when we refer to pre(x) and post(x), we are not talking about the numeric tag assigned by the document ordering algorithm. While such tags are indeed ordered consistently, gaps of arbitrary size may lie between adjacent tags, and hence the results above do not apply to them. Instead, pre(x) refers to the absolute position of the node in preorder traversal, when taking a snapshot of the (possibly dynamic) database, and likewise for post(x). As long as this static snapshot can be reasonably approximated by a complete tree of small depth, which we assert happens frequently in practice, our result should hold quite closely.
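The closed form for E(|pre − post|) is easy to check numerically; the following sketch recomputes the entries of Table 3.5 directly from Lemma 3.4:

```python
# Recompute E(|pre - post|) for a complete k-ary tree of depth d.  Level i
# holds k^(i-1) of the (k^d - 1)/(k - 1) nodes, and by Lemma 3.4 every node
# on level i has pre - post = i - (k^(d-i+1) - 1)/(k - 1).

def expected_abs_diff(k, d):
    total = (k**d - 1) // (k - 1)          # number of nodes in the tree
    return sum(
        (k**(i - 1) / total) * abs(i - (k**(d - i + 1) - 1) // (k - 1))
        for i in range(1, d + 1)
    )
```

For instance, it reproduces 1.33 for k = 2, d = 2 and 1.98 for k = 100, d = 2, matching Table 3.5.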
To summarize, we have shown that the disk accesses required for updating the preorder and postorder lists are largely shared between the two lists, and this sharing reduces the total update cost of this ancestor-descendant scheme. This reduction, coupled with the extra update expense required by the start/end scheme when using an O(log n) update algorithm, strongly suggests that a preorder/postorder scheme is the better strategy for use in large XML databases.

3.6.2  Experimental Justification

In order to verify the above argument, we ran several experiments on real-world and synthetic data sets. Our experiments were run on the same hardware as the experiments of Section 3.5. The data sets we used were DBLP, MEDLINE, and XMark. In our experiments, we created an in-memory XML database from each of the data sets, maintaining order information incrementally. We tested the two diﬀerent ancestor-descendant schemes (start/end and pre/post), and used the O(log n) algorithm of Bender et al to maintain the tag values. We did not use the O(1) variant due to the reasons outlined in Chapter 2; similarly, we did not test our improvements to this algorithm suggested in Chapter 2 because they do not impact signiﬁcantly on the purpose of these experiments. In our ﬁrst experiment, we timed how long it took to create each database. These results are summarized in Figure 3.5, which clearly demonstrates that pre/post takes less time than start/end. The reason for this was outlined above: maintaining ordering information for a database of n nodes using a start/end labeling scheme requires running the order maintenance algorithm on a list of 2n nodes, whereas using a pre/post labeling scheme requires two lists of n nodes. As the algorithm is O(log n), it is clear that pre/post should perform better. Our second experiment was to verify the informal argument above that, in a page-based database system clustered in document order, maintaining both preorder and postorder identiﬁers results in many pages being hit twice by the maintenance algorithms, and hence fewer pages will be loaded into memory. We emulated this scenario by assuming that each page holds k nodes of the XML document (where k was varied between 100 and 1000), clustered in document order; as the ordering information was maintained, we counted the number of pages which were accessed both during the preorder traversal and the postorder traversal. 
The results of this second experiment are given in Figures 3.6 through 3.8. Each graph shows the average percentage of pages accessed by the postorder traversal that were shared with the preorder traversal, and the percentage that were not. As can be seen in the figures, the proportion of pages accessed by the postorder traversal that were also accessed by the preorder traversal is always over 60% for real-world data, and as the page size increases, the percentage of shared pages increases. This confirms our argument above, and indicates that updating both preorder and postorder labels at the same time generally does not incur significantly greater cost than updating only preorder labels.
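The emulation itself can be reproduced with a small simulation (a sketch under the same simplifying assumptions: pages hold a fixed number of nodes clustered in document order, and the relabeling windows in the two lists have equal length; the tree and parameter values used below are hypothetical):

```python
# Simulate the page-sharing experiment: nodes are packed page_size per
# page in document (preorder) order, and we compare the pages touched by
# equal-length relabeling windows in the preorder and postorder lists.

def traversals(tree, root):
    pre_seq, post_seq = [], []
    def visit(v):
        pre_seq.append(v)
        for c in tree.get(v, []):
            visit(c)
        post_seq.append(v)
    visit(root)
    return pre_seq, post_seq

def shared_fraction(tree, root, node, window, page_size):
    """Fraction of pages touched by the postorder window starting at
    `node` that are also touched by the preorder window at `node`."""
    pre_seq, post_seq = traversals(tree, root)
    page = {v: i // page_size for i, v in enumerate(pre_seq)}
    pre_pages = {page[v] for v in pre_seq[pre_seq.index(node):][:window]}
    post_pages = {page[v] for v in post_seq[post_seq.index(node):][:window]}
    return len(post_pages & pre_pages) / len(post_pages)
```

Running this over the complete trees of Section 3.6.1 with large fanout and small depth gives high shared fractions, in line with the measured 60%+ figures.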

[Figure: bar chart "Insertion Times for Different Ancestor-Descendant Schemes"; y-axis: Time (s), 0–400; x-axis: Data Set (DBLP, MEDLINE, XMark); series: Start/End, Pre/Post]

Figure 3.5: Database creation times using the O(log n) order maintenance algorithm of Bender et al.

[Figure: plot "Page Distribution for DBLP"; y-axis: Percentage (%), 0–1; x-axis: Page Size (k), 100–1000; series: % Shared Pages, % Extra Pages]

Figure 3.6: Page distribution for DBLP.


[Figure: plot "Page Distribution for MEDLINE"; y-axis: Percentage (%), 0–1; x-axis: Page Size (k), 100–1000; series: % Shared Pages, % Extra Pages]

Figure 3.7: Page distribution for MEDLINE.

[Figure: plot "Page Distribution for XMark"; y-axis: Percentage (%), 0–1; x-axis: Page Size (k), 100–1000; series: % Shared Pages, % Extra Pages]

Figure 3.8: Page distribution for XMark.


3.7  Conclusions

This chapter has focused on improving the algorithms for the structural join, a core operation in XML query processing. We presented a simple, yet efficient, improvement to the work of Al-Khalifa et al [7], which skips unnecessary nodes in the ancestor and descendant lists. In contrast to work like [23], our method does not require any auxiliary index structures and hence is significantly easier to maintain. It can also be implemented in non-database applications such as an XSL processor, which would not normally have a built-in B-tree index, and it has applications to streaming XML data. Furthermore, an informal justification of the effect of updates on the structural join problem has been presented. Since the ordering scheme we used in this chapter is based on preorder and postorder identifiers, the update cost is identical to that of the analyses and experiments performed in [11] and the work of Chapter 2. By employing gaps in a theoretically sound fashion, the amortized update cost is much lower than the update cost of other tree-based labeling schemes such as [57]. Finally, extensive experiments on both real-world and synthetic data have shown that our extension improves the performance of the state-of-the-art structural join algorithm [7] by orders of magnitude in the best case, with only a minor penalty in the worst case. We believe that there is still a wide range of interesting research problems in this area. Our work has demonstrated that each join algorithm has different strengths, and hence it would be useful to determine before the join which algorithm to use. This could perhaps be achieved by using sampling techniques to infer the structure of the two input sets. Such a technique would be extremely useful in the context of integrating our work into a complete XML query processor, which remains a challenging task.


Chapter 4

Dynamic Selectivity Estimation

I never guess. It is a shocking habit destructive to the logical faculty. — Sir Arthur Conan Doyle (1859 – 1930)

4.1  Introduction

An important component of any database system is the query optimizer, which is responsible for choosing an efficient evaluation plan for each query. One of the key decisions the optimizer must make is to choose the most efficient combination of join operations to evaluate the plan. For XML databases, as we have seen in Chapter 3, one of the most important join operators is the structural join operator, and query optimization in the presence of this operator has already been considered in the literature using dynamic programming [97]. The cost function for a query plan, however, depends crucially on the sizes of the sets being joined, and hence access to accurate compile-time estimates of the sizes of the input sets to each operator is essential to intelligent plan selection. These estimates are computed using selectivity estimation, a well-researched topic for relational databases. As an example of the importance of reliable selectivity estimates, consider the evaluation of the path query //a//b[.//c]. We listed a few possible evaluation plans for this query in Section 3.2.2. Here we compare when one might choose between two of these possible plans:

• π_d(A ⋈ π_a(B ⋈ C)): This plan is suitable when the selectivity of the query //b//c is small, because the second join will have one small input set, which will significantly speed up the join.

• π_a(π_d(A ⋈ B) ⋈ C): This plan is suitable when the selectivity of the query //a//b is small, for the same reason as described above.

Making an incorrect choice between the two plans listed above would yield unnecessarily slow query execution times. For instance, suppose the query //a//b had huge selectivity, but the query //b//c had low selectivity. In this situation, choosing the second execution plan would be inefficient, because the result of A ⋈ B would be large. It would be much more efficient to first join B ⋈ C, which produces a small result set, and thereby speed up the second join.
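A toy cost model illustrates the point (all sizes and estimates below are hypothetical, and cost is crudely taken as the sum of the input sizes of the two joins, with intermediate result sizes supplied as selectivity estimates):

```python
# Toy cost model for the two plans of //a//b[.//c].  Cost is crudely the
# sum of input sizes of the two joins; est_ab and est_bc are hypothetical
# selectivity estimates for //a//b and //b//c respectively.

def plan_costs(size_a, size_b, size_c, est_ab, est_bc):
    cost1 = (size_b + size_c) + (size_a + est_bc)  # pi_d(A join pi_a(B join C))
    cost2 = (size_a + size_b) + (est_ab + size_c)  # pi_a(pi_d(A join B) join C)
    return cost1, cost2

# //b//c is highly selective (small output), //a//b is not:
c1, c2 = plan_costs(size_a=10_000, size_b=50_000, size_c=20_000,
                    est_ab=45_000, est_bc=100)
```

Here c1 < c2, so the first plan should be chosen; swapping the two estimates flips the conclusion, which is exactly why the optimizer needs accurate selectivity figures at compile time.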
Clearly, in order to make an informed choice between the two plans (and the other possibilities not listed), the query optimizer must have access to accurate estimates of query selectivities. There has been a reasonable amount of work on selectivity estimation for XML databases [3, 21, 45, 72, 84, 85]. However, all existing techniques build a synopsis structure over a static XML database. The issue of maintaining these synopses over dynamic databases has not yet been investigated; simply rebuilding the synopsis periodically can be very inefficient, and leads to inaccurate estimates if the data changes wildly between rebuilds. Moreover, the construction algorithms of all existing techniques require multiple passes over the database, which makes synopsis construction expensive for very large databases. In this chapter, we provide some preliminary research on selectivity estimation for dynamic XML databases, which appears to be an extremely difficult problem. Our contributions are:

• We show that selectivity estimation for path expressions is closely related to the frequent itemsets problem in data mining, which is difficult to solve quickly. Hence, we suspect that providing theoretical guarantees on the quality of the synopsis structure will be extremely difficult to achieve in practice, and ad hoc methods may be needed instead.

• We discuss some of the problems with generalizing graph-based synopsis structures, which have proven to be the most accurate in the literature. The fact that there are no obvious techniques for generalizing these synopses indicates the difficulty of developing even ad hoc techniques for solving the general selectivity estimation problem.

• With the above difficulties in mind, we adapt recent results from the data stream literature, and the Markov table method of Aboulnaga et al [3], to handle dynamic data, thus yielding the first dynamic selectivity estimation technique for simple XML path expressions. However, even in this case we are unable to give absolute quality guarantees in the general case, and hence we evaluate our algorithms empirically.

In this chapter, we first cover work related to selectivity estimation in Section 4.2. We then establish the correspondence between selectivity estimation for path expressions and the frequent itemsets problem in Section 4.3. Due to this negative result, we then investigate some ad hoc estimation techniques, which utilize several results from data streams, covered in Section 4.4. Using these results, we generalize some existing techniques for selectivity estimation for simple path expressions in Section 4.5, and present an empirical analysis of these techniques in Section 4.6. Finally, we conclude the chapter in Section 4.7.

4.2 Related Work

4.2.1 Selectivity Estimation for Relational Databases

Selectivity estimation in the context of relational databases is a well-researched area that continues to provide significant challenges. Several works [24, 60] have demonstrated that accurate selectivity estimates are critical for good query plan selection. The most common way of performing selectivity estimation in relational databases is through the use of histograms, which divide the value domain into a set of buckets and maintain frequency counts for the buckets. While other techniques have been considered in the literature, such as sampling [73], in recent years histograms have dominated research into selectivity estimation, due to their simple design and low impact upon query evaluation. Various types of histograms have been considered, including the equi-depth histogram (where every bucket has the same frequency count) and the equi-width histogram (where every bucket represents a range of values of fixed size). Various studies have investigated the trade-offs between histogram size, bucket partitioning techniques, and optimality [61, 62, 86], finding that good approximations can be achieved in small space. For most real-world data, equi-depth histograms have proven to be very effective at approximating single-attribute selectivity.

Initial attempts to estimate multi-dimensional selectivity used the attribute value independence assumption. This assumption is frequently violated in practice, and hence new techniques were developed for multi-dimensional histograms, which are significantly more challenging than single-attribute histograms due to the "curse of dimensionality". Several papers [80, 87] have extended traditional partition-based histograms to the multi-dimensional case. Credible alternatives to traditional histograms have also arisen, such as the technique of Matias et al [75], which uses wavelet decomposition and yields good results.
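The histogram machinery is easy to make concrete. The following sketch builds an equi-depth histogram over a single numeric attribute and estimates range selectivity under the usual assumption that values are spread uniformly within each bucket; the function names and bucket representation are our own illustration, not taken from any cited system.

```python
def build_equi_depth_histogram(values, num_buckets):
    """Partition the sorted values into buckets of (roughly) equal
    frequency; return a list of (low, high, count) bucket summaries."""
    values = sorted(values)
    bucket_size = max(1, len(values) // num_buckets)
    buckets = []
    for i in range(0, len(values), bucket_size):
        chunk = values[i:i + bucket_size]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

def estimate_range_selectivity(buckets, lo, hi):
    """Estimate how many values fall in [lo, hi], assuming values are
    spread uniformly within each bucket (the standard assumption)."""
    estimate = 0.0
    for b_lo, b_hi, count in buckets:
        if b_hi < lo or b_lo > hi:
            continue  # bucket is disjoint from the query range
        width = max(b_hi - b_lo, 1e-9)
        fraction = (min(hi, b_hi) - max(lo, b_lo)) / width
        estimate += count * max(0.0, min(1.0, fraction))
    return estimate
```

The equi-width variant differs only in how bucket boundaries are chosen; the estimation step is identical.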
A similar technique, instead using the discrete cosine transform, was presented by Lee et al [67]. More recently, Getoor et al [46] have used Bayesian networks to capture the most important correlations between attributes.

The problem of maintaining histograms as the underlying data changes has begun to receive a large amount of attention recently. Gibbons et al [49, 50] presented a new method for maintaining histograms incrementally over dynamic data. Their method makes use of a backing sample of the underlying database, which is a small random selection of the database. Once the error of the histogram reaches an unacceptable level, the histogram is reconstructed from the backing sample. As the backing sample is much larger than the histogram, it is generally stored on disk, which can be a disadvantage. Maintenance of compressed histograms, which maintain buckets for the k largest items, plus one bucket for the remainder, has also been investigated [48]. Matias et al [76] have shown how to maintain histograms using their wavelet-based

techniques. Gilbert et al [51] gave a complicated sketch-based approach which can be updated with strict guarantees on the quality of the histogram. An alternative approach to maintaining histograms is to adapt the histograms to feedback from the query processor. This has the advantage of refining the histogram in places where queries frequently occur. To this end, Aboulnaga and Chaudhuri [4] used a split-and-merge technique, similar to that of [50]. Bruno et al [13] have developed a similar adaptive system for multi-dimensional histograms.

We have given only a tiny sample of the vast literature on selectivity estimation in relational databases. While some techniques may be borrowed from the relational domain, there is no clear generalization to the more complicated problem of XML selectivity. In particular, the fact that selectivity estimation over relational databases is generally confined to numeric or string values, without any structural conditions, makes the task considerably easier. In fact, because of the additional structural conditions, XML selectivity estimation is very similar to estimating the size of joins, which is a very hard problem [5, 20].

4.2.2 Selectivity Estimation for XML Databases

Selectivity estimation for static XML documents has been studied extensively in recent years. The literature can be roughly divided into two main areas: selectivity estimation over simple path queries, and selectivity estimation over branching path queries.

Aboulnaga et al [3] were the first to consider the selectivity estimation problem for simple path queries. They proposed two different synopsis structures. The first, a pruned path tree, was constructed by pruning low-frequency paths from the tree of all root-to-leaf paths in the document. The second, a Markov table, treated paths as being generated by a Markov process of order k. By storing the selectivity of all high-frequency paths of length up to k, the selectivity of longer paths could be approximated using a formula based upon the Markovian assumption. The paper investigated several different pruning strategies, and found that the Markov table structure was superior to the pruned path tree in practice. The disadvantage of both approaches is that the entire structure must be constructed before it can be pruned, which can be very expensive in terms of space. Also, the experimental results showed that different structures performed better under different query workloads, and there was no single clearly dominant performer.

XPathLearner [72] used the idea of a Markov table to consider the problem of adaptive selectivity estimation: instead of accessing the base data to infer selectivity, XPathLearner learned the selectivity through feedback from the query processor. The disadvantages of this approach are that it used a simplistic model for value selectivity, and that its adaptive nature meant inaccurate results were given while the estimates converged.

The first paper to study the problem of selectivity estimation for more complicated queries was that of Chen et al [21], which studied the problem of estimating the number of matches of a twig query against a tree-structured database.
Their work used a suffix tree over the set of path expressions, which was then pruned until it was the appropriate size. The disadvantages of their approach are that, again, the whole suffix tree must be constructed before it is pruned, and that their method does not generalize to handle the descendant operator of XPath.

Freire et al [45] recently implemented a system, StatiX, which handles selectivity estimation in the presence of an XML Schema. Their work builds a histogram of data values for each element type in the schema. These histograms are constructed upon the object identifiers of the objects in the database. Thus, the quality of their estimation is highly dependent upon the identifiers assigned to the objects, and as the distribution of these identifiers depends upon the order in which nodes are inserted into the database, it is difficult to guarantee good results. Moreover, the scheme is not designed to work within a fixed amount of space; instead, one histogram is assigned to each element type. This means that the same amount of space is used for each type, which may not be the most efficient allocation of the available space.

Research on the estimation of result sizes for the structural join operator, such as [94, 96], is also relevant to selectivity estimation for path expressions, since it estimates selectivity for paths of the form //p1//p2. However, it is not clear how these results can be generalized to handle more complicated path expressions. Furthermore, work such as that of Halverson et al [58] shows that the structural join is not the only useful XML processing operator, and hence considering selectivity for more general path expressions is useful. Recently, Polyzotis and Garofalakis [84, 85] have developed an elegant, general framework, XSketch,

which provides good estimates for far more complicated path expressions on general graph-structured databases. The authors, on the basis of some well-specified assumptions, present an algorithm to refine their graph synopsis so that it is more focused on places where the assumptions fail. The primary shortcoming of this method is that the construction process for the synopsis is particularly expensive.
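As an illustration of the Markov table method of Aboulnaga et al discussed earlier, the sketch below estimates the selectivity of a simple path from an order-2 table of tag and tag-pair frequencies. The function name and table layout are illustrative only, not the data structures of [3], and the pruning of low-frequency entries is omitted.

```python
def markov_estimate(freq, path):
    """Estimate the selectivity of a simple path /t1/t2/.../tn from an
    order-2 Markov table `freq`, which maps length-1 and length-2 tag
    tuples to their frequencies in the document. Under the Markovian
    assumption: f(t1/../tn) ~= f(t1,t2) * prod_i f(ti,ti+1) / f(ti)."""
    tags = path.strip("/").split("/")
    if len(tags) == 1:
        return float(freq.get((tags[0],), 0))
    est = float(freq.get((tags[0], tags[1]), 0))
    for i in range(1, len(tags) - 1):
        single = freq.get((tags[i],), 0)
        if single == 0:
            return 0.0  # an unseen tag implies an empty result
        est *= freq.get((tags[i], tags[i + 1]), 0) / single
    return est
```

Paths of length at most k are answered exactly from the table; only longer paths pay the cost of the Markovian approximation.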

4.3 The Difficulty of Selectivity Estimation for Path Expressions

In this section, we outline the difficulties faced when handling dynamic selectivity estimation. Our arguments suggest that it is not worth searching for algorithms with strong theoretical guarantees for this problem, because such algorithms are either impossible to construct, or at the very least have very high runtime overheads. We first define the problem we face more precisely:

Definition 4.1 (The XML Selectivity Estimation Problem) Consider an initially empty XML tree (containing only a single node, the "virtual" root), and a stream of events E1, E2, ... upon this tree, where each event Ei is either:

1. the insertion of a new node, of type pn, as a child of some existing node in the database with root-to-leaf path /p1/p2/.../pn−1; or

2. the deletion of a leaf node in the database with root-to-leaf path /p1/p2/.../pn.

Given such an XML database D, construct a sketch S of D in small space (preferably sublinear in |D|), such that:

• when the underlying database D changes, the corresponding updates to the sketch structure can be performed quickly (preferably in time polylogarithmic in |D|); and

• at any point, for any query over the database D, we can use S to quickly estimate (again, preferably in time polylogarithmic in |D|) the size of the result of the query, as well as obtain error bounds on this estimate.

This is the most general definition of the problem. When discussing queries over an XML database, we will in this section restrict ourselves to XPath expressions — even for this relatively simple language the problem is quite difficult. It is important to note that, in our formulation of the problem, when considering each insertion or deletion, the only information we have available to us is the root-to-leaf path giving the location of the event.
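As a concrete illustration of this formulation (assuming a SAX-style, event-driven parse; the handler below is our own minimal example, not part of any cited system), the root-to-leaf path of each inserted node can be maintained with a stack whose size is linear in the document depth:

```python
import xml.sax

class PathTracker(xml.sax.ContentHandler):
    """Record the root-to-leaf path of every inserted element, using a
    stack whose size is linear in the depth of the document."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags on the path from the root to the current node
        self.paths = []   # root-to-leaf path observed at each element start

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.append("/" + "/".join(self.stack))

    def endElement(self, name):
        self.stack.pop()

tracker = PathTracker()
xml.sax.parseString(b"<r><t><a/></t><t><b/></t></r>", tracker)
```

Each entry of `tracker.paths` is exactly the information available to the sketch at the corresponding insertion event.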
This information is generally easy to obtain; for instance, if we are bulk-loading an XML document using a SAX parser, the root-to-leaf path of the current node can be maintained using a stack linear in the depth of the document. Moreover, because we require our solution to use only a small amount of space and to allow efficient queries, once we have considered a node it is unlikely that we will go back at a later time to reconsider it. This restriction makes the problem significantly more challenging than the static selectivity problem, where we can visit each node multiple times if necessary.

We will now demonstrate a link between selectivity estimation for path expressions and the frequent itemsets problem, one of the fundamental data mining problems. As this is an especially difficult problem to solve efficiently, and (as we will show) selectivity estimation for path expressions is at least as hard, it is unlikely that efficient algorithms exist for estimating selectivity over path expressions. The frequent itemsets problem was first introduced in a seminal paper by Agrawal et al [6], as the main subproblem of mining association rules. Specifically:

Definition 4.2 (The Frequent Itemsets Problem) Let U be the universe of all items, and define a transaction to be a subset of U. Given a set of transactions, the frequent itemsets problem is to find all subsets X ⊆ U which occur in at least a fraction s of all transactions. The parameter s is the support threshold required for an itemset to be deemed frequent.

The fastest current (approximate) algorithm for the frequent itemsets problem is that of Manku and Motwani [74], which is also the only one-pass algorithm for the problem. However, the algorithm has quite a heavy implementation, and relies on many system-level optimizations; even then, the algorithm is not

[Figure 4.1: The correspondence between the frequent itemsets problem and path selectivity estimation. (a) A set of transactions: T1 = {a}, T2 = {a, b, c}, T3 = {b, c}, ...; (b) one possible XML tree; (c) another possible XML tree.]

particularly fast. This is not a problem for data mining, given that frequent itemsets are only collected once; but if we were to apply this technique to selectivity estimation, it would have a significant impact on the runtime performance of the database. The fact that the frequent itemsets problem has spurred so much research over the past decade underscores its difficulty. Moreover, current solutions to the frequent itemsets problem do not consider deletions at all, because deletions are generally unimportant for data mining purposes.

We will show below that selectivity estimation for path expressions is at least as difficult as the frequent itemsets problem, by demonstrating a correspondence between the two problems for various expressive fragments of XPath. We first consider the correspondence for simple path expressions which support / (navigation from a parent node to a child node) and [ ] (branching path expressions). An example of the mapping is given in Figure 4.1: Figure 4.1(a) gives a sample set of transactions, and Figure 4.1(b) gives the XML document we transform this set of transactions into. More precisely, we model each transaction as a simple two-level tree, with the items in the transaction listed as the children. We then combine all these transactions into one XML document.

Now suppose we have the ability to estimate the selectivity of path expressions including / and [ ]. Then we could find a solution to the frequent itemsets problem as follows. If we wish to determine the number of transactions in which an itemset X = {x1, x2, ..., xn} occurs, we could simply find the selectivity of the path expression /r/t[x1][x2]...[xn] over the XML document constructed above. This equivalence is clear in Figure 4.1. Hence, we could extract from any sketch structure for path selectivity the set of frequent itemsets for a set of transactions, by simply returning the set of most frequent paths of the required form.
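The encoding behind this reduction can be sketched as follows. The element tags mirror Figure 4.1, while the function names are invented for the example, and the branching-path count is computed directly over the tree rather than through an XPath engine:

```python
import xml.etree.ElementTree as ET

def encode_transactions(transactions):
    """Encode transactions as in Figure 4.1(b): one 't' child of the
    root 'r' per transaction, with the transaction's items as children."""
    root = ET.Element("r")
    for items in transactions:
        t = ET.SubElement(root, "t")
        for item in items:
            ET.SubElement(t, item)
    return root

def branching_path_selectivity(root, itemset):
    """Selectivity of /r/t[x1][x2]...[xn]: the number of 't' nodes with a
    child for every item of the itemset, i.e. the itemset's support."""
    needed = set(itemset)
    return sum(1 for t in root.findall("t")
               if needed <= {child.tag for child in t})

# The transactions of Figure 4.1(a)
doc = encode_transactions([["a"], ["a", "b", "c"], ["b", "c"]])
```

A path-selectivity sketch that answers such branching queries accurately would therefore also report itemset supports accurately.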
In fact, we can show a correspondence for an even simpler fragment of XPath, namely simple path expressions which only allow / and // (navigation from an ancestor node to a descendant node), but not [ ]. The correspondence for this case is demonstrated in Figures 4.1(a) and 4.1(c). Instead of listing the n items in the transaction as direct children, we first sort them relative to some arbitrary total order on U, and then create a tree of the items of depth n, such that x is an ancestor of y in the tree if and only if x comes before y in the total order. For a given itemset X = {x1, x2, ..., xn}, the corresponding path expression is then /r/t//x1//x2//...//xn, where the items in X are sorted according to the total order.

Finally, we briefly outline a correspondence for the simplest possible fragment of XPath, consisting of path expressions allowing only /. In this case, we can represent each transaction T as a suffix trie consisting of all subsets of T. More precisely, we represent each transaction T by a tree such that there is a one-to-one correspondence between each subset of T (where the subsets are again sorted according to some total order) and the root-to-leaf paths of the tree. We then combine all these transactions into one document as above, and use the query /r/t/x1/x2/.../xn to determine the selectivity for a given itemset X = {x1, x2, ..., xn}. As the corresponding XML document is considerably larger in this case, even for small examples, we do not give a graphical example.

We note two interesting facts about the correspondences given above. For the first correspondence, if we restrict the fanout of the XML documents we are considering to some fixed f, then the correspondence fails for itemsets larger than f. Similarly, for the second correspondence, if we fix the depth of the document to some fixed constant, the correspondence fails for large itemsets. Hence, if we were to consider selectivity over bounded trees, the problem might be considerably easier. However, it is not clear that such restrictions are desirable in practice, particularly restricting fanout.

These correspondences indicate that we should search for ad hoc solutions to path expression selectivity. It should be noted that all current fixed-space solutions in the literature for the static XML selectivity problem are in fact ad hoc. However, the lack of quality guarantees is significantly more of an issue in the dynamic case than in the static case, because of the potential for inaccuracies to compound gradually over time.

4.4 Relevant Results from Data Stream Research

In this section, we describe solutions for two problems which we will face: the distinct values problem and the frequent items problem. Before presenting the best known solutions to these problems, we give the model of a data stream that we will use:

Definition 4.3 (Data Stream) A data stream of n objects is a sequence S = x1, x2, ..., xn, where each xi ∈ V = {v1, v2, ..., vm}. We write ni for the number of occurrences of vi in S.

In the above definition, we have made V finite. This is in fact an overly weak condition: V could potentially be infinite without affecting the definition, because at any point in time we will only have seen a finite subset of it, and this subset can replace the infinite V in the definition above.

4.4.1 The Distinct Values Problem

We now discuss in detail the available solutions in the literature for the distinct values problem, which we will make use of in later sections. The distinct values problem can be stated simply:

Definition 4.4 (The Distinct Values Problem) Given a data stream S over a set of values V, determine the cardinality of V; that is, determine the number of distinct values present in the stream.

The first thing to note about this problem is that it is impossible to find an algorithm which solves the problem exactly in space sublinear in m (the size of V); in other words, no algorithm is better than the naive solution of maintaining a hash table of counters.

Lemma 4.1 Any algorithm which can compute the exact number of distinct elements requires Ω(m) space.

Proof: Consider any algorithm which solves the distinct values problem exactly, and use it as an oracle. Let x be some positive integer, and let P be the set of bit positions in the binary representation of x that are set. First, give the oracle the set P as input. We now show that it is possible to reconstruct the value x from the data stored by the oracle; since x is arbitrary, the oracle's state must distinguish all possible values of x, and hence the minimum amount of space used must be Ω(m). To reconstruct x, we reconstruct P, from which x is easily obtained. First, ask the oracle for the number of distinct values, i.e., |P|. Then, for each i ∈ {1, ..., ⌈log x⌉}, give i to the oracle as additional input. After every such insertion, query the number of distinct values again. If the number of distinct values has increased, then x has a zero in position i; otherwise, x has a one in position i, and hence i ∈ P. (In fact, it can be shown that even if the oracle is allowed to return an incorrect value with some probability, it still requires Ω(m) space.)

Thus, we are forced to turn to finding approximate solutions to this important problem.
As with many data stream problems, there are two main approaches to this problem in the literature: sampling and sketch-based techniques. Sampling-based methods [15, 17, 19, 56] suffer from inaccurate estimates, and Charikar et al [17] gave high lower bounds on the proportion of the database which must be sampled for accurate estimates. Moreover, sampling-based methods are not amenable to changes in the underlying database.

Sketch-based techniques use a summary structure which is incrementally modified upon insertions and deletions, and have proven to be more successful. Flajolet and Martin [42, 43] gave the first such synopsis structure, which assumes the existence of perfect hash functions; Alon et al [8] relaxed this requirement with a simple modification to the algorithm. Bar-Yossef et al [10] gave a proof of the space requirements for guaranteed error rates for Flajolet and Martin's algorithm, as well as a few other algorithms with improved space bounds. Gibbons [47] has also used hashing techniques to estimate the number of distinct elements in a stream. Recently, Durand and Flajolet [38] extended the work of Flajolet and Martin, solving the distinct values problem with insertion cost O(log log n) and high accuracy. In practice, this appears to be a very fast algorithm. However, in all of the above works, the generalization to deletions is either not easy, or increases the space and time complexity by a reasonably high constant factor.

A completely different sketch-based algorithm has been developed by Cormode et al [27, 28], which is the one we will use in later sections, as it is more accurate in practice and extremely simple to implement. Its main disadvantage is speed: in [28], it is found to be a factor of six or seven slower than the algorithm of Flajolet and Martin. The authors suggest a few methods of speeding up the algorithm; in any case, the performance is still quite acceptable. The basic idea behind the algorithm relies on the concept of a stable distribution:

Definition 4.5 (p-Stable Distribution) A p-stable distribution is a distribution such that if X1, ..., Xn are independent random variables with p-stable distributions, then a1·X1 + ... + an·Xn is distributed as (Σi |ai|^p)^(1/p) · X0, where X0 is also a random variable with a p-stable distribution.

The most obvious example of a stable distribution is the Gaussian distribution, which is 2-stable.
More details about stable distributions can be found in [83]. The algorithm relies on the properties of a stable distribution to estimate the Hamming norm of a vector (an equivalent problem to the distinct values problem). Consider the vector n = (n1, n2, ..., nm), which encodes the counts of all values in V. The Hamming norm of this vector (the number of non-zero entries) is exactly the number of distinct elements seen. If we take the convention that 0^0 = 0, then we can write the Hamming norm as |n|_H = Σi |ni|^0. The approach of [28] is to approximate this by Σi |ni|^p for a very small p > 0.

This quantity looks remarkably similar to the defining property of p-stable distributions. To see the connection, consider a vector r = (r1, r2, ..., rm) with values drawn from a p-stable distribution. Then, from the definition of p-stable distributions, the dot product r · n is a random variable distributed as (Σi |ni|^p)^(1/p) · X, where X has a p-stable distribution. Hence, |r · n|^p gives an approximation (modulo a scaling factor) to Σi |ni|^p.

There are several remaining practical considerations. To improve accuracy, we actually perform this procedure m′ times independently, and take the median result. However, storing the O(m·m′) values of the m′ random vectors is overly expensive. Instead, we use a pseudo-random generator to generate the random values, using the values from V as seeds (the work of Indyk [59] shows that this does not affect the quality of the final result). The complete algorithm, including the procedure for incremental updates, is given in Algorithm 4.1. As can be seen, it is extremely simple, and can be implemented almost directly from the pseudocode. The most expensive function in the algorithm is PSTABLE-RANDOM, due to the large number of calls to transcendental functions.
To reduce this cost, Cormode et al [28] suggest precomputing values of PSTABLE-RANDOM for pairs of values from [0, 1], at the expense of additional space. Alternatively, Cormode et al [27] provide an approximate technique which generates good results as p → 0. Nevertheless, it should be kept in mind that while we use this algorithm in this chapter due to its high accuracy, other algorithms can be used in its place, such as that of Flajolet and Martin [43] or Durand and Flajolet [38]. We summarize the correctness of this approach in the following theorem, taken from [28]:

Theorem 4.1 With space O(1/ε² · log(1/δ)), Algorithm 4.1 computes an approximation of the number of distinct values to within a factor of 1 ± ε with probability 1 − δ, with processing time O(1) for each insertion and deletion (for fixed ε and δ).

4.4.2 Finding High Frequency Items

Another problem which we will confront in our selectivity estimation algorithm is the frequent items problem.

Algorithm 4.1: Maintenance of the number of distinct values seen.

INITIALIZE(m)
    // Initialize the sketch structure, a vector v of m floating point numbers
    for i ∈ {1, ..., m} do
        v[i] ← 0.0

INSERT(i, δ, p)
    // Update the count for value i (δ may be negative)
    SET-RANDOM-SEED(i)
    for j ∈ {1, ..., m} do
        r ← PSTABLE-RANDOM(p)
        v[j] ← v[j] + δ·r

PSTABLE-RANDOM(p)
    // Sample from a p-stable distribution
    Sample r1, r2 uniformly from [0, 1]
    θ ← π(r1 − 1/2)
    return (sin(pθ) / cos^(1/p)(θ)) · (cos((1 − p)θ) / (− ln r2))^((1−p)/p)

ESTIMATE-DISTINCT-VALUES(p, s)
    // Estimate the current number of distinct values, using scale factor s
    d ← median of {|v[i]|^p : i ∈ {1, ..., m}}
    return d·s
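As noted above, the pseudocode can be implemented almost directly. The following Python sketch does so, with two caveats: the p-stable sample is computed in log space with clamping (the raw formula overflows floating point easily for very small p), and the scale factor must still be calibrated empirically, as in the original work. Class and parameter names are our own.

```python
import math
import random
import statistics

class DistinctSketch:
    """A sketch of Algorithm 4.1, after Cormode et al: estimate the number
    of distinct values (the Hamming norm) using p-stable distributions."""

    def __init__(self, m=16, p=0.02):
        self.m = m            # number of independent repetitions
        self.p = p            # small p > 0, approximating the Hamming norm
        self.v = [0.0] * m    # the sketch vector

    def _pstable(self, rng):
        # Chambers-Mallows-Stuck sampling from a p-stable distribution,
        # computed in log space and clamped to the representable range.
        p = self.p
        theta = math.pi * (rng.random() - 0.5)
        neg_ln_r2 = -math.log(1.0 - rng.random())
        log_mag = (math.log(abs(math.sin(p * theta)))
                   - math.log(math.cos(theta)) / p
                   + ((1 - p) / p) * (math.log(math.cos((1 - p) * theta))
                                      - math.log(neg_ln_r2)))
        sign = 1.0 if theta >= 0.0 else -1.0
        return sign * math.exp(max(min(log_mag, 700.0), -700.0))

    def insert(self, value, delta=1):
        # Seeding the generator with the value regenerates the same
        # pseudo-random entries on every update to that value, so an
        # insertion followed by a deletion cancels exactly.
        rng = random.Random(value)
        for j in range(self.m):
            self.v[j] += delta * self._pstable(rng)

    def estimate(self, scale=1.0):
        # The median of |v[i]|^p approximates the number of distinct
        # values, up to the calibration factor `scale`.
        return scale * statistics.median(abs(x) ** self.p for x in self.v)
```

Note how deletions need no special handling: they are simply insertions with δ = −1, which is precisely the property missing from most hash-based distinct-value sketches.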

Definition 4.6 (The Frequent Items Problem) Given a data stream S defined over a value domain V, return the top k items of V when sorted by frequency. That is, return the k items with the highest counts ni, along with these counts.

Unfortunately, even solving the above problem approximately has been shown to require Ω(m) space (see Alon et al [8]). To see this informally, one only needs to consider a distribution which is almost uniform, so that the most frequent value is only some tiny fraction more frequent than all other values. Due to this poor lower bound, we relax the above definition so that false positives can be returned. More precisely, we allow the synopsis structure to report a value which is not one of the top k frequent items, as long as it reports that value with an accurate frequency count.

There are two problems related to the frequent items problem which have received attention in the literature. The first is that of iceberg queries, where the goal is to find all items which have a frequency higher than a specified limit (as opposed to the top k items). Fang et al [40] presented several heuristic approaches to this problem, most involving multiple passes over the data. Unfortunately, none of their methods provide provable guarantees on the result. Recently, Manku and Motwani [74] developed two one-pass algorithms for this problem which give quality guarantees. These algorithms could be used in this chapter, because we are interested in estimating the frequency of high-frequency items. However, because we do not know an appropriate a priori lower limit for the frequencies, we cannot guarantee that we will use only a fixed amount of space.

The second related problem is that of finding hot items, those items which have very high frequency. This sounds identical to the frequent items problem, except that in the hot items problem we are not interested in the frequency counts.
Hence, a solution to the frequent items problem is also a solution to the hot items problem, but the converse is not true. Several one-pass algorithms with quality guarantees which handle insertions only have recently been proposed [31, 65], as well as an algorithm which also handles deletions [29]. None of these algorithms generalizes in a straightforward manner to the frequent items problem. As mentioned above, the exact frequent items problem is actually impossible to solve in o(m) space, as shown by Alon et al [8]. Therefore, one-pass solutions which work in sublinear space are necessarily approximate. There are two one-pass solutions available in the literature for the frequent items problem. The first is that of Charikar et al [18], which, for a given error rate, has better space complexity for

many common input distributions than the second method we will discuss. However, it suffers from two problems. Firstly, no bounds were given on the space complexity for general input distributions, and hence there is no clear trade-off in general between space and error rate (the method we will use [48] also suffers from this problem). Secondly, as it relies on a priority queue of the top k items, the technique does not generalize to handle both insertions and deletions.

The second work, that of Gibbons and Matias [48], is the one we will use in this chapter. They present two methods: one — concise samples — which handles insertions only, and which has provable guarantees on quality, and one — counting samples — which handles both insertions and deletions, but which has no guarantees on quality (currently, there are no techniques in the literature which handle both insertions and deletions with guaranteed error rates). Both techniques maintain a small uniform random sample of the underlying data, and extract the top k items from this sample. We will now discuss the counting sample technique in detail.

The basic idea behind the counting sample algorithm is to maintain a set S of pairs ⟨vi, ni⟩, where ni is a count for value vi. Upon an insertion of a value v, we first check whether that value is already contained in the counting sample, and if it is, we increment its count. If not, we insert the value into the counting sample with probability 1/τ, for some threshold τ ≥ 1. Upon a deletion, if the value is in the sample, we decrement its count, removing the value from the sample if the count drops to zero. The only difficulty is keeping the sample to a suitable size.
When the sample size exceeds our desired footprint size, we raise the threshold τ to a new threshold value τ′, and resample the counting sample by iterating through it and tossing a biased coin in such a way that, at the end of the process, we are left with a counting sample equivalent to one that we would have obtained had we used the threshold τ′ all along. Thus, the counting sample represents a uniform sample of the data stream, with each value sampled with probability 1/τ, with the added feature that once a value is in the sample, its count is maintained exactly from then on. (The method of increasing τ can vary, but, as in [48], we use the simple method τ′ = ⌈1.1 · τ⌉.) Extracting the k most frequent items is then done by sorting the counting sample, and returning the first k values with counts greater than τ − ĉ (this may return fewer than k values if the footprint size is too small). The value ĉ is a factor which compensates for the period during which the value was not in the sample, but in which occurrences of the value might still have appeared in the stream. This is the heuristic part of the algorithm that prevents a quality guarantee from being given: the authors show that ĉ = τ·(e − 2)/(e − 1) − 1 ≈ 0.418·τ − 1 is a good choice. The algorithm is given in Algorithm 4.2.

How much space do we need to guarantee that the k most frequent items are included? Our next lemma gives the result (this result was stated without proof in [18]):

Lemma 4.2 The amount of space required to ensure that the k most frequent elements occur in the random sample with probability 1 − δ is (n/nk) · log(k/δ).

Proof: Let i ∈ S denote the event that the i-th most frequent item occurs in the sample. We want:

P(∀i ∈ {1, ..., k}: i ∈ S) ≥ 1 − δ

Now:

P(∀i ∈ {1, ..., k}: i ∈ S) ≥ P(k ∈ S)^k

Hence, if we put P(k ∈ S) = 1 − δ′, it suffices that (1 − δ′)^k ≥ 1 − δ. Since (1 − δ′)^k ≥ 1 − kδ′ for 0 ≤ δ′ ≤ 1, it is enough to choose δ′ ≤ δ/k.

Let us suppose that the counting sample is sampled with probability p. Then the probability that the kth most frequent item occurs in S is:

Algorithm 4.2: Maintenance of the top k most frequent elements.

INITIALIZE()
// Initialize the counting sample S
1 τ ← 1
2 S ← ∅

INSERT(v)
// Insert a new value v into the counting sample S
1 if v has an entry in S then
2     S[v] ← S[v] + 1
3 else
4     Sample r uniformly from [0, 1]
5     if r < 1/τ then
6         S[v] ← 1
7 while footprint of S > maximum footprint size
8     EVICT-SAMPLES()

DELETE(v)
// Delete a value v from the counting sample S
1 if v has an entry in S then
2     S[v] ← S[v] − 1
3     if S[v] = 0 then
4         Remove v from S

EVICT-SAMPLES()
// Reduce the size of the counting sample by resampling at a higher threshold
1 τ′ ← ⌈1.1 · τ⌉
2 for each value v in S do
3     Sample r uniformly from [0, 1]
4     if r < τ/τ′ then
5         continue
6     S[v] ← S[v] − 1
7     while S[v] > 0 do
8         Sample r uniformly from [0, 1]
9         if r < 1/τ′ then
10            break
11        S[v] ← S[v] − 1
12    if S[v] = 0 then
13        Remove v from S
14 τ ← τ′

ESTIMATE-COUNT(v)
// Estimate the count of a value v
1 if v has an entry in S then
2     if S[v] > τ then
3         return S[v] + 0.418 · τ − 1
4 return 0
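For concreteness, the maintenance procedure can also be sketched in Python (a minimal sketch of the counting sample of [48], mirroring Algorithm 4.2; the class name, footprint parameter, and top-k extraction details below are our own):

```python
import random

class CountingSample:
    """Counting sample: a uniform random sample of a stream of insertions
    and deletions, with exact counts kept once a value enters the sample."""

    def __init__(self, max_size=100):
        self.tau = 1.0          # current threshold; sampling probability is 1/tau
        self.max_size = max_size
        self.counts = {}        # value -> exact count since admission

    def insert(self, v):
        if v in self.counts:
            self.counts[v] += 1                  # already sampled: count exactly
        elif random.random() < 1.0 / self.tau:
            self.counts[v] = 1                   # admit with probability 1/tau
        while len(self.counts) > self.max_size:
            self._evict()

    def delete(self, v):
        if v in self.counts:
            self.counts[v] -= 1
            if self.counts[v] == 0:
                del self.counts[v]

    def _evict(self):
        """Resample at a higher threshold tau', so the result is distributed
        as if tau' had been used all along."""
        new_tau = 1.1 * self.tau
        for v in list(self.counts):
            if random.random() < self.tau / new_tau:
                continue                         # first occurrence survives
            # First occurrence failed the coin toss: discard it, then re-admit
            # the remaining occurrences one by one with probability 1/tau'.
            self.counts[v] -= 1
            while self.counts[v] > 0:
                if random.random() < 1.0 / new_tau:
                    break
                self.counts[v] -= 1
            if self.counts[v] == 0:
                del self.counts[v]
        self.tau = new_tau

    def estimate(self, v):
        """Estimated frequency of v, with the 0.418 * tau - 1 compensation."""
        c = self.counts.get(v, 0)
        if c > self.tau:
            return c + 0.418 * self.tau - 1
        return 0.0

    def top_k(self, k):
        """The (at most) k most frequent sampled values with their estimates."""
        frequent = sorted(self.counts.items(), key=lambda kv: -kv[1])
        return [(v, self.estimate(v)) for v, _ in frequent[:k]
                if self.estimate(v) > 0]
```

While the threshold remains at its initial value τ = 1 every insertion is admitted, so the counts are exact until the first eviction.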


P(k ∈ S) = 1 − P(k ∉ S)
         = 1 − (1 − p)^(n_k)
         = 1 − (1 − (p · n_k)/n_k)^(n_k)
         ≥ 1 − e^(−p · n_k)

Combining the above two equations, we get:

1 − δ/k ≤ 1 − δ′ ≤ 1 − e^(−p · n_k)

Simplifying yields that p = (1/n_k) log(k/δ), and hence as the expected size of the sample is np, we get the desired result.

Unfortunately, as can be seen from the above, the formula for the required space depends on n_k, the frequency of the kth element, which is not known a priori. Thus, without prior knowledge of the data distribution, for a fixed amount of space we are unable to give any guarantees on whether the k most frequent items occur. However, we can give error bounds on the frequency of items that do occur in the synopsis:

Theorem 4.2 (Theorem 8(iii) of [48]) If a value v occurs in the synopsis S, then its augmented count will lie in the range [f_v − β · τ, f_v + 0.418 · τ − 1] with probability ≥ 1 − e^(−(β+0.418)), for all β > 0.
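To make the space bound of Lemma 4.2 concrete, a small calculation (the numbers below are illustrative only):

```python
import math

def required_sampling_rate(n_k, k, delta):
    """Sampling rate p ensuring all of the k most frequent items appear in
    the sample with probability >= 1 - delta (Lemma 4.2): p = log(k/delta)/n_k."""
    return math.log(k / delta) / n_k

def expected_sample_size(n, n_k, k, delta):
    """Expected sample size n*p = (n / n_k) * log(k / delta)."""
    return n * required_sampling_rate(n_k, k, delta)

# Illustrative numbers: a stream of n = 10^6 elements whose 10th most
# frequent value occurs n_k = 1000 times; catching the whole top 10 with
# probability 0.99 needs an expected sample of roughly 6900 entries.
size = expected_sample_size(10**6, 1000, 10, 0.01)
```

Note that the calculation requires n_k up front, which is exactly the a priori knowledge the discussion above says we do not have.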

4.5 Selectivity Estimation for Simple Path Expressions

4.5.1 Problem Definition

The dynamic selectivity estimation problem we will consider is for simple path expressions, which is considerably simpler than for the more general path expressions. In this section, we will generalize the techniques of Aboulnaga et al [3] to handle dynamic data. There are two separate summarization techniques we will consider: the use of a pruned path tree, and the use of a Markov table. The type of path expression we consider in this section is a simple path expression:

Definition 4.7 (Simple Path Expression) A simple path expression is a root-to-leaf path of the form //p1/.../pn, where each pi is an element name.

Selectivity estimation for simple path expressions is considerably easier than for general structural path expressions, due to the fact that we need not consider the correlations between path expressions that are necessary for branching path expressions. We will consider this problem in the case of a dynamic XML database, handling both insertions and deletions. It is well known that one can compute the selectivity of simple path expressions exactly using a path tree, and in many cases the path tree is small enough that it can be maintained under insertions and deletions in small space. However, simple path selectivity estimation is still important, as there are cases where the path tree might become too large to be stored exactly, for instance, in internet-scale query systems such as Niagara [82], or when estimating selectivity over streaming data. As XML data becomes more popular for storing large amounts of data, it is likely that such cases will increase in frequency.

4.5.2 Pruned Path Trees

The first technique used in [3] was pruned path trees. The path tree for an XML database is simply a trie of all root-to-leaf paths in the database, along with a count at each node indicating the selectivity of that path (Figure 4.2 gives the path tree for a simple XML document). Selectivity estimation can easily be performed using this structure, by simply finding the appropriate node and returning the count value.

[Figure 4.2: The path tree and Markov table for an XML document (figure adapted from Figure 1 of [3]). Panels: (a) an XML document; (b) the corresponding path tree, with a count at each node; (c) the corresponding Markov table (k = 2), listing each path of length at most two together with its frequency.]

However, the path tree can be quite large, and hence Aboulnaga et al proposed several techniques to prune the path tree to an acceptable size. Essentially, low frequency nodes are merged into combination nodes (denoted ∗ nodes) until the tree is of an appropriate size. In the original paper, the entire path tree was constructed, and then one of the above strategies was used to prune the tree to an acceptable size. Generalizing this synopsis structure to dynamic data appears difficult. We briefly discuss the issues here, because they pertain to generalizing any graph-based synopsis structure, and graph-based sketches have been the most successful selectivity estimators in practice. As these issues seem to be fundamentally difficult to solve, it seems likely that graph-based synopsis structures cannot be made dynamic, and that other techniques must be used instead. Some of the issues with graph-based structures include:

• If we initially merge two nodes in the path tree into a ∗ node, we may at some later time have to split the ∗ node into two nodes, because of a shift in the distribution of paths. For instance, if we merge two low frequency paths /a/b and /a/c into /a/∗, and if /a/b suddenly becomes the most frequent path in the database, then we should split it out of /a/∗, otherwise the selectivity estimate for /a/c will be greatly skewed. However, to split a ∗ node without scanning the entire database would require the storage of either some sample of the nodes of the database, or some sort of sketch structure summarizing the needed information. In either case, a significant amount of space would have to be invested in each ∗ node, defeating the purpose of constructing the ∗ node in the first place.

• If we have a ∗ node in the summarized tree, then it is not clear where new insertions should be included in the synopsis. For instance, suppose we have paths in the summarized structure /a/b and /a/∗, and no others. Upon the insertion of a path /a/c, where should it go? Should it be placed into /a/∗, or should a new path /a/c be created? Moreover, how do we know whether a previously inserted path /a/c was placed into /a/∗ (which will surely affect our placement of future /a/c insertions)?

As the above issues highlight, generalizing graph-based structures appears very difficult. At the very least, to solve the first issue a backing sample (as in the relational case [48]) would have to be maintained for each and every ∗ node. Even worse, to solve the second issue this backing sample would have to be checked upon almost every insertion! Therefore, due to these difficulties, focus should be placed upon alternative dynamic sketch structures.

4.5.3 Markov Tables

Aboulnaga et al's second technique was to use Markov tables. The basic assumption of this method is to model the distribution of paths as a Markov process, that is, the distribution of tags for a particular node in a path depends solely on the preceding tags in that path. Using this assumption, a table is constructed containing the selectivity of all paths of length up to and including k. Figure 4.2 shows the Markov table for a sample XML document, with k = 2. The selectivity of longer paths is then estimated using the formula:

f(//p1/.../pn) = f(//p1/.../pk) · ∏_{i=1}^{n−k} f(//p_{1+i}/.../p_{k+i}) / f(//p_{1+i}/.../p_{k+i−1})
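For concreteness, a minimal Python sketch of this estimation formula (our own illustration; the table contents below are hypothetical, loosely modelled on Figure 4.2):

```python
def markov_estimate(table, path, k=2):
    """Estimate f(//p1/.../pn) from an order-k Markov table using
    f(//p1/../pk) * prod_i f(//p_{1+i}/../p_{k+i}) / f(//p_{1+i}/../p_{k+i-1}).
    `table` maps tag tuples of length <= k to frequencies; `path` is a tag tuple."""
    n = len(path)
    if n <= k:
        return table.get(path, 0)       # short paths are stored exactly
    est = table.get(path[:k], 0)
    for i in range(1, n - k + 1):
        numerator = table.get(path[i:i + k], 0)
        denominator = table.get(path[i:i + k - 1], 0)
        if denominator == 0:
            return 0                     # no information: estimate zero
        est *= numerator / denominator
    return est

# Hypothetical order-2 table: f(//A)=1, f(//B)=2, f(//A/B)=2, f(//B/D)=1.
table = {('A',): 1, ('B',): 2, ('A', 'B'): 2, ('B', 'D'): 1}
# f(//A/B/D) is estimated as f(//A/B) * f(//B/D) / f(//B) = 2 * 1/2 = 1.0
```

The Markovian assumption enters in the cancellation: each factor conditions a tag only on the k − 1 tags preceding it.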

In order to allow the table to be stored in only a fixed amount of memory, pruning is performed on the lowest frequency values. The method of pruning varies, but the general idea is the same as that of pruning path trees: low frequency paths are combined into ∗ paths. There are three strategies we will consider:

• No-*: In this case, low frequency paths are simply omitted, and no ∗ paths are used.

• Global-*: In this case, two ∗ paths are maintained, //∗ and //∗/∗. The selectivity of low frequency paths of length one and two are added to these special paths; at the end, the total frequency is divided by the number of paths in each, thus giving the average selectivity of low frequency paths of length one and two.

• Suffix-*: For each tag t, we maintain a special path //t/∗, representing low-frequency paths of length 2 which begin with t. If at any time a suffix-∗ path //t/∗ is the least frequent path in the Markov table, then we merge it into the path //∗/∗.

The simple selectivity estimation formula given above must be made slightly more complicated to handle the above cases. We give the full selectivity estimation pseudocode in Algorithm 4.3. Basically, at each stage we find the "closest" matching sub-path for which we have selectivity information available. The primary challenge in making this method handle dynamic databases is that we must build the pruned table in a fixed amount of memory. Note that with the above scheme we initially start with a potentially large table, which is then trimmed to size. In the dynamic problem, we wish to start with an empty table, which is then built up incrementally to a size which is no greater than some fixed bound. This is a significantly more difficult problem, and it is easy to see that we cannot get an exact solution. It is made even more difficult if we wish to preserve additional information through the use of ∗ paths, which Aboulnaga showed increased accuracy.

4.5.4 Dynamic Markov Table Techniques

Dynamic No-∗ Markov Tables

The pruning of a Markov table discussed in the previous section is very similar to the frequent items problem discussed in Section 4.4.2: we remove low frequency paths, retaining only the m highest frequency paths (where m is the footprint of the pruned Markov table). Thus, we immediately get a method of generalizing the no-* pruning technique: maintain a Markov table of order k, and upon the insertion or deletion of a path (due to the addition or removal of a node), insert or delete this path into a frequent items synopsis structure (such as a counting sample). Similarly, queries on the selectivity of a path should be mapped directly into the frequent items synopsis's query mechanism.

Dynamic Global-∗ Markov Tables

Extending the use of //∗ and //∗/∗ paths is more challenging. Recall that the selectivity of these paths is the average selectivity of all length one and two paths which were pruned from the Markov table. We will frame our discussion in terms of the estimate for //∗, although the same techniques apply to the estimate for //∗/∗. We wish to compute f(//∗) = f/n, where f is the total number of elements which occur in the underlying data, but are not currently listed in the Markov table, and n is the total number of distinct elements occurring in the underlying data, but not currently present in the Markov table. We cannot compute these quantities exactly, so we compute approximations:

Algorithm 4.3: Estimates path selectivity using Markov tables with special ∗ paths.

ESTIMATE-SELECTIVITY(T, //p1/.../pn)
// Given a pruned Markov table T of order k, estimate the selectivity of the given path expression
1 if n = 0 then
2     return 0
3 elif n = 1 then
4     if //p1 ∈ T then
5         return T[//p1]
6     elif //∗ ∈ T then
7         return T[//∗]
8 for i ∈ {2, . . . , k} in descending order do
9     if //p1/.../pi ∈ T then
10        r ← T[//p1/.../pi]
11    elif i = 2 then
12        if //p1/∗ ∈ T then
13            r ← T[//p1/∗]
14        elif //∗/∗ ∈ T then
15            r ← T[//∗/∗]
16        else
17            return 0
18    if r is defined then
19        d ← ESTIMATE-SELECTIVITY(T, //p2/.../pi)
20        if d = 0 then
21            return 0
22        else
23            return r · ESTIMATE-SELECTIVITY(T, //p2/.../pn) / d
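For concreteness, here is our own Python transcription of Algorithm 4.3; note that the division by d in the final step is our reading of the recursive estimation step, and the table below is hypothetical:

```python
def estimate_selectivity(T, path, k=2):
    """Estimate the selectivity of //p1/.../pn from a pruned Markov table T
    with special *-paths (a transcription of Algorithm 4.3). T maps tag
    tuples to frequencies; '*' denotes the wildcard tag."""
    n = len(path)
    if n == 0:
        return 0
    if n == 1:
        if path in T:
            return T[path]
        if ('*',) in T:
            return T[('*',)]
        return 0
    for i in range(min(n, k), 1, -1):        # i = k, ..., 2 in descending order
        r = None
        if path[:i] in T:
            r = T[path[:i]]                  # exact prefix is in the table
        elif i == 2:
            if (path[0], '*') in T:
                r = T[(path[0], '*')]        # fall back to the suffix-* path
            elif ('*', '*') in T:
                r = T[('*', '*')]            # fall back to the global-* path
            else:
                return 0
        if r is not None:
            d = estimate_selectivity(T, path[1:i], k)
            if d == 0:
                return 0
            return r * estimate_selectivity(T, path[1:], k) / d
    return 0

# Hypothetical order-2 table for a quick check.
T = {('a',): 10, ('b',): 6, ('c',): 4, ('a', 'b'): 5, ('b', 'c'): 3}
# f(//a/b/c) = f(//a/b) * f(//b/c) / f(//b) = 5 * 3 / 6 = 2.5
```

The recursion finds the longest stored prefix, then chains the conditional estimates exactly as in the Markov formula above.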

f̂ = f_1 − f̂_2
n̂ = n̂_1 − n_2

In the above, f_1 is the total number of elements appearing in the document (which is easy to compute exactly), and f_2 is the total number of elements appearing in the Markov table — we only have an approximation f̂_2 to this second quantity, because the frequent items synopsis only returns approximate frequency counts. Similarly, n_1 is the total number of distinct elements in the document, and n_2 is the total number of distinct elements in the Markov table. We can compute an approximation n̂_1 to n_1 using the distinct items sketch structures discussed in Section 4.4.1, and can compute n_2 exactly. Taking the quotient of the quantities f̂ and n̂ yields an approximation to the average path selectivity, as desired.

How accurate is this estimate? We will see that the accuracy varies dramatically, depending on the proportion of k (the number of items in the list of most frequent items) to n (the total number of unique items). The reason for this dramatic variance is due to the fact that both f̂ and n̂ are computed as the difference of two values, one of which is approximate.

Lemma 4.3 Suppose x ≥ 0, and 0 ≤ y ≤ x, so that we can write y = f·x for some f satisfying 0 ≤ f ≤ 1, and let d = x − y. Then:

• If |ŷ − y| ≤ ε·y, then |d̂ − d| ≤ ε′·d, where ε′ = (f/(1−f))·ε and d̂ = x − ŷ.

• Similarly, if |x̂ − x| ≤ ε·x, then |d̂ − d| ≤ ε′·d, where ε′ = (1/(1−f))·ε and d̂ = x̂ − y.

Proof: We show the first case; the second case follows by a similar argument.

|d̂ − d| / |d| = |(x − ŷ) − (x − y)| / |x − y|
             = |ŷ − y| / |x − y|
             ≤ ε·y / |x − y|
             = ε·f / (1 − f)

Thus, we have that the error ε′ in the difference is bounded by:

ε′ = (f / (1 − f)) · ε

The dependence on f means that this error is in fact not bounded, and this makes intuitive sense. For instance, suppose x = 1.1, y = 1.0, with ε = 0.1. Then 0.9 ≤ ŷ ≤ 1.1, and hence 0 ≤ d̂ ≤ 0.2, which yields an error of ε′ = 1, i.e., a 10% error in y has turned into a 100% error in d. Note that ε′ ≤ ε if and only if f ≤ 1/2. We can now use the following lemma to determine the total error in the average value computation:

Lemma 4.4 Consider x ≥ 0, y ≥ 0, and approximations x̂ and ŷ such that |x̂ − x| ≤ ε_1·x and |ŷ − y| ≤ ε_2·y. Then:

|x̂/ŷ − x/y| ≤ ((ε_1 + ε_2) / (1 − ε_2)) · (x/y)

Proof: We have that:

((1 − ε_1)·x) / ((1 + ε_2)·y) ≤ x̂/ŷ ≤ ((1 + ε_1)·x) / ((1 − ε_2)·y)

Therefore, the maximum error is:

ε′ = max{ (ε_1 + ε_2)/(1 + ε_2), (ε_1 + ε_2)/(1 − ε_2) }
   = (ε_1 + ε_2)/(1 − ε_2)

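A quick numerical check of the two bounds (our own illustration, not from the thesis experiments):

```python
def rel_err(true_val, approx):
    """Relative error of an approximation against the true value."""
    return abs(approx - true_val) / true_val

# Lemma 4.3: a small relative error in y can blow up in the difference
# d = x - y when y is close to x (here f = y/x ≈ 0.909, so f/(1-f) ≈ 10).
x, y = 1.1, 1.0
d = x - y                 # true difference, 0.1
y_hat = 1.1               # worst case for a 10% error in y
d_hat = x - y_hat         # estimated difference, 0.0
# rel_err(d, d_hat) is 1.0: a 10% error in y became a 100% error in d.

# Lemma 4.4: errors in a quotient combine roughly additively.
# With eps1 = eps2 = 0.05 the worst-case error is (eps1 + eps2)/(1 - eps2).
eps1 = eps2 = 0.05
worst_quotient_err = (eps1 + eps2) / (1 - eps2)   # about 0.105
```

The contrast is the point: subtraction can amplify a synopsis error arbitrarily, while division degrades it only mildly.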
Thus, if ε_2 is very small, then the error in the quotient is approximately ε_1 + ε_2. Putting this all together, we find that if k is less than half n, then we can find the average selectivity to good precision, and this precision degrades as the proportion stored in the Markov table increases.

Dynamic Suffix-∗ Markov Tables

The most challenging of the dynamic Markov table algorithms is the suffix-∗ Markov table, because it captures the most information about the underlying data. We wish to estimate the selectivities //t/∗ for the types t for which these values are the highest. To do this, for every tag t, we need to know:

1. The total selectivity of all length two paths beginning with t, and

2. The number of distinct length two paths beginning with t.

In order to make this tractable, we make a few assumptions. Firstly, we assume that we are only interested in types t which have large total selectivities (the first item above) — while this is not strictly what we are after, in practice they are almost the same. Hence, for the first item, we can maintain a separate frequent items structure which finds the most frequent types t, according to the total selectivity of all length two paths beginning with that type. The second item is more troublesome — we could simply maintain an array indexed by tag type of distinct items sketch structures, but that could potentially be a huge amount of space. Instead, we adopt a similar technique to that used in the proof of Lemma 4.1. Every time a length two path is added or deleted, we probe the distinct values synopsis structure which we maintain for the path //∗/∗ to see whether adding this new path affects the distinct value count. Of course, Lemma 4.1 shows that we can never do this exactly without using a large amount of space, and in fact it is easy to see that this technique gives us absolutely no guarantees on the quality of the result. Nevertheless, we use this strategy to determine when to insert or delete entries into a separate frequent items structure, which finds the most frequent types according to the number of distinct length two paths beginning with t.

4.6 Experimental Results

In this section, we present an extensive empirical comparison of our new online selectivity techniques against the oﬄine Markov table techniques of Aboulnaga et al [3].

4.6.1 Experimental Setup

As in Chapter 3, the experiments were carried out on a machine with dual Intel Itanium 800MHz processors, 1 GB of RAM, and a 40 GB SCSI hard-drive. The machine ran the Debian GNU/Linux operating system with the 2.4.20 SMP kernel. We also used the same data sets as in Chapter 3, namely DBLP [30], MEDLINE [79], and a synthetic data set approximately 200 MB in size generated by XMark [88]. The Markov table techniques of Aboulnaga have been tested experimentally twice before in the literature, by Aboulnaga [3] and by Polyzotis [84]. As the reader may observe, our results here differ significantly from those presented in the literature. We believe the reason for this is not the implementation (which was carefully checked), but rather the query workload we chose to evaluate our algorithms with. Unfortunately, neither of the previous works [3, 84] contained enough information to allow us to reconstruct their query workloads precisely.

4.6.2 Choosing a Query Workload

As in both [3] and [84], we tested on two workloads, a set of positive path expressions (which have non-zero selectivity), and a set of negative path expressions (which have selectivity close to or equal to zero). We will now detail the specifics of our construction of these workloads, to allow duplication of our experimental results.

Generating a Positive Workload

The goal is to generate a random set of simple path expressions of the form //p1/.../pm, where m ∈ {l, . . . , u} (in our case, we used l = 1 and u = 5), such that the path expressions had non-zero selectivity. Our approach was to select, uniformly and at random, a subset of paths from the set of all paths of length m which occur in the path tree. Note that this differs from the approach of [84], which biased the workload in favor of high selectivity path expressions. Our sampling procedure works as follows. Consider some node n which lies at depth d in the path tree. Then there are exactly d simple paths which end at node n (e.g., //p_d, //p_{d−1}/p_d, . . . , //p1/p2/.../p_d). Thus, the number of simple paths in the path tree of length m is equal to the total number of nodes in the path tree which lie at depth m′ ≥ m, i.e.:

w_m = Σ_{m′ ≥ m} |{n | n lies at depth m′}|

We store the set of all nodes lying at a depth between l and u in an array, sorted in order of increasing depth (the order of nodes having the same depth is irrelevant). We then choose a random number between l and u with the probability of choosing m being:

p_m = w_m / Σ_{i=l}^{u} w_i

The number m that we choose gives the length of the path. Using a binary search, we then find the position i of the first node in the array with depth d ≥ m. Finally, we choose a node uniformly from the sub-array beginning at i and ending at the end of the array, and use this as the terminal node in a path of length m. We iterate this procedure (using sampling without replacement) until we generate a sufficient number of random paths. The one problem with the method is that if the same path occurs twice in the path tree, then it is counted twice. For instance, in Figure 4.2, the node D appears twice, and hence the singleton path //D has twice the weight of any other singleton path in our extraction process. In practice, this is actually desirable, since repeated paths in the path tree generally imply a greater significance.

Generating a Negative Workload

As in the positive workload case, the goal is to generate a random set of simple path expressions of the form //p1/.../pm, where m lies between the appropriate bounds; in this case, however, we want path expressions with very low selectivity. Our approach was to sample with replacement m different values from the set of all element names present in the document. This yielded a set of paths where the vast majority had zero selectivity. Of course, some of these path expressions may by chance have very high selectivity, but it would be artificial to construct a set of path expressions that all had zero selectivity.
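The positive-workload sampler above can be sketched in Python; the (depth, node) representation and function name are our own, and for brevity this sketch samples with replacement rather than without:

```python
import bisect
import random

def sample_positive_paths(nodes, l, u, count):
    """Sketch of the positive-workload sampler. `nodes` is a list of
    (depth, node) pairs from the path tree; a node at depth d' can terminate
    a path of any length m <= d', so w[m] = number of nodes at depth >= m."""
    nodes = sorted(nodes, key=lambda p: p[0])        # sort by increasing depth
    depths = [d for d, _ in nodes]
    # w[m] = number of simple paths of length m in the path tree
    w = {m: len(depths) - bisect.bisect_left(depths, m) for m in range(l, u + 1)}
    total = sum(w.values())
    samples = []
    for _ in range(count):
        # choose a length m with probability w[m] / total
        r = random.uniform(0, total)
        for m in range(l, u + 1):
            r -= w[m]
            if r <= 0:
                break
        # binary search for the first node of depth >= m, then pick a
        # terminal node uniformly from the remaining sub-array
        i = bisect.bisect_left(depths, m)
        _, terminal = nodes[random.randrange(i, len(nodes))]
        samples.append((m, terminal))   # the path is the last m tags ending here
    return samples
```

By construction, every sampled terminal node lies at depth at least m, so each (m, terminal) pair corresponds to a real path in the tree.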

4.6.3 Measuring the Synopsis Size

In contrast to previous experimental work [3, 84], we did not measure the synopsis size in terms of kilobytes of memory used, because this measure is highly dependent on the underlying implementation. Instead, we used the number of entries in the pruned Markov table as the size of the synopsis. For dynamic global-* and suﬃx-*, we did not include the additional space requirement for measuring the number of distinct values, because this was only a small constant overhead — for each distinct values synopsis structure, we used an array of 10 ﬂoating point numbers, and a value of p = 0.02. For suﬃx-*, for a footprint size of k we used k/2 entries for the primary frequent itemsets synopsis, and the remaining k/2 entries for the suﬃx-* speciﬁc synopsis structure.

4.6.4 Experiment 1: Comparing against Static Markov Tables

In the first experiment, we compared the performance of the dynamic Markov table algorithms constructed above against their static counterparts. The experiment was performed by bulk-loading all three data sets into a database, constructing the dynamic Markov table synopsis structures on the fly. For the static case, once the documents were loaded into the database, we trimmed the full Markov table down to the desired size. We varied the amount of space allocated for the synopsis structure from 10 to 400 in units of 10. After each synopsis structure was constructed, we used the random query workloads we constructed (as described in the previous section) to measure the error in the estimated selectivity. As our error measurement, we adopted the average absolute error, as in [3]. We did this to avoid the skew that large relative errors on queries returning small selectivities introduced. For instance, estimating a selectivity of 0 for a query with selectivity 1 gives a relative error of 100%, which is an unreasonable measure of the error. Similarly, for queries with zero selectivity, relative error cannot be sensibly computed. Instead of using absolute error, we could adopt the model of Polyzotis et al [84], who used the following formula for error:

E = Σ_{q∈Q} |estimated(q) − actual(q)| / max{s, actual(q)}

In the above, s is some real number greater than 0, which controls the skew towards results of low selectivity. However, we found that the error results returned by this technique were significantly dependent on the value of s, and hence it was difficult to interpret the effectiveness of the various algorithms.
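The dependence on s is easy to see numerically; the following sketch (with made-up selectivities) shows the reported error changing several-fold with s:

```python
def workload_error(estimates, actuals, s):
    """Polyzotis-style workload error: each absolute error is normalised by
    max(s, actual) so that low-selectivity queries do not dominate."""
    return sum(abs(e - a) / max(s, a) for e, a in zip(estimates, actuals))

# Two queries, one tiny and one large (hypothetical numbers).
estimates, actuals = [0, 100], [1, 110]
err_s1 = workload_error(estimates, actuals, s=1)    # 1/1 + 10/110, about 1.09
err_s10 = workload_error(estimates, actuals, s=10)  # 1/10 + 10/110, about 0.19
```

A factor-of-ten change in s shifts the measured error by roughly a factor of six here, which is exactly the interpretability problem described above.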

The results of our experiments are plotted in Figures A.1 through A.12. For readability, we have split each figure into two graphs, one measuring the dynamic case and one measuring the static case. In each case, both graphs have the same x and y axis ranges, so that comparisons can be made directly. There are several interesting aspects of our results, in both the static and dynamic cases:

• Our static results differ significantly from those presented in [3]. In particular, in the positive query workload case, we do not find any noticeable difference between the three pruning techniques. In fact, in Figure A.3, we find that suffix-* — the method recommended in [3] as the best for real world data — performs more poorly than the alternative techniques, as can be seen by the spikes in the error graph. The reason for this poor performance is that there was a large variance in the selectivity of length two paths, which meant that for particular element types t1 and t2, the average selectivity value //t1/∗ was a poor estimator for most paths //t1/t2. A similar problem occurs for global-* in Figure A.5. We are unsure why our results are so different, but suspect that it has to do with the differences in query workloads used by us and by Aboulnaga et al [3]. As they did not give specific details on their query workload, we are unable to verify their claims. A possible fix for this problem would be to use the median, instead of the mean, as an estimator, but this is considerably more difficult to maintain.

• In the static case, we find that no-* is the superior technique for negative query workloads, as did Aboulnaga et al [3]. In the positive query workload, we find that global-* and no-* perform similarly; however, no-* has the advantage of not having spikes in the error rate, which global-* suffers from. Hence, we would recommend the use of no-* in both positive and negative cases. This is an interesting conclusion, because no-* is the Markov table technique which preserves the least information. It appears from our results that the extra information preserved by suffix-* is not particularly useful, and the information preserved by global-* can also lead to wildly incorrect estimates in some cases.

• In the dynamic case, we find that suffix-* suffers from extremely poor performance. In fact, on most of the graphs, the suffix-* line does not appear, because its error rate was in the tens or hundreds of thousands. We attribute this to two factors. As established in the static case, the information preserved by suffix-* does not appear to improve selectivity estimation significantly, and in many cases makes it worse. This error rate is compounded by the approximate technique we outlined in Section 4.5.4, for which we cannot provide quality guarantees. Analyzing our experimental results, we found that very often the estimates of the denominator in the average were off by 50%, which led to very bad estimates.

• Global-* seems to have comparable performance to no-* in the dynamic case, except that, as in the static case, it is susceptible to huge spikes in the error due to a small change in the synopsis size. These spikes are more of a problem in the dynamic case, which is not surprising, due to the additional approximation of the dynamic techniques. We should note that all three tests represent the worst case for global-*, because the fraction of nodes stored in the synopsis is quite high, thus increasing the error, as discussed in Section 4.5.4.

• As in the static case, no-* is the most consistent performer of the three techniques. We note that slightly larger synopsis sizes are required for the dynamic no-* technique to achieve similar error rates to the static no-* technique. This is hardly surprising, given the additional power of the dynamic techniques.

• In both the static and dynamic cases, all three Markov table techniques had extremely poor error rates on the XMark database, which is a synthetic data set. The reason for this high error was the presence of a few significant elements with low selectivity, but which appeared in almost every path expression. As all other elements had extremely high selectivity, the estimates for the low selectivity elements were skewed substantially. This highlights a fundamental shortcoming of the Markov table technique — the XMark data set represents a set where the Markovian assumption fails to a higher degree than in most other data sets.

Thus, this experiment has shown that, in both the static and dynamic cases, suffix-* performs quite poorly relative to the other techniques. Global-* also seems to suffer from spikes in error, due to the fact that the XML documents we tested resulted in the worst case for it. It appears that, for the moment, no-* outperforms the other two techniques on real world data, and hence should be used in practice.

[Figure 4.3: Insertion Time for k = 2. Bar chart of the time taken (in seconds) to insert each data set (DBLP, MEDLINE, XMark) under each scheme: None, No-*, Cached Global-*, Global-*, Cached Suffix-*, and Suffix-*.]

4.6.5 Experiment 2: Evaluating Insertion Performance

In this experiment, we evaluated the insertion performance of each of our dynamic schemes. We measured the total time necessary to insert each data set in its entirety into an empty database with a synopsis structure of ﬁxed footprint size of 300, and for comparison purposes also measured the time this operation took without any synopsis scheme in use. For global-* and suﬃx-*, which make use of the distinct values solution of [28], we measured the time taken both with a cache of values (as discussed in Section 4.4.1), and without. The time taken in practice for these schemes will lie somewhere in between these two extremes, depending on the trade-oﬀ chosen between increased space and increased time costs. While we evaluated the performance for both k = 2 and k = 3, we found that the times were only negligibly slower for k = 3. Hence, due to the better accuracy of the k = 2 algorithms, we only present the results for that value in Figure 4.3. As can be seen from the ﬁgure, the times for global-* and suﬃx-* can vary dramatically depending on the amount of precomputation that is done. However, one could expect that both could be implemented in practice to run within a factor of two of no-*, which is clearly the fastest of the three. Indeed, it is impressive just how little overhead no-* imposes over the case where no selectivity information was maintained.

4.7 Conclusions

In this chapter, we investigated the difficult problem of dynamic selectivity estimation for XML databases. We demonstrated the difficulty of the problem by reducing it to the frequent itemsets problem, which is one of the fundamental, and most difficult, problems in data mining. Hence, it is unlikely that solutions to the selectivity estimation problem with good theoretical properties can be found. Thus, we turned our focus to ad hoc solutions, which have dominated the research up until now. By adapting recent results from the data streams literature, we generalized one of the most well known estimation techniques, pruned Markov tables. We found that the most effective technique was actually the one which preserved the least information, which further serves to highlight the difficulty in dynamically capturing the structure of XML documents in small space. This chapter has only given the first steps on this extremely difficult problem. Future work includes generalizations of all existing static techniques. In particular, generalizing the graph-based static schemes

would be desirable, as they have proven to be the most accurate in practice. However, as we have outlined in this chapter, there are several problems with graph-based solutions that have no clear solution. Thus, we suspect that entirely novel estimation techniques will have to be constructed. Theoretical work investigating restricted data models for which selectivity estimation becomes a simpler problem might also be of assistance in ﬁnding good solutions to this problem.


Chapter 5

Conclusions

The end crowneth the work.
— Elizabeth I (1533 – 1603)

This thesis has addressed three physical level problems fundamental to storing and querying XML data: the maintenance of ordering information, efficient evaluation of the structural join, and effective maintenance of selectivity estimates. In each case, we have found results better than existing techniques in the literature.

Our results in order maintenance represent the first serious attempt to maintain ordering information in dynamic XML databases. While many techniques have been proposed in the past to represent ordering (such as the work of Tatarinov et al [90]), our work is the first to provide methods of updating these schemes efficiently. The algorithms presented have wide applicability, and it is expected that any XML database system will have to include a similar update scheme.

While theoretically optimal algorithms for the structural join have already been presented in the literature [7], this thesis has presented improvements which can, in many cases, yield speed-ups of an order of magnitude or more in practice. As the techniques we have presented are very simple, and the assumptions made are minimal, most implementations of the structural join could be easily extended to include our improvements.

The final contribution of this thesis was a first attempt at developing dynamic XML selectivity estimation. It is clear that the dynamic variant of this problem for full XPath is an extremely difficult problem, and there appears to be no easy generalization of existing techniques to the dynamic case. Nevertheless, we have demonstrated that for simple path expressions, dynamic selectivity estimation can be performed with good accuracy.

There is a vast amount of future research in the physical layer of XML databases.
The recent works of Halverson et al. [58] and Chen et al. [22], among others, have demonstrated that structural joins are not the complete answer to XML query optimization; hence the problem of a general XML query optimization framework, especially one which considers order, remains open. Dynamic selectivity estimation for more complicated path expressions is also an important and interesting problem, one that would most likely have significant ramifications for estimating the selectivity of complicated join expressions in relational databases. Finally, the best method of actually storing and representing XML is still unknown; in particular, it is not clear whether native XML database systems or relational databases with additional XML support will prove to be the ideal XML storage solution.
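As a flavour of how data-stream techniques make selectivity synopses updatable, the following toy sketch maintains approximate path counts under insertions and deletions in a fixed-size array of hashed counters. This is a Count-Min-style illustration only, not the estimator developed in this thesis; the class and parameter names are invented for the example:

```python
# Toy illustration (not this thesis's technique) of a sketch-based path
# synopsis: path counts are folded into a small array of counters via
# hashing, so inserts and deletes cost O(1) per update and the synopsis
# size is fixed regardless of document size.
import hashlib

class PathCountSketch:
    def __init__(self, rows=4, cols=64):
        self.counts = [[0] * cols for _ in range(rows)]
        self.cols = cols

    def _cells(self, path):
        # one hashed counter per row for this path (deterministic hashing)
        for r in range(len(self.counts)):
            h = hashlib.sha1(f"{r}:{path}".encode()).hexdigest()
            yield r, int(h, 16) % self.cols

    def update(self, path, delta=1):
        # delta = +1 on node insertion, -1 on deletion
        for r, c in self._cells(path):
            self.counts[r][c] += delta

    def estimate(self, path):
        # hash collisions only ever inflate a counter, so the minimum
        # over the rows gives the tightest (one-sided) estimate
        return min(self.counts[r][c] for r, c in self._cells(path))
```

Because collisions only over-count (for non-negative totals), the estimate never falls below the true count, and its error shrinks as the counter array grows; crucially, the synopsis size is independent of the document size.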


Bibliography

[1] Serge Abiteboul. Querying semi-structured data. In Proceedings of ICDT, volume 1186 of Lecture Notes in Computer Science, pages 1–18, Delphi, Greece, 8–10 January 1997. Springer. [2] Serge Abiteboul, Haim Kaplan, and Tova Milo. Compact labeling schemes for ancestor queries. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 547–556. Society for Industrial and Applied Mathematics, 2001. [3] Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffrey F. Naughton. Estimating the selectivity of XML path expressions for internet scale applications. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB '01), pages 591–600, Orlando, September 2001. Morgan Kaufmann. [4] Ashraf Aboulnaga and Surajit Chaudhuri. Self-tuning histograms: building histograms without looking at data. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 181–192. ACM Press, 1999. [5] Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. Join synopses for approximate query answering. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 275–286. ACM Press, 1999. [6] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pages 207–216. ACM Press, 1993. [7] Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, and Jignesh M. Patel. Structural Joins: A Primitive for Efficient XML Query Pattern Matching. In ICDE. IEEE Computer Society, 2002. [8] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20–29. ACM Press, 1996. [9] Toshiyuki Amagasa, Masatoshi Yoshikawa, and Shunsuke Uemura. QRS: A Robust Numbering Scheme for XML Documents. In Proceedings of IEEE ICDE, March 2003.
[10] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Proc. 6th International Workshop on Randomization and Approximation Techniques in Computer Science, pages 1–10, 2002. [11] Michael A. Bender, Richard Cole, Erik D. Demaine, Martin Farach-Colton, and Jack Zito. Two simplified algorithms for maintaining order in a list. In Proceedings of the 10th Annual European Symposium on Algorithms (ESA 2002), volume 2461 of Lecture Notes in Computer Science, pages 152–164, Rome, Italy, September 17–21 2002. [12] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, and Eve Maler. Extensible Markup Language (XML) 1.0 (second edition). http://www.w3.org/TR/2000/REC-xml-20001006, 2000. [13] Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. STHoles: a multidimensional workload-aware histogram. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pages 211–222. ACM Press, 2001.

[14] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: optimal XML pattern matching. In Michael J. Franklin, Bongki Moon, and Anastassia Ailamaki, editors, SIGMOD Conference, pages 310–321. ACM, 2002. [15] J. Bunge and M. Fitzpatrick. Estimating the number of species: A review. Journal of the American Statistical Association, 88(421):364–373, March 1993. [16] Online Computer Library Center. Introduction to the Dewey Decimal Classification. http://www.oclc.org/oclc/fp/about/about_the_ddc.htm. [17] Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. Towards estimation error guarantees for distinct values. In Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 268–279. ACM Press, 2000. [18] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In ICALP: Annual International Colloquium on Automata, Languages and Programming, pages 693–703, 2002. [19] Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. Random sampling for histogram construction: how much is enough? In Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pages 436–447. ACM Press, 1998. [20] Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. On random sampling over joins. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 263–274. ACM Press, 1999. [21] Z. Chen, H. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. Ng, and D. Srivastava. Counting twig matches in a tree. In 17th International Conference on Data Engineering (ICDE '01), pages 595–606, Washington - Brussels - Tokyo, April 2001. IEEE. [22] Zhimin Chen, H.V. Jagadish, Laks V.S. Lakshmanan, and Stelios Paparizos. From Tree Patterns to Generalized Tree Patterns: On Efficient Evaluation of XQuery. In Proceedings of VLDB, 2003. [23] Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, and Carlo Zaniolo.
Efficient structural joins on indexed XML documents. In VLDB Conference, pages 263–274, Berlin, Germany, 2002. [24] S. Christodoulakis. Implications of certain assumptions in database performance evaluation. ACM Transactions on Database Systems (TODS), 9(2):163–186, 1984. [25] Edith Cohen, Haim Kaplan, and Tova Milo. Labeling Dynamic XML Trees. In Proceedings of PODS, pages 271–281, New York, June 3–5 2002. ACM Press. [26] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to algorithms. MIT Press and McGraw-Hill Book Company, 6th edition, 1992. [27] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing data streams using Hamming norms (how to zero in). IEEE Transactions on Knowledge and Data Engineering, 15(3):529–540, May–June 2003. [28] Graham Cormode, Mayur Datar, Piotr Indyk, and S. Muthukrishnan. Comparing data streams using Hamming norms (how to zero in). In Proc. 28th Int. Conf. on Very Large Data Bases (VLDB 2002), pages 335–345, Hong Kong, China, 2002. Morgan Kaufmann. [29] Graham Cormode and S. Muthukrishnan. What's hot and what's not: tracking most frequent items dynamically. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 296–306. ACM Press, 2003. [30] DBLP. http://www.informatik.uni-trier.de/~ley/db/. [31] Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Frequency estimation of internet packet streams with limited space. In Proceedings of the 10th Annual European Symposium on Algorithms (ESA 2002), volume 2461 of Lecture Notes in Computer Science, pages 348–360, Rome, Italy, September 17–21 2002.

[32] Kurt Deschler and Elke Rundenstiner. MASS: A Multi-Axis Storage Structure for Large XML Documents. To appear in Proceedings of the Twelfth International Conference on Information and Knowledge Management, 2003. [33] Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu. A query language for XML. Computer Networks (Amsterdam, Netherlands: 1999), 31(11–16):1155–1169, May 1999. [34] P. Dietz and D. Sleator. Two algorithms for maintaining order in a list. In Proceedings of the nineteenth annual ACM conference on Theory of computing, pages 365–372. ACM Press, 1987. [35] P. F. Dietz, J. I. Seiferas, and J. Zhang. A tight lower bound for on-line monotonic list labeling. Lecture Notes in Computer Science, 824:131–142, 1994. [36] Paul F. Dietz. Maintaining order in a linked list. In Proceedings of the fourteenth annual ACM symposium on Theory of computing, pages 122–127, 1982. [37] Paul F. Dietz and Ju Zhang. Lower bounds for monotonic list labeling. In John R. Gilbert and Rolf G. Karlsson, editors, SWAT 90, 2nd Scandinavian Workshop on Algorithm Theory, volume 447 of Lecture Notes in Computer Science, pages 173–180, Bergen, Norway, 11–14 July 1990. Springer. [38] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities. In 11th Annual European Symposium on Algorithms, 2003. [39] D. C. Fallside (Ed.). "XML Schema Part 0: Primer". W3C Recommendation, May 2001. http://www.w3.org/TR/xmlschema-0. [40] Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, and Jeffrey D. Ullman. Computing iceberg queries efficiently. In Ashish Gupta, Oded Shmueli, and Jennifer Widom, editors, VLDB'98, Proceedings of 24th International Conference on Very Large Data Bases, August 24-27, 1998, New York City, New York, USA, pages 299–310. Morgan Kaufmann, 1998. [41] Mary Fernández, Yana Kadiyska, Dan Suciu, Atsuyuki Morishima, and Wang-Chiew Tan. SilkRoute: A framework for publishing relational data in XML.
ACM Transactions on Database Systems, 27(4):438–493, December 2002. [42] P. Flajolet and G. N. Martin. Probabilistic counting. In 24th Annual Symposium on Foundations of Computer Science, pages 76–82, 1983. [43] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for database applications. Journal of Computer and System Sciences, 31(2):182–209, 1985. [44] Daniela Florescu and Donald Kossmann. Storing and querying XML data using an RDBMS. IEEE Data Engineering Bulletin, 22(1):27–34, March 1999. [45] Juliana Freire, Jayant R. Haritsa, Maya Ramanath, Prasan Roy, and Jérôme Siméon. StatiX: making XML count. In Michael Franklin, Bongki Moon, and Anastassia Ailamaki, editors, Proceedings of the 2002 ACM SIGMOD international conference on Management of data (SIGMOD-02), pages 181–191, New York, June 3–6 2002. ACM Press. [46] Lise Getoor, Benjamin Taskar, and Daphne Koller. Selectivity estimation using probabilistic models. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pages 461–472. ACM Press, 2001. [47] Phillip B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB '01), pages 541–550, Orlando, September 2001. Morgan Kaufmann. [48] Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. SIGMOD Record: Proc. ACM SIGMOD Int. Conf. Management of Data, 27(2):331–342, 2–4 June 1998.

[49] Phillip B. Gibbons, Yossi Matias, and Viswanath Poosala. Fast incremental maintenance of approximate histograms. In Matthias Jarke, Michael J. Carey, Klaus R. Dittrich, Frederick H. Lochovsky, Pericles Loucopoulos, and Manfred A. Jeusfeld, editors, VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 466–475. Morgan Kaufmann, 1997. [50] Phillip B. Gibbons, Yossi Matias, and Viswanath Poosala. Fast incremental maintenance of approximate histograms. ACM Transactions on Database Systems, 27(3):261–298, September 2002. [51] Anna Gilbert, Sudipto Guha, Piotr Indyk, Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. Fast small-space algorithms for approximate histogram maintenance. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC-02), pages 389–398, New York, May 19– 21 2002. ACM Press. [52] R. Goldman, J. McHugh, and J. Widom. From Semistructured Data to XML: Migrating the Lore Data Model and Query Language. In Workshop on the Web and Databases (WebDB ’99), pages 25–30, 1999. [53] Roy Goldman, Narayanan Shivakumar, Suresh Venkatasubramanian, and Hector Garcia-Molina. Proximity Search in Databases. In Proceedings of VLDB, pages 26–37. Morgan Kaufmann Publishers, 1998. [54] Roy Goldman and Jennifer Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In Proceedings of VLDB, pages 436–445, East Sussex - San Francisco, August 1998. Morgan Kaufmann. [55] Roy Goldman and Jennifer Widom. Summarizing and Searching Sequential Semistructured Sources. Technical Report, Stanford University, 2000. [56] Peter J. Haas, Jeﬀrey F. Naughton, S. Seshadri, and Lynne Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Umeshwar Dayal, Peter M. D. 
Gray, and Shojiro Nishio, editors, VLDB'95, Proceedings of 21st International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland, pages 311–322. Morgan Kaufmann, 1995. [57] Jiang Haifeng, Hongjun Lu, Wang Wei, and Beng Chin Ooi. XR-Tree: Indexing XML Data for Efficient Structural Join. In ICDE. IEEE Computer Society, 2003. [58] Alan Halverson, Josef Burger, Leonidas Galanis, Ameet Kini, Rajasekar Krishnamurthy, Ajith Nagaraja Rao, Feng Tian, Stratis Viglas, Yuan Wang, Jeffrey F. Naughton, and David J. DeWitt. Mixed Mode XML Query Processing. In Proceedings of VLDB, 2003. [59] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. 41st Symp. Foundations of Computer Science, pages 189–197. IEEE, Nov 2000. [60] Yannis E. Ioannidis and Stavros Christodoulakis. On the propagation of errors in the size of join results. In Proceedings of the 1991 ACM SIGMOD international conference on Management of data, pages 268–277. ACM Press, 1991. [61] Yannis E. Ioannidis and Stavros Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Transactions on Database Systems (TODS), 18(4):709–748, 1993. [62] Yannis E. Ioannidis and Viswanath Poosala. Balancing histogram optimality and practicality for query result size estimation. In Proceedings of the 1995 ACM SIGMOD international conference on Management of data, pages 233–244. ACM Press, 1995. [63] H. V. Jagadish, S. Al-Khalifa, A. Chapman, L. V. S. Lakshmanan, A. Nierman, S. Paparizos, J. M. Patel, D. Srivastava, N. Wiwatwattana, Y. Wu, and C. Yu. TIMBER: A native XML database. VLDB Journal: Very Large Data Bases, 11(4):274–291, 2002. [64] Haim Kaplan, Tova Milo, and Ronen Shabo. A comparison of labeling schemes for ancestor queries. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 954–963. Society for Industrial and Applied Mathematics, 2002.

[65] Richard M. Karp, Scott Shenker, and Christos H. Papadimitriou. A simple algorithm for ﬁnding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS), 28(1):51– 55, 2003. [66] W. Eliot Kimber. HyTime and SGML: Understanding the HyTime HYQ Query Language. Technical Report Version 1.1, IBM Corporation, August 1993. [67] Ju-Hong Lee, Deok-Hwan Kim, and Chin-Wan Chung. Multi-dimensional selectivity estimation using compressed histogram information. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 205–214. ACM Press, 1999. [68] Yong Kyu Lee, Seong-Joon Yoo, Kyoungro Yoon, and P. Bruce Berra. Index structures for structured documents. In Proceedings of the ﬁrst ACM international conference on Digital libraries, pages 91– 99. ACM Press, 1996. [69] Alberto Lerner and Dennis Shasha. AQuery: Query Language for Ordered Data, Optimization Techniques, and Experiments. In Proceedings of VLDB, 2003. [70] Quanzhong Li and Bongki Moon. Indexing and querying XML data for regular path expressions. In Proceedings of VLDB, pages 361–370, 2001. [71] Hartmut Liefke. Horizontal query optimization on ordered semistructured data. In WebDB (Informal Proceedings), pages 61–66, 1999. [72] Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeﬀrey Scott Vitter, and Ronald Parr. XPathLearner: An On-line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. In Proc. 28th Int. Conf. on Very Large Data Bases (VLDB 2002), pages 442–453, Hong Kong, China, 2002. Morgan Kaufmann. [73] Richard J. Lipton, Jeﬀrey F. Naughton, and Donovan A. Schneider. Practical selectivity estimation through adaptive sampling. In Proceedings of the 1990 ACM SIGMOD international conference on Management of data, pages 1–11. ACM Press, 1990. [74] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, August 2002. 
[75] Yossi Matias, Jeffrey Scott Vitter, and Min Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pages 448–459. ACM Press, 1998. [76] Yossi Matias, Jeffrey Scott Vitter, and Min Wang. Dynamic maintenance of wavelet-based histograms. In Amr El Abbadi, Michael L. Brodie, Sharma Chakravarthy, Umeshwar Dayal, Nabil Kamel, Gunter Schlageter, and Kyu-Young Whang, editors, VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 101–110. Morgan Kaufmann, 2000. [77] Jason McHugh, Serge Abiteboul, Roy Goldman, Dallas Quass, and Jennifer Widom. Lore: a database management system for semistructured data. ACM SIGMOD Record, 26(3):54–66, 1997. [78] Jason McHugh and Jennifer Widom. Query optimization for XML. In Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, and Michael L. Brodie, editors, VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 315–326. Morgan Kaufmann, 1999. [79] MEDLINE. http://www.nlm.nih.gov/bsd/licensee/data_elements_doc.html. [80] M. Muralikrishna and David J. DeWitt. Equi-depth multidimensional histograms. In Proceedings of the 1988 ACM SIGMOD international conference on Management of data, pages 28–36. ACM Press, 1988. [81] M. Murata, D. Lee, and M. Mani. "Taxonomy of XML Schema Languages using Formal Language Theory". In Extreme Markup Languages, Montreal, Canada, August 2001.

[82] Jeffrey F. Naughton, David J. DeWitt, David Maier, Ashraf Aboulnaga, Jianjun Chen, Leonidas Galanis, Jaewoo Kang, Rajasekar Krishnamurthy, Qiong Luo, Naveen Prakash, Ravishankar Ramamurthy, Jayavel Shanmugasundaram, Feng Tian, Kristin Tufte, Stratis Viglas, Yuan Wang, Chun Zhang, Bruce Jackson, Anurag Gupta, and Rushan Chen. The Niagara internet query system. IEEE Data Engineering Bulletin, 24(2):27–33, 2001. [83] John Nolan. Stable distributions (available from http://academic2.american.edu/~jpnolan/stable/chap1.ps). [84] Neoklis Polyzotis and Minos Garofalakis. Statistical synopses for graph-structured XML databases. In Michael Franklin, Bongki Moon, and Anastassia Ailamaki, editors, Proceedings of the 2002 ACM SIGMOD international conference on Management of data (SIGMOD-02), pages 358–369, New York, June 3–6 2002. ACM Press. [85] Neoklis Polyzotis and Minos Garofalakis. Structure and value synopses for XML data graphs. In Proc. 28th Int. Conf. on Very Large Data Bases (VLDB 2002), pages 466–477, Hong Kong, China, 2002. Morgan Kaufmann. [86] Viswanath Poosala, Peter J. Haas, Yannis E. Ioannidis, and Eugene J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pages 294–305. ACM Press, 1996. [87] Viswanath Poosala and Yannis E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Matthias Jarke, Michael J. Carey, Klaus R. Dittrich, Frederick H. Lochovsky, Pericles Loucopoulos, and Manfred A. Jeusfeld, editors, VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 486–495. Morgan Kaufmann, 1997. [88] Albrecht Schmidt, Florian Waas, Martin L. Kersten, Michael J. Carey, Ioana Manolescu, and Ralph Busse. Assessing XML data management with XMark. In Stéphane Bressan, Akmal B.
Chaudhri, Mong-Li Lee, Jeffrey Xu Yu, and Zoé Lacroix, editors, EEXTT, volume 2590 of Lecture Notes in Computer Science, pages 144–145. Springer, 2003. [89] Michael Sipser. Introduction to the Theory of Computation. PWS Publishing Co., Boston, Massachusetts, 1997. [90] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 204–215. ACM Press, 2002. [91] W3C Recommendation. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath, November 1999. [92] W3C Working Draft. XML Path Language (XPath) 2.0. http://www.w3.org/TR/2002/WD-xpath20-20021115, November 2002.

[93] W3C Working Draft. XQuery 1.0: An XML Query Language. http://www.w3.org/TR/2002/WD-xquery-20021115, November 2002. [94] Wei Wang, Haifeng Jiang, Hongjun Lu, and Jeffrey Xu Yu. Containment join size estimation: models and methods. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 145–156. ACM Press, 2003. [95] Xiaodong Wu, Mong Li Lee, and Wynne Hsu. A Prime Number Labeling Scheme for Dynamic Ordered XML Trees. To appear in Proceedings of the Twentieth International Conference on Data Engineering, 2004. [96] Yuqing Wu, Jignesh M. Patel, and H. V. Jagadish. Estimating answer sizes for XML queries. Lecture Notes in Computer Science, 2287:590–608, 2002.

[97] Yuqing Wu, Jignesh M. Patel, and H. V. Jagadish. Structural join order selection for XML query optimization. In Proc. 19th Int. Conf. on Data Engineering (ICDE 2003), Bangalore, India, 2003. IEEE Computer Society. [98] Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki Shimura, and Shunsuke Uemura. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Transactions on Internet Technology (TOIT), 1(1):110–141, 2001. [99] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On supporting containment queries in relational database management systems. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pages 425–436. ACM Press, 2001.


Appendix A

Dynamic Selectivity Experimental Graphs

[Figures A.1–A.12: plots of Absolute Error (y-axis) against Synopsis Size (x-axis, 0–400). Each figure has two panels: (a) Static Case, comparing the No *, Global *, and Suffix * synopses, and (b) Dynamic Case, comparing Dynamic No *, Dynamic Global *, and Dynamic Suffix *.]

Figure A.1: Experiment 1 Results for DBLP, k = 2, positive query workload
Figure A.2: Experiment 1 Results for DBLP, k = 2, negative query workload
Figure A.3: Experiment 1 Results for MEDLINE, k = 2, positive query workload
Figure A.4: Experiment 1 Results for MEDLINE, k = 2, negative query workload
Figure A.5: Experiment 1 Results for XMark, k = 2, positive query workload
Figure A.6: Experiment 1 Results for XMark, k = 2, negative query workload
Figure A.7: Experiment 1 Results for DBLP, k = 3, positive query workload
Figure A.8: Experiment 1 Results for DBLP, k = 3, negative query workload
Figure A.9: Experiment 1 Results for MEDLINE, k = 3, positive query workload
Figure A.10: Experiment 1 Results for MEDLINE, k = 3, negative query workload
Figure A.11: Experiment 1 Results for XMark, k = 3, positive query workload
Figure A.12: Experiment 1 Results for XMark, k = 3, negative query workload
86…...

Premium Essay

...DATABASE S YSTEMS DESIGN, IMPLEMENTATION, AND MANAGEMENT CARLOS CORONEL • STEVEN MORRIS • PETER ROB Australia • Brazil • Japan • Korea • Mexico • Singapore • Spain • United Kingdom • United States Copyright 2010 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Database Systems: Design, Implementation, and Management, Ninth Edition Carlos Coronel, Steven Morris, and Peter Rob Vice President of Editorial, Business: Jack W. Calhoun Publisher: Joe Sabatino Senior Acquisitions Editor: Charles McCormick, Jr. Senior Product Manager: Kate Mason Development Editor: Deb Kaufmann Editorial Assistant: Nora Heink Senior Marketing Communications Manager: Libby Shipp Marketing Coordinator: Suellen Ruttkay Content Product Manager: Matthew Hutchinson Senior Art Director: Stacy Jenkins Shirley Cover Designer: Itzhack Shelomi Cover Image: iStock Images Media Editor: Chris Valentine Manufacturing Coordinator: Julio Esperas Copyeditor: Andrea Schein Proofreader: Foxxe Editorial Indexer: Elizabeth Cunningham Composition: GEX Publishing Services © 2011 Cengage Learning ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced, transmitted, stored or used in any form or by any means graphic, electronic, or mechanical, including but not limited to photocopying, recording, scanning, digitizing, taping, Web distribution, information networks, or information storage and retrieval systems, except as......

Words: 189848 - Pages: 760

Premium Essay

...DATABASE DESIGN One of the most important if not the most important aspect of a database is the database design process. It is a must that the design is good and functional. A database consists of many different parts of an engineer’s design, which together makes up a complete database system. A database system is designed to manage large bodies of information. Database design is the construction of a database as a subsystem of an information system. Therefore, the database design process is an integral component of the information system design process. ( Chilson and Kudlac, 1983). One of the greatest problems in the application design process is the lack of foresight. During the design process there must be a thorough evaluation of your database, what it should hold, how data relates to each other, and most importantly, whether it is scalable. When it comes to the maintenance of your database it should be easy to maintain, this means storing as limited amount of repetitive data as possible. Let’s say you have a lot of duplicate data, and one instant of that data undergoes a name change, that change has to be made for all occurrences of that data. You can overcome this duplication problem by creating a table of possible values and use a key to refer to the value. That way, if the value changes names, the change occurs only once, in the master table. What are the steps involved in the database design. First, you must determine the purpose of your database, this......

Words: 2020 - Pages: 9
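The lookup-table fix described in the excerpt above can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3; the table and column names (department, employee, dept_id) are hypothetical, chosen only to show the technique:

```python
import sqlite3

# In-memory database; schema names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Master table of possible values: each department name is stored exactly once.
cur.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
# Detail rows refer to the value by key instead of repeating the text.
cur.execute("""CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT,
    dept_id INTEGER REFERENCES department(id))""")

cur.execute("INSERT INTO department (id, name) VALUES (1, 'Sales')")
cur.executemany("INSERT INTO employee (name, dept_id) VALUES (?, ?)",
                [("Ann", 1), ("Bob", 1), ("Cho", 1)])

# A rename now happens in exactly one place: the master table.
cur.execute("UPDATE department SET name = 'Global Sales' WHERE id = 1")

# Every employee row picks up the new name through the join.
rows = cur.execute("""SELECT e.name, d.name FROM employee e
                      JOIN department d ON e.dept_id = d.id
                      ORDER BY e.id""").fetchall()
print(rows)
```

Had the department name been stored as text in every employee row, the rename would have required updating all three rows instead of one.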

Free Essay

...An Analysis of XML Database Solutions for the Management of MPEG-7 Media Descriptions MOHSIN ARSALAN SZABIST KARACHI, PAKISTAN Abstract: MPEG-7 based applications will be deployed in the near future, owing to its promising standard for the description of multimedia content. What is therefore needed is the management of large amounts of MPEG-7 compliant media descriptions. MPEG-7 documents are XML documents whose media description schemes are defined in XML Schema. This paper presents some of the solutions that help build a management system for MPEG-7; these solutions are then compared and analyzed. The results show how limited today’s databases are when faced with MPEG-7. Keywords: MPEG-7, XML, SCHEMA 1. INTRODUCTION The paper is organized as follows: Section 1 introduces MPEG-7; the solutions for its management are then presented and analyzed, and finally the conclusion is drawn. 1.1. MPEG-7: MPEG (the Moving Picture Experts Group) is the creator of the well-known MPEG-1, MPEG-2, and MPEG-4 standards. MPEG-7 is an ISO standard developed by the MPEG group. The MPEG-7 standard, formally named “Multimedia Content Description Interface”, provides a rich set of standardized tools for describing multimedia content. Both humans and automatic systems that process audiovisual information fall within the MPEG-7 scope. Audiovisual description tools (the metadata elements and their structure and......

Words: 1061 - Pages: 5
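Since MPEG-7 descriptions are XML documents, the management problem the excerpt above describes is ultimately one of storing and querying XML. The sketch below uses Python's standard xml.etree.ElementTree on a heavily simplified, hypothetical stand-in document; the element names here are invented for illustration and are not the actual MPEG-7 description schemes:

```python
import xml.etree.ElementTree as ET

# A toy stand-in for an MPEG-7 media description. Real MPEG-7 documents
# follow description schemes defined in XML Schema and are far richer.
doc = """
<MediaDescription>
  <Media id="clip1">
    <Title>Holiday footage</Title>
    <Format>video/mpeg</Format>
    <DurationSeconds>95</DurationSeconds>
  </Media>
</MediaDescription>
"""

root = ET.fromstring(doc)
# Navigational queries like this, run over millions of such descriptions,
# are exactly what an XML-capable database must support efficiently.
titles = [media.findtext("Title") for media in root.iter("Media")]
print(titles)
```

In-memory parsing works for a single document; the paper's point is that doing this at collection scale is where current database support falls short.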

Premium Essay

...Information Science Careers Brian Maltass University of South Florida Introduction They say that we are living in the information age; however, it is something of a scandal that we have neither a definition nor a theory of information precise and broad enough to make such an assertion sensible. In 2001, Niels Ole Fiennemann proposed a general history of media. He argued that no society could exist in which the exchange and production of information were of little significance; therefore one cannot compare information societies and industrial societies in any consistent way. Information societies may be industrial societies, and industrial societies may be information societies. The media matrix he suggested is as follows: first, speech-based oral cultures; second, literate cultures: writing (number systems and primary alphabets); third, print cultures: print + speech + written texts; and fourth, second-order alphabetical cultures: written texts + speech + analogue electrical media + digital media. This paper seeks to visit the origin of information technology and the developments it has undergone to become what it is today. In tracing the history of the development of information technology, the paper looks into the challenges that were encountered during its advancement. Discussion In today’s era, the crucial influence with regard to the concept of information is borrowed from the information theory that was developed by Shannon alongside others.......

Words: 2995 - Pages: 12

Premium Essay

...English IV 23 Nov. 2013 Computer Engineering Throughout the world, engineering is perhaps the most necessary profession in any context, whether it be designing buildings, roads, chemicals, or even software or hardware for a computer. In this day and age, nowhere is this more evident than in the rising computer industry. As society progresses through the 21st century, the need for computer engineers of all kinds can only increase. Recognizing this growing need, one can see why computer engineering is an ideal field to study. With a diverse range of courses available, from writing software to designing hardware, and the ever-increasing use of computers in general, it is easy to see why this career path is one worth pursuing. To start, there are two main paths in computer engineering, as stated previously: hardware engineering and software engineering. Computer hardware engineers research, design, develop, test, and manage the manufacture and installation of various computer hardware ("Computer Engineering"). According to the above website, hardware includes computer chips, circuit boards, computer systems, and equipment such as keyboards, routers, and even printers. Those who go into this field of engineering also learn the electrical details that go into hardware design. "The work of computer hardware engineers is similar to that of electronics engineers in that they design and test circuits and......

Words: 803 - Pages: 4

Premium Essay

...Interdisciplinary Design Programme and Industrial Management Engineering Indian Institute of Technology, Kanpur 208016, India Email: jayanta@iitk.ac.in jayanta.chatterjee@gmail.com Phone: 91-512-2597858, 91-512-2597376 (O) Mobile: +91-9648117755 Jayanta Chatterjee “To learn, research, teach and consult in my competence areas, to evolve as a person and share my ken to make a difference through creative Innovation” Core Competence: Global Sales & Marketing; Product and Brand Management; New Business Development. Research Interest: Innovation in socio-technical systems; Cause Related Marketing; Media & Communication; Strategic Design of Product-Service Systems; Digital ecosystems & autopoiesis. Dr. Jayanta Chatterjee has 42 years of teaching, research, and professional experience in management across different industries and countries. Jayanta started his career in 1972 at Siemens in Sales and Project Engineering and developed expertise in new product management. He then pioneered the introduction of advanced electronic control systems to Indian industries at Allen-Bradley Ltd, where he rose to the position of CEO in 1990. True to his passion, he was also teaching as visiting faculty at IIT Kanpur and IIT Delhi during this period. Later, he co-founded Strategy Innovation Inc and became the Chief Knowledge Officer of vtPlex. In 2001 he divested out of that enterprise and joined academia full time at Industrial & Management Engineering......

Words: 2148 - Pages: 9

Premium Essay

...Chapter 1 Databases and Database Users. Review Questions
1.1. Define the following terms: data, database, DBMS, database system, database catalog, program-data independence, user view, DBA, end user, canned transaction, deductive database system, persistent object, meta-data, transaction-processing application.
1.2. What three main types of actions involve databases? Briefly discuss each.
1.3. Discuss the main characteristics of the database approach and how it differs from traditional file systems.
1.4. What are the responsibilities of the DBA and the database designers?
1.5. What are the different types of database end users? Discuss the main activities of each.
1.6. Discuss the capabilities that should be provided by a DBMS.
Exercises
1.7. Identify some informal queries and update operations that you would expect to apply to the database shown in Figure 1.2.
1.8. What is the difference between controlled and uncontrolled redundancy? Illustrate with examples.
1.9. Name all the relationships among the records of the database shown in Figure 1.2.
1.10. Give some additional views that may be needed by other user groups for the database shown in Figure 1.2.
1.11. Cite some examples of integrity constraints that you think should hold on the database shown in Figure 1.2.
Chapter 2 Database System Concepts and Architecture. Review Questions
2.1. Define the following terms: data model, database schema, database state, internal schema, conceptual schema, external schema, data......

Words: 803 - Pages: 4

Free Essay

...Aglets, originally from IBM, now Open Source, the Java Agent Development Environment (JADE) from the University of Parma, and Ascape, from the Brookings Institute in Washington DC. Communication languages such as the Semantic Language (SL) and XML will also be discussed. CPS 720 Artificial Intelligence Programming (http://www.ryerson.ca/~dgrimsha/courses/cps720/index.html). General Course Information: For last year's course (Fall 2000); Course Management Form (Fall 2001); Course Resources and References; Assignments; Exam ReadMe. Course Topics. Introduction: what is an agent? (Agents: Natural and Artificial; Ferber's Discussion; Lange and Oshima; Nikolic). Situated Agents (Agent rationality; Agent autonomy; An Agent Classification Scheme; A Basic Reactive Agent Example; A Reactive Agent with state; Agent environments). Mobile Communicative Agents (The CERN CSC99 Agent Course: Lectures 1-3 with notes; The Agent and its environment; Seven Advantages of mobility; Mobile agents vs mobile objects; Agents and network computing paradigms; The question of location transparency in distributed systems). Aglets (What are Aglets?; Getting started with Aglets; The Aglet Model; The Aglet Package; Aglet Mobility; Inter-Aglet Communication; Cloning; Aglet Design Patterns: Survey of Patterns; The Master-Slave Pattern; The Sequential Itinerary......

Words: 278775 - Pages: 1116
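The "Reactive Agent with state" topic in the course outline above can be sketched generically. This is an illustrative stand-in written for this listing, not code from the CPS 720 course or the Aglets API; the percept and action names are invented:

```python
# A minimal reactive agent with internal state: it maps (state, percept)
# to an action and a successor state, so identical percepts can yield
# different actions depending on what the agent has already seen.
class ReactiveAgent:
    def __init__(self):
        self.state = "idle"

    def act(self, percept):
        if self.state == "idle" and percept == "obstacle":
            self.state = "avoiding"
            return "turn"
        if self.state == "avoiding" and percept == "clear":
            self.state = "idle"
            return "forward"
        return "wait"

agent = ReactiveAgent()
# The two 'clear' percepts produce different actions because the
# agent's internal state changes between them.
actions = [agent.act(p) for p in ["obstacle", "clear", "clear"]]
print(actions)
```

A purely reactive agent, by contrast, would be a stateless function from percept to action and could not distinguish the two "clear" readings.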

Free Essay

...services among the students of engineering: a case study of King Saud University Akhtar Hussain Civil Engineering Department, King Saud University, Riyadh, Kingdom of Saudi Arabia, and Abdulwahab M. Abalkhail Department of Library and Information Science, King Saud University, Riyadh, Kingdom of Saudi Arabia Abstract Purpose – The purpose of this paper is to examine the determinants of library use, collections and services among the students of engineering at King Saud University, Riyadh (KSA). Design/methodology/approach – A survey was designed to collect needed information about the level of usage of library collections and services and the satisfaction of users. A well-structured questionnaire was circulated among the faculties, research scholars, postgraduates, undergraduates, and other categories to collect the necessary primary data, keeping in mind the objectives of the study. Findings – The findings clearly reveal that the majority of users of the library used the circulation service. The study found that a majority of research scholars consult reference books for research work, followed by undergraduate students, who used the library circulation service. Research limitations/implications – The present paper covers only College of Engineering users, and the geographical area is restricted to the central library at King Saud University, Riyadh. The scope of the paper could be extended to additional private and government universities in KSA and abroad. A......

Words: 6330 - Pages: 26

Premium Essay

..."Trends in Computer Science Research" Apirak Hoonlor, Boleslaw K. Szymanski and M. Zaki, Communications of the ACM, 56(10), Oct. 2013, pp.74-83 An Evolution of Computer Science Research Apirak Hoonlor, Boleslaw K. Szymanski, Mohammed J. Zaki, and James Thompson Abstract Over the past two decades, Computer Science (CS) has continued to grow as a research field. There are several studies that examine trends and emerging topics in CS research or the impact of papers on the field. In contrast, in this article, we take a closer look at the entirety of CS research over the past two decades by analyzing the data on publications in the ACM Digital Library and IEEE Xplore, and the grants awarded by the National Science Foundation (NSF). We identify trends, bursty topics, and interesting inter-relationships between NSF awards and CS publications, finding, for example, that if an uncommonly high frequency of a specific topic is observed in publications, the funding for this topic is usually increased. We also analyze CS researchers and communities, finding that only a small fraction of authors attribute their work to the same research area for a long period of time, reflecting for instance the emphasis on novelty (use of new keywords) and typical academic research teams (with core faculty and more rapid turnover of students and postdocs). Finally, our work highlights the dynamic research landscape in CS, with its focus constantly moving to new challenges arising from new......

Words: 15250 - Pages: 61
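The "bursty topics" idea in the abstract above can be illustrated with a crude threshold test: flag the years in which a topic keyword appears uncommonly often relative to its average frequency. This is only a sketch of the idea with invented numbers; the paper's actual burst-detection methodology is more sophisticated:

```python
from statistics import mean

# Toy publication counts for one hypothetical topic keyword per year.
counts = {2004: 3, 2005: 4, 2006: 5, 2007: 21, 2008: 19, 2009: 6}

# A year is "bursty" if the topic's count exceeds twice its average
# frequency over the window (a crude stand-in for real burst detection).
avg = mean(counts.values())
bursty_years = [year for year, c in counts.items() if c > 2 * avg]
print(bursty_years)
```

With these invented counts the average is about 9.7, so only 2007 clears the 2x threshold; a real analysis would also model the baseline's drift over time rather than using a single global average.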

Premium Essay

...NINTH EDITION Principles of Information Systems A Managerial Approach Ninth Edition Ralph M. Stair Professor Emeritus, Florida State University George W. Reynolds Australia · Canada · Mexico · Singapore · Spain · United Kingdom · United States Principles of Information Systems, A Managerial Approach, Ninth Edition by Ralph M. Stair and George W. Reynolds VP/Editorial Director: Jack Calhoun Senior Acquisitions Editor: Charles McCormick, Jr. © 2010 Course Technology, Cengage Learning ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced, transmitted, stored or used in any form or by any means graphic, electronic, or mechanical, including but not limited to photocopying, recording, scanning, digitizing, taping, Web distribution, information networks, or information storage and retrieval systems, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the publisher. Product Manager: Kate Hennessy For product information and technology assistance, contact us at Cengage Learning Customer & Sales Support, 1–800–354–9706 Development Editor: Lisa Ruffolo, The Software Resource For permission to use material from this text or product, submit all requests online at cengage.com/permissions Further permissions questions can be emailed to permissionrequest@cengage.com Editorial Assistant: Bryn Lathrop Content Product Managers: Erin Dowler, Jennifer Goguen......

Words: 375753 - Pages: 1504

Premium Essay

...Partnering with Industry for a Computer Science and Engineering Capstone Senior Design Course Ken Christensen, Dewey Rundus, and Zornitza Genova Prodanoff Abstract A capstone design course is an important component of the senior year curriculum for engineering students and plays a key role in achieving departmental ABET EC 2000 outcomes. In the Department of Computer Science and Engineering at the University of South Florida, we have partnered with industry to have students work in teams on industry-contributed “real world” projects. Industry partners contribute projects, mentor students, give a guest lecture, and provide the opportunity for students to present their project at the industry site. Students work on projects in teams and are given milestones and schedules to follow. The milestones include formal requirements, specifications and design, prototype demonstration, test plan, and final project delivery and presentation. The final presentation includes a project demonstration, user documentation, press release, and a poster that is permanently displayed in the department seminar room. The course includes formal lecture and reading assignments on the development process. A midterm exam covers these topics. Soft topics include discussions on working in teams. Our industry partners stress the importance of students being able to work well in teams. We hope that our experience can serve as a guide for other engineering departments......

Words: 4974 - Pages: 20

Free Essay

...Structure & Development of the New South Wales (NSW) Primary Curriculum The New South Wales Primary Curriculum provides the framework for the outcomes-based education currently in use in all public schools in New South Wales. This essay will present a brief overview of the structure, definition, goals, influences, and processes of the curriculum, and show how it meets the needs of current and future learners. Drawing from various sources, an examination of the curriculum’s content and foundations provides a snapshot of where the educational direction is headed. Curriculum Structure and Development In 2004 the Board of Studies NSW developed the consultation paper, Defining Mandatory Outcomes in the K–6 Curriculum, which also involved surveys, submissions and state-wide consultation meetings with teachers across NSW. This process helped to bring about the current NSW Primary Curriculum Foundation Statements. Collaborating with teachers and educational professionals, the Board of Studies NSW developed statements that give clear direction on what must be taught through each of the stages of learning in the K-6 curriculum (The Board of Studies NSW, 2007). The NSW Primary Curriculum is structured into six key learning areas (KLAs): English; Mathematics; Science and Technology; Human Society and its Environment; Creative Arts; and Personal Development, Health and Physical Education (PDHPE). The KLAs, along with the syllabus, remain at the core of planning and programming, and are......

Words: 2197 - Pages: 9

Premium Essay

...The Fluidity of Computer Science. Gender Norms & Racial Bias in the Study of the Modern "Computer Science" Computer science, or computing science, designates the scientific and mathematical approach to computation. A computer scientist is a scientist who specialises in the theory of computation and the design of computers. Its subfields can be divided into purely theoretical areas and practical techniques for implementation and application in computer systems. Some, such as computational complexity theory, which studies fundamental properties of computational problems, are highly abstract, while others, such as computer graphics, emphasize real-world applications. Still others focus on the challenges in implementing computations. For example, programming language theory studies approaches to the description of computations, while the study of computer programming itself investigates various aspects of the use of programming languages and complex systems, and human-computer interaction focuses on the challenges in making computers and computations useful, usable, and universally accessible to humans. Computer science deals with the theoretical foundations of information and computation, and with practical techniques for their implementation and application. History The earliest foundations of what would become computer science predate the invention of the modern digital computer. Machines for calculating fixed numerical tasks such as the abacus have existed since......

Words: 2298 - Pages: 10

Premium Essay

...Computer Science Illuminated, by Nell Dale, University of Texas, Austin, and John Lewis, Villanova University. Jones and Bartlett Computer Science. Jones and Bartlett Publishers is pleased to provide Computer Science Illuminated’s book-specific website. This site offers a variety of resources designed to address multiple learning styles and enhance the learning experience. Goin’ Live This step-by-step HTML Tutorial will guide you from start to finish as you create your own website. With each lesson, you’ll gain experience and confidence working in the HTML language. Online Glossary We’ve made all the key terms used in the text easily accessible to you in this searchable online glossary. The Learning Store Jones and Bartlett Publishers has a wealth of material available to supplement the learning and teaching experience. Students and instructors will find additional resources here or at http://computerscience.jbpub.com The Language Library Here you will find two complete chapters that supplement the book’s language-neutral approach to programming concepts. A JAVA language chapter and C++ language chapter are included and follow the same pedagogical approach as the textbook. http://csilluminated.jbpub.com eLearning Our eLearning center provides......

Words: 67693 - Pages: 271