A paper under this title is published this month in a special issue of the IEEE Data Engineering Bulletin. Available online at:
http://sites.computer.org/debull/A08dec/saxonica.pdf
Of course, nearly all of what it says applies equally to XSLT. But for some reason, the academic community continues to be much more interested in XQuery.
|
|
||||
|
Ten Reasons why Saxon XQuery is Fast
Comments
Re: Ten Reasons why Saxon XQuery is Fast
Interesting.
I am curious about the Tiny Tree/Linked Tree performance comparisons (the DOM doesn't surprise me). The design goal was improved memory usage for Tiny Tree, yet - the memory usage is the metric where Tiny Tree shows only a very small improvement over Linked Tree (370Mb vs 327Mb) - the build time for Linked Tree is much worse than Tiny Tree; the only reason I can think of is Java object allocation overhead. Is there anything else going on here? - the query time for Linked Tree is more than 6 times worse than for Tiny Tree. Why is that? Do you store a document order index in the linked tree nodes? I would also be interested to hear more about what schema information you use, both from the PSVI and directly from the schema. Obviously there is the simple type information for element and attribute values. You also mentioned inferring more explicit paths. I'm wondering whether it would be practical to generate this from RELAX NG. Re: Re: Ten Reasons why Saxon XQuery is Fast
Thanks for the comments. I prepared this article to a rather tight deadline so I didn't have time to do as much analysis and confirmation of these numbers as I would have liked (and there wouldn't have been space to publish it anyway), so my explanations are a bit conjectural.
* I would have expected a bigger difference on memory usage between TinyTree and LinkedTree, and I need to investigate this one further. * On the build time, I don't find the figures surprising; no doubt the difference is partly Java object allocation, and partly overheads within Saxon itself (I've done much more optimisation on the TinyTree), and I would hesitate to guess in what ratio. * The difference in navigation time is probably particularly marked for this example, a search of the descendant axis where few nodes match the nodetest. The basic loop for the Tinytree just says {currentNodeNumber++; stop if the depth isn't greater than the starting depth; check if the node matches the nodetest (without even instantiating a node object unless it does match)}. That's essentially an integer increment and two integer comparisons (and one method call) for each descendant skipped over. For the LinkedTree the equivalent loop looks to see if there are any children, then tries the next sibling, then tries the next sibling of each ancestor in turn: it's much more complex logic and will probably involve at least five method calls in a typical case, in addition to testing the nodetest. As for schema information: the Saxon schema processor doesn't actually generate a full PSVI. All it does (apart from validation!) is to generate the type annotation for each node, essentially a reference to a simple type or complex type. At run-time this information is essentially used only for atomization and type checking in the obvious way. At query/stylesheet compile time, Saxon attempts to do static type inference for every expression, and the main use of schema information here is with path expressions: given $x/e, if the schema type of $x is known, then the schema type of $x/e can be inferred. This is done for the child, descendant, and attribute axes only. There is also some "pre-validation" of output structures, for example detecting that <table><td/></table> is going to generate invalid output against the XHTML schema: again this only uses knowledge of which elements can appear as children/attributes of which others. I suspect one could do quite a lot of this with a Relax-NG schema. There isn't a very heavy reliance on the Element Declarations Consistent constraint - EDC doesn't stop you having two types derived by extension from E that both include an element F but with different types, so Saxon has to allow for this possibility, which means that EDC doesn't simplify the logic. And UPA makes no difference to the rules because Saxon is never (at present) analyzing the permitted sequences of child elements, only which elements are allowed as children and which aren't. Re: Re: Ten Reasons why Saxon XQuery is Fast
I have done some more careful measurements on the memory usage of TinyTree and LinkedTree for this dataset - building the documents in the absence of a query to isolate the memory used for the document itself. I've written a wiki page showing how to determine the sizes both by measurement and by calculation, and the two agree reasonably closely: see http://saxon.wiki.sourceforge.net/EstimatingTreeSize
Re: Ten Reasons why Saxon XQuery is Fast
by
Andy Agrawal
on Thu 28 May 2009 05:20 BST | Profile | Permanent Link
Could you elaborate a little on this paragraph from Section 3.3:
"The main aspect of the analysis is determining combinations of axis steps that are “naturally sorted”. This is the case for any sequence of child axis steps. It is also true for an expression such as a/b//c, but not (perhaps surprisingly) for a//b/c. There is no space here to give the rules in detail." ? Perhaps an example XDM tree to demonstrate why a//b/c is not naturally sorted would suffice. Re: Re: Ten Reasons why Saxon XQuery is Fast
Consider <A><B><B><C/></B><C/></B></A>.
The nested loop "for $b in //B return $b/C" will return the second C first (because its parent comes before the parent of the first C). But the path expression //B/C has to return the first C first. |
Search
Recent Comments
Recent Articles
Month Archive
|
|||