A user of Saxon on .NET, Don Burden, has been doing some performance tests:
https://sourceforge.net/forum/forum.php?thread_id=1493510&forum_id=94027
At first sight the figures are not especially good: 227 transformations per second against 613 for the System.Xml.Xsl transformer. However, closer analysis shows that a great deal of the cost is in converting the System.Xml DOM into a Saxon tree prior to performing the real transformation. This isn't really a surprise - the API documentation contains some clear warnings about the cost of doing this (it's far better, when you can, to build a native Saxon tree directly from raw XML - the same is true for the Java product).
On the other hand, if you want to manipulate your data from C# and then transform it, you haven't got much choice other than to do this conversion, so one can't argue that the measurement is untypical of at least some real applications.
So I decided to pull my fingers out and implement the alternative "wrapper" approach, where a set of lightweight objects are wrapped around the DOM nodes to deliver the Saxon NodeInfo interface. This turned out to be a pretty straight port of the Java code that does the same job wrapping the Java DOM - as one might expect, the two DOM interfaces differ mainly in details of method naming and the like.
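The wrapper idea can be sketched in a few lines of Java against the standard org.w3c.dom API. This is only an illustration: `SimpleNodeInfo` is a hypothetical, heavily reduced stand-in for Saxon's NodeInfo interface (the real one has many more methods), and `DomNodeWrapper` and `wrap` are names made up for this sketch, not Saxon classes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomWrapperSketch {

    /** Hypothetical, much-reduced stand-in for Saxon's NodeInfo interface. */
    public interface SimpleNodeInfo {
        String getLocalPart();
        String getURI();               // "" means "no namespace" in this sketch
        String getStringValue();
        SimpleNodeInfo getParent();    // null for the document node
        List<SimpleNodeInfo> children();
    }

    /** Lightweight wrapper: holds a reference to the DOM node and copies nothing. */
    public static final class DomNodeWrapper implements SimpleNodeInfo {
        private final Node node;

        public DomNodeWrapper(Node node) { this.node = node; }

        @Override public String getLocalPart() {
            short type = node.getNodeType();
            if (type != Node.ELEMENT_NODE && type != Node.ATTRIBUTE_NODE) return "";
            // getLocalName() is null for nodes created without namespace support
            String local = node.getLocalName();
            return local != null ? local : node.getNodeName();
        }

        @Override public String getURI() {
            String uri = node.getNamespaceURI();
            return uri == null ? "" : uri;
        }

        @Override public String getStringValue() {
            if (node instanceof Document) {
                // getTextContent() is defined to return null on a Document node
                Node root = ((Document) node).getDocumentElement();
                return root == null ? "" : root.getTextContent();
            }
            String text = node.getTextContent();
            return text == null ? "" : text;
        }

        @Override public SimpleNodeInfo getParent() {
            // Simplification: attributes would need getOwnerElement() instead
            Node parent = node.getParentNode();
            return parent == null ? null : new DomNodeWrapper(parent);
        }

        @Override public List<SimpleNodeInfo> children() {
            List<SimpleNodeInfo> result = new ArrayList<>();
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                result.add(new DomNodeWrapper(c));   // wrappers are created on demand
            }
            return result;
        }
    }

    /** Parses the XML into a DOM and wraps the document node. */
    public static SimpleNodeInfo wrap(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return new DomNodeWrapper(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        SimpleNodeInfo root = wrap("<a xmlns='urn:x'><b>hi</b><c/></a>").children().get(0);
        System.out.println(root.getLocalPart() + " in namespace " + root.getURI());
    }
}
```

The point of the design is that no tree is copied up front: each wrapper is a single object holding one reference, created only when the processor actually navigates to that node.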
As always happens in these cases, the Microsoft API documentation is woefully inadequate. When a method expects a namespace URI, is the null namespace "null" or the empty string? How exactly do EntityReference nodes behave? Often one simply has to do experiments to find out.
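The kind of experiment meant here is easy to show on the Java side, where the W3C DOM specifies that `getNamespaceURI()` returns null for a node in no namespace. The sketch below probes that behaviour and normalises the answer to the empty string; `NamespaceProbe`, `normalizeUri` and `probe` are names invented for this illustration, and it says nothing about how System.Xml itself behaves.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NamespaceProbe {

    /** Maps the DOM's "no namespace" value (null) to the empty string. */
    public static String normalizeUri(String uri) {
        return uri == null ? "" : uri;
    }

    /** Creates three elements in different ways and reports their namespace URIs as text. */
    public static String[] probe() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element level1 = doc.createElement("a");                     // DOM Level 1 style
            Element noNs = doc.createElementNS(null, "b");               // explicitly no namespace
            Element inNs = doc.createElementNS("urn:example", "p:c");    // in a namespace
            return new String[] {
                String.valueOf(level1.getNamespaceURI()),
                String.valueOf(noNs.getNamespaceURI()),
                String.valueOf(inNs.getNamespaceURI())
            };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String uri : probe()) {
            System.out.println("raw=" + uri);
        }
    }
}
```

A wrapper that funnels every namespace lookup through one helper like `normalizeUri` only has to answer the "null or empty string" question in a single place.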
For some things, even experimenting won't give you the answer. Does the System.Xml DOM allow you to assume that you'll never have two different objects representing the same node? The rumours suggest that this is a safe assumption, but it's not actually written down in the spec. Sometimes you just have to go for it, and wait for the bug reports if you get it wrong.
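To make the assumption concrete, here is a minimal Java sketch of how a wrapper might decide whether two wrappers stand for the same node. The `Wrapper` class and its method names are hypothetical. The reference comparison (`==`) is only sound if the DOM never hands out two objects for one node; on the Java side, DOM Level 3 offers `isSameNode` as a fallback that does not depend on that.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class NodeIdentitySketch {

    /** Hypothetical wrapper whose identity is delegated to the underlying DOM node. */
    public static final class Wrapper {
        private final Node node;

        public Wrapper(Node node) { this.node = node; }

        /** True if both wrappers stand for the same underlying node. */
        public boolean isSameNodeInfo(Wrapper other) {
            // Fast path relies on "one object per node"; isSameNode does not.
            return node == other.node || node.isSameNode(other.node);
        }

        @Override public boolean equals(Object o) {
            return o instanceof Wrapper && isSameNodeInfo((Wrapper) o);
        }

        @Override public int hashCode() {
            // Consistent with equals only under the "one object per node" assumption
            return System.identityHashCode(node);
        }
    }

    /** Reaches the same element by two different routes and compares the wrappers. */
    public static boolean sameViaTwoRoutes() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            root.appendChild(doc.createElement("child"));

            Node viaAxis = root.getFirstChild();
            Node viaSearch = doc.getElementsByTagName("child").item(0);
            return new Wrapper(viaAxis).equals(new Wrapper(viaSearch));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("same node via two routes: " + sameViaTwoRoutes());
    }
}
```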
Developing this, I decided that it was high time I had a better set of tests for NodeInfo implementations. There are two "complete" NodeInfo implementations in the product (the tiny tree and the linked tree), there are a number of specialised implementations for special cases such as <xsl:variable>XYZ</xsl:variable> (a document node containing a single text node), and there are now five implementations that wrap third-party object models - DOM, DOM4J, XOM, and JDOM on Java, and now the System.Xml DOM on .NET. I've also done a couple of custom implementations for Saxonica clients. Until now I've run a selected set of XSLT stylesheets to validate these implementations, but it was becoming unwieldy. So I've started writing a set of unit tests to do the testing at the Java level.
I also decided it was time I got cross-language debugging to work. Until now, to find Java bugs that only occur in a .NET environment, I've had to resort to System.err.println debugging, which is tedious to say the least (good discipline, though: it makes you think twice when you have to wait ten minutes for your next test run. Memories of the days when you had to wait 24 hours...). But I knew that it was in principle possible to get the Visual Studio debugger to step into the Java code line-by-line, and sure enough, after a frustrating day or two, I've cracked the problem. There seem to be a lot of configuration settings that have to be exactly right: the final one was ensuring that when cross-compiling the saxon JAR into a .NET assembly, the resulting assembly is unsigned and unversioned.
Next step is to do some performance measurements to see how far I've closed the gap against System.Xml.Xsl. Not that Saxon is primarily competing on performance - it's the productivity benefits in XSLT 2.0 that will really influence people - but it would be nice if it's in the same ballpark. Don Burden's measurements show that's not an unreasonable target - even when using a DOM as input, which will never be the fastest way of running Saxon.
Wrapping the .NET DOM
Comments
Re: Wrapping the .NET DOM
This is awesome! Thanks for both the extended information and all of the work that went into its discovery :)
I will definitely be incorporating all of this new knowledge into my current project set related to Saxon on .NET and beyond.
Re: Wrapping the .NET DOM
Wolfgang Hoschek sent me the following comments:
here are some notes that don't pertain to .NET but rather to some unqualified generalization:
> it's far better, when you can, to build a native Saxon tree directly from raw XML - the same is true for the Java product.
I think the performance characteristics aren't as simple as suggested. There are at least 4 phases that contribute to overall execution time:
1. xml parsing / tree building
2. query compilation
3. query execution
4. serialization
Depending on the app some of these phases may be much more expensive than others. Some phases may be completely absent or amortized over many executions. For example, running millions of trivial (sequential) XPath queries per second over records from large XML streams exhibits a quite different profile than dynamic generation of HTML for publishing via XSLT, which in turn is different from running occasional complex queries over very large documents (databases). Queries/transforms that construct new trees are very different from searches, etc. Memory consumption may or may not be an issue.
Having said that, here are some surprising yet reproducible observations based on tuning the XOM NodeWrapper:
1) XML parsing / tree building is significantly faster with the XOM model than with the Saxon TinyTree model (e.g. with SAX/xerces-2.8.0)
2) serialization is faster with XOM than with TinyTree
I don't have reproducible numbers on the following observations, but at least anecdotal evidence suggests that:
3) query execution can go either way, depending on the type of query. The outcome is on a case by case basis. Constructing a new tree is the worst case scenario for the XOM wrapper, but often that's not needed at all. Not everyone's publishing HTML.
4) DOM is far from the best random access tree model, and the Saxon DOM NodeWrapper doesn't look like it's been optimized at all, compared to the amount of work that has gone into TinyTree (e.g. the generic AxisIterators used by DOM can be a large time sink).
I wouldn't be surprised if some work on the DOM NodeWrapper would yield significantly better results for the query execution phase.
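The four phases listed in the comment above can be timed separately with a small harness. The sketch below is an assumption-laden illustration: it uses the JAXP processor bundled with the JDK rather than Saxon, the class name `PhaseTimingSketch` and the sample documents are invented, and a single `System.nanoTime()` run per phase is far too noisy for real conclusions (a serious measurement would warm up the JVM and average many runs).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PhaseTimingSketch {

    static final String XML = "<doc><item>1</item><item>2</item></doc>";
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/'><out><xsl:value-of select='count(//item)'/></out></xsl:template>"
      + "</xsl:stylesheet>";

    /** Returns nanoseconds spent in: parse/build, compile, execute, serialize. */
    public static long[] timePhases(String xml, String xsl) {
        try {
            // Phase 1: XML parsing / tree building (here: a DOM)
            long t0 = System.nanoTime();
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document input = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            // Phase 2: stylesheet compilation (can be amortized by reusing Templates)
            long t1 = System.nanoTime();
            TransformerFactory tf = TransformerFactory.newInstance();
            Templates compiled = tf.newTemplates(new StreamSource(new StringReader(xsl)));

            // Phase 3: execution, with the result kept as a tree
            long t2 = System.nanoTime();
            DOMResult result = new DOMResult();
            compiled.newTransformer().transform(new DOMSource(input), result);

            // Phase 4: serialization of the result tree via an identity transformer
            long t3 = System.nanoTime();
            StringWriter out = new StringWriter();
            tf.newTransformer().transform(new DOMSource(result.getNode()), new StreamResult(out));
            long t4 = System.nanoTime();

            return new long[] { t1 - t0, t2 - t1, t3 - t2, t4 - t3 };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] names = { "parse/build", "compile", "execute", "serialize" };
        long[] nanos = timePhases(XML, XSL);
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + ": " + nanos[i] / 1000 + " microseconds");
        }
    }
}
```

Separating the phases this way makes it visible which of them dominates for a given workload, which is the distinction the comment is drawing.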