My heart sank when I received an email yesterday from a Saxonica customer reporting a failure when running a transformation on a 20Gb source file (using the "streaming" capabilities described at http://www.saxonica.com/documentation/sourcedocs/serial.html). It was apparently failing halfway through with no error message of any kind. How do you go about debugging such a problem, especially when you're on a different continent from the client?
Well, every problem is an opportunity, and this was my first chance to experiment with a transformation on this kind of scale - in fact, it's probably 100 times bigger than anything I have personally run before. And I was quite pleasantly surprised that it's more manageable than I expected. Downloading the 2Gb compressed file ran without problems while I was having dinner; decompressing it (using WinRAR) ran again without problems while I watched the news on TV.
I decided first to try my old DTDGenerator application on the source data, to check that the XML was well-formed. This is a pure SAX application with no dependencies on Saxon code. The first attempt ran out of memory; I discovered that the DTDGenerator was (unintentionally) keeping a hash table of all distinct attribute values. So I fixed this bug (first code change since 2001!) and fired it off again. After 40 minutes I decided I needed to get on with other work on my laptop and killed it, starting it again on the shiny new Vista machine sitting in the corner, once I had installed Saxon and Java on it. The DTDGenerator ran successfully on that machine in about 25 minutes, so I started up the transformation proper, which took around 50 minutes, producing a 1.9Gb CSV file as output, running with a constant memory footprint of around 20Mb. That's a processing rate of about 6Mb/sec, which doesn't seem bad.
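To illustrate the kind of bug described above, here is a minimal sketch of a SAX handler that checks well-formedness while keeping memory bounded. It is not the actual DTDGenerator code: the class name `BoundedAttributeScan`, the cap `MAX_DISTINCT`, and the idea of discarding an attribute's value set once it passes the cap are all assumptions made for illustration. The point is only that an uncapped set of distinct attribute values grows with the size of the input, whereas a capped one does not.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only; names and the cap value are hypothetical.
public class BoundedAttributeScan extends DefaultHandler {
    // Assumed cap: stop remembering values for an attribute once it
    // has more than this many distinct values.
    static final int MAX_DISTINCT = 20;

    long elementCount = 0;

    // "element/@attribute" -> distinct values seen so far.
    // A null value means the cap was exceeded and tracking has stopped.
    final Map<String, Set<String>> values = new HashMap<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        elementCount++;
        for (int i = 0; i < atts.getLength(); i++) {
            String key = qName + "/@" + atts.getQName(i);
            if (values.containsKey(key) && values.get(key) == null) {
                continue; // already over the cap, remember nothing more
            }
            Set<String> seen = values.computeIfAbsent(key, k -> new HashSet<>());
            seen.add(atts.getValue(i));
            if (seen.size() > MAX_DISTINCT) {
                values.put(key, null); // release the set, keep only the marker
            }
        }
    }

    // Parses the stream; the parser throws a SAXException if the
    // input is not well-formed.
    static BoundedAttributeScan scan(InputStream in) throws Exception {
        BoundedAttributeScan handler = new BoundedAttributeScan();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            BoundedAttributeScan h = scan(in);
            System.out.println("Well-formed; elements seen: " + h.elementCount);
        }
    }
}
```

With a cap like this, the memory used depends on the number of distinct element and attribute names in the vocabulary rather than on the number of attribute values in the file.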
It's noticeable that the new machine is only running at 25-30% CPU - it's one of the shiny new quad-core processors and of course Saxon isn't fully taking advantage of it. The streaming process actually uses two threads, but I think one of them is doing most of the work. When I bought the box I was hoping to do some experiments on multithreading, and this job seems a good one to start on.
I haven't solved the customer's problem yet, but I've shown that there's at least one environment in which Saxon can handle the job, which is reassuring. Now it's a question of finding out what's different in their environment - or indeed, whether they can reproduce the problem systematically at all.
Transforming 20 Gigabytes
Comments
Re: Transforming 20 Gigabytes
by Gary on Thu 25 Oct 2007 21:42 BST
I thought it might be interesting to try this out on the wikipedia3 RDF/XML (http://labs.systemone.at/wikipedia3), which unzipped to just under 4GB. I am using the 30-day trial of Saxon. So far all I get as a result is 'java.lang.OutOfMemoryError: Java heap space', even with the simplest function, similar to the one in the documentation. I have 4GB of RAM and 2 twin core processors, and am using the ant saxon-xslt task in the latest build. The ANT_OPTS is set to -Xms268M -Xmx1280M -XX:MinHeapFreeRatio=20 -XX:+HeapDumpOnOutOfMemoryError and I am on Windows XP. I would be grateful if anyone could tell me how to configure a machine to process this file. Is it better to use Linux for memory allocation? Is there a way I can give Java more memory to process this?
Re: Re: Transforming 20 Gigabytes
Clearly you're not succeeding in finding the magic formula that invokes serial processing. This blog isn't the best vehicle for providing technical support: please use the saxon-help mailing list or forum on SourceForge to explain exactly what you are doing in the stylesheet, and then there's a chance I can identify what you are doing wrong. Serial processing is *very* sensitive to slight variations in the XPath expressions that you use. For 4Gb, you're not going to succeed with an in-memory transformation, so allocating more memory won't help; you need to find the formula for serial processing, in which case you need far less memory than you are giving it.