
Large XML file seems not to be "streaming", eating GBs of RAM

jakeonrails opened this issue 13 years ago • 22 comments

I have a 1.6 GB XML file, and when I parse it with Sax Machine it does not seem to be streaming or eating the file in chunks. Rather, it appears to be loading the whole file into memory (or maybe there is a memory leak somewhere?), because my Ruby process climbs upwards of 2.5 GB of RAM. I don't know where it stops growing, because I ran out of memory.

On a smaller file (50 MB) it also appears to load the whole file. My task iterates over the records in the XML file and saves each record to a database. It takes about 30 seconds of "idling", and then all of a sudden the database queries start executing.

I thought SAX was supposed to let you work with large files like this without loading the whole thing into memory.

Is there something I am overlooking?

Many thanks,

@jakeonrails

jakeonrails avatar Feb 08 '12 19:02 jakeonrails
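For context on what "streaming" means here: a SAX-style parser hands the application events as it reads the input, instead of building the whole document in memory first. A minimal sketch using Ruby's stdlib REXML stream parser (for illustration only; this is not sax-machine's API, and the `<record>` element name is made up):

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'
require 'stringio'

# A listener that counts <record> elements as they stream past,
# without ever building a DOM of the whole document.
class RecordCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, _attrs)
    @count += 1 if name == 'record'
  end
end

xml = StringIO.new('<records><record/><record/><record/></records>')
counter = RecordCounter.new
REXML::Parsers::StreamParser.new(xml, counter).parse
puts counter.count # => 3
```

With this model, memory use depends on what the listener keeps, not on the size of the input, which is the behavior the issue expected from sax-machine.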

@jakeonrails If this issue is still on your radar, can you see if the same problem occurs with current master? The behavior you describe does sound like a bug, but before digging deeper, I'd like to verify the issue wasn't resolved by upgrading Nokogiri to v1.5

ezkl avatar May 29 '12 22:05 ezkl

@ezkl I can't take time to test this right now, but I will try to test in the next couple of days. I hope it does work now, since parsing with sax-machine was a lot cleaner than what I resorted to a few months ago, which was to use a monkey-patched Nokogiri Reader to parse out the chunk of XML for the node I want and pass that to sax-machine.

jakeonrails avatar May 29 '12 23:05 jakeonrails

@jakeonrails What method are you using to load the XML file?

ezkl avatar May 30 '12 00:05 ezkl

@ezkl I ended up using this technique here: http://stackoverflow.com/a/9223767/586983

You can see my original code which spurred me to write this github issue at the top of that SO question.

jakeonrails avatar May 30 '12 00:05 jakeonrails

@jakeonrails Thanks for the link and background info. I had a bit of a brain fart yesterday. Don't bother testing against HEAD at the moment. A streaming interface was implemented by @gregwebs, but his work was never merged into master (see: #18 and #24). I've been using a fork that includes Greg's work in production without issue for nearly a year, but never with XML files quite as large as yours. Once I've finished merging Greg's work, I'd love to get your feedback on performance with large files.

ezkl avatar May 30 '12 08:05 ezkl

I have been using my fork on files of about that size in production.

gregwebs avatar May 30 '12 14:05 gregwebs

We need this large-file support; Heroku 512 MB workers really struggle with large XML parsing.

+1 on merging #18 and #24.

speedmax avatar Jun 01 '12 13:06 speedmax

So now we have two issues open; we should probably close one.

gregwebs avatar Jun 01 '12 14:06 gregwebs

Has this been merged?

chtrinh avatar Jun 06 '13 07:06 chtrinh

+1. Though, is the merge both straightforward and non-controversial?

mrjcleaver avatar Aug 03 '13 13:08 mrjcleaver

I suppose another question might be: would pauldix prefer that gregwebs be made the new maintainer of the Ruby gem? It's a bit confusing having multiple versions.

mrjcleaver avatar Aug 03 '13 13:08 mrjcleaver

I will not be a maintainer, but @ezkl might

gregwebs avatar Aug 03 '13 16:08 gregwebs

@ezkl? Would you be willing & able? @pauldix - what's your preference? Thx, M.

mrjcleaver avatar Aug 03 '13 17:08 mrjcleaver

What's the status of this? What can I do to help with getting this merged from @gregwebs's branch? @ezkl?

jmehnle avatar Sep 23 '13 23:09 jmehnle

If someone submits a PR I'll merge it in.

pauldix avatar Sep 25 '13 15:09 pauldix

Hi, I opened this PR #47, I think it should solve the problem. Feedback is well appreciated.

eljojo avatar Nov 28 '13 04:11 eljojo

That solves a different use case than the one I had. My branch allows for giant streaming collections.

gregwebs avatar Nov 28 '13 14:11 gregwebs

From my point of view, the main point of using the SAX interface is streaming, rather than reading everything into memory at once. Does the current sax-machine release support any kind of streaming? I don't think so. I'm curious what uses people have for SAX without streaming, but that's another topic, I guess.

jrochkind avatar Dec 02 '13 19:12 jrochkind

Has this issue been abandoned?

doomspork avatar Jun 15 '14 19:06 doomspork

It looks like all the effort by @ezkl and @gregwebs is now far behind current master, so it's not possible to review/merge those changes.

I don't think streaming features will be added to sax-machine in the near future unless someone is willing to reimplement/port that work. So it basically stays usable for small XML files, especially RSS/Atom feeds via feedjira.

For streaming, I'd suggest considering Nokogiri SAX or Ox SAX.

krasnoukhov avatar Jul 17 '14 14:07 krasnoukhov

I'm using sax-machine-1.2.0 and nokogiri-1.6.3, parsing a 1 GB XML file by passing an IO object to a SAXMachine parser, and it appears to be streaming, seeing as the virtual memory usage of the process doesn't go above 100 MB.

open('/huge_soap.xml') { |f| MySAXMachine.parse(f) }

The mixin passes xml_text to the handler, and the Nokogiri handler passes it directly to the backend parser, so I think that as long as Nokogiri's SAX parser streams the given IO object (and it appears to), it's all good.

I didn't test with Ox or Oga, but it looks like the Ox handler expects xml_text to be a string and can't currently stream from an IO. I don't see an obvious reason the StringIO wrapping of xml_text couldn't be made conditional, so that an IO could be passed to SAXMachine#parse when using Ox as the backend parser.

torbiak avatar Jan 09 '15 22:01 torbiak
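The conditional wrapping torbiak describes could look roughly like this (a hypothetical sketch with an illustrative `ensure_io` helper, not the gem's actual code):

```ruby
require 'stringio'

# Hypothetical helper: anything that already behaves like an IO is passed
# through untouched, so the backend parser can stream it; a plain String
# is wrapped in a StringIO so both input types work.
def ensure_io(xml)
  xml.respond_to?(:read) ? xml : StringIO.new(xml)
end

puts ensure_io('<a/>').read               # => <a/>
puts ensure_io(StringIO.new('<b/>')).read # => <b/>
```

The `respond_to?(:read)` duck-typing check is one common Ruby idiom for this; checking `is_a?(String)` would work equally well here.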

@torbiak Thanks for looking into this. Regarding IO parsing, you're totally correct; I've changed the Ox handler to support both String and IO, please see the attached commit.

I'm wondering why you're getting such a good memory footprint. Can I see the full example you're running? I'm thinking about putting some benchmarks together, and your example could help.

krasnoukhov avatar Jan 11 '15 13:01 krasnoukhov