Segmentation fault

Open jkramer opened this issue 12 years ago • 6 comments

I'm getting a segmentation fault when parsing a relatively large XML file (289M). I'm not sure if the problem is in XML::Twig, Expat or Perl. Here's a stack trace:

    #0  0x00000000004a0320 in Perl_sv_setsv_flags ()
    #1  0x0000000000493121 in Perl_pp_sassign ()
    #2  0x0000000000492783 in Perl_runops_standard ()
    #3  0x00000000004c2215 in S_docatch ()
    #4  0x0000000000492783 in Perl_runops_standard ()
    #5  0x0000000000434bbf in Perl_call_sv ()
    #6  0x00007ffff2195197 in endElement () from /home/jkramer/perl5/perlbrew/perls/perl-5.16.0/lib/site_perl/5.16.0/x86_64-linux/auto/XML/Parser/Expat/Expat.so
    #7  0x00007ffff1f69a34 in ?? () from /usr/lib/../lib/libexpat.so.1
    #8  0x00007ffff1f6ac91 in ?? () from /usr/lib/../lib/libexpat.so.1
    #9  0x00007ffff1f6ca9d in XML_ParseBuffer () from /usr/lib/../lib/libexpat.so.1
    #10 0x00007ffff218c8e7 in parse_stream () from /home/jkramer/perl5/perlbrew/perls/perl-5.16.0/lib/site_perl/5.16.0/x86_64-linux/auto/XML/Parser/Expat/Expat.so
    #11 0x00007ffff218cf70 in XS_XML__Parser__Expat_ParseStream () from /home/jkramer/perl5/perlbrew/perls/perl-5.16.0/lib/site_perl/5.16.0/x86_64-linux/auto/XML/Parser/Expat/Expat.so
    #12 0x0000000000499aec in Perl_pp_entersub ()
    #13 0x0000000000492783 in Perl_runops_standard ()
    #14 0x000000000043a015 in perl_run ()
    #15 0x000000000041f39d in main ()

I'm using XML::Twig v3.41 (from http://www.xmltwig.org/xmltwig/XML-Twig-3.41.tar.gz), Perl 5.16 and expat 2.1.0-1 (the Arch Linux package; I'm not sure whether they patch anything or build it from the original source).

The problem occurs after parsing a large file for a while; I can't tell yet whether it's caused by a specific action. I'll post if I find out more.

jkramer avatar Oct 08 '12 15:10 jkramer

Did you try parsing the file with just XML::Parser? Something like `perl -MXML::Parser -E'XML::Parser->new->parsefile( "foo.xml")'`

If you still get a segfault, then the problem is probably not in XML::Twig.

mirod

mirod avatar Oct 08 '12 16:10 mirod

I tried your XML::Parser line with the file and didn't get a seg. fault. It also took only 2 seconds and memory usage didn't even show up in top, while with XML::Twig it crashed after about 3 minutes, having used over 60% of memory (~4.7G). I've assembled a tar package with the test scripts to reproduce the error, a sample XML file and the error output here: http://pastelink.me/dl/9f829a (careful please, the anonymized sample XML in there will be 200+M when decompressed even though the tar is only about 400K).

jkramer avatar Oct 09 '12 09:10 jkramer

I was about to file a bug report about a seg. fault in XML::Twig when I found that I had already done so 5 months ago, so I'm just adding some new data here in the hope that you can look into it. Here's a small sample script that reproduces the seg. fault for me: http://dpaste.com/1031197/. Here's the sample XML (~130M): http://privatepaste.com/download/bb1209e5f9 (better to download it with wget/curl, since the site doesn't seem to deliver the bz2 content type correctly). The problem doesn't occur with smaller files than that, and it seems to be related to how much memory is available to the script. I also updated to the latest versions of XML::Twig (latest stable and development) and Perl 5.16.3, which didn't help.

jkramer avatar Mar 22 '13 13:03 jkramer

Your script loads the entire XML in memory. So if the XML is large enough, you run out of memory.

If you want to process files that don't fit in memory, then you need to use the tools that XML::Twig gives you, like twig_roots, or flush/purge.
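As a rough illustration of the purge approach mentioned above, here is a minimal sketch that handles one element at a time and releases the parsed part of the tree after each one. The function name `count_elements_streaming` and the tag name passed to it are hypothetical; only `twig_handlers`, `purge` and `parsefile` come from XML::Twig itself. `purge` frees the tree without printing it, whereas `flush` would also output it.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: counts elements named $tag in $file while keeping
# memory use low by purging the twig after each matching element.
sub count_elements_streaming {
    my ($file, $tag) = @_;
    my $count = 0;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # The handler is called once each matching element is fully parsed.
            $tag => sub {
                my ($t, $elt) = @_;
                $count++;      # stand-in for the real per-element work
                $t->purge;     # discard everything parsed so far
            },
        },
    );

    $twig->parsefile($file);
    return $count;
}
```

With this shape the full document is never held in memory at once, so it only helps scripts that can do their work element by element.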

Does this help?

mirod

mirod avatar Mar 22 '13 14:03 mirod

I don't think that's the problem because the seg. fault occurs only after parsefile() has finished and the XML has been parsed completely (and successfully). I think it occurs when freeing the memory allocated by twig at destruction time. While it's true that purging the nodes in the example script fixes the seg. fault, it doesn't fix the problem in other scripts. That's why I didn't use purge in the example script, in order to reproduce the crash.

jkramer avatar Mar 22 '13 14:03 jkramer

This sounds like a problem I've run up against too. I thought I had filed a bug on it, but I don't see it here or on CPAN.

I tracked the crash down to Twig's recursive deallocation algorithm. When destroying a tree, it walks down to the deepest element, destroys it and then works its way back up the stack destroying each terminal node. For very large trees, then, you can end up overflowing your process's stack limits. To work around this, when working with very large trees, I walk the tree myself looking for nodes with very large numbers of children and then manually 'cut' them out of the tree without the recursion.
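The workaround described above might look roughly like the following sketch. It is not the commenter's actual code: the function name `detach_wide_nodes` and the threshold parameter are hypothetical, and whether it avoids the crash in a given script is an assumption. It walks the tree with an explicit work list instead of recursion and uses XML::Twig's `children` and `cut` methods to detach the children of any element that has more than the given number of them.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: detach the children of every element that has more
# than $max_children children. Returns the number of elements cut.
sub detach_wide_nodes {
    my ($twig, $max_children) = @_;
    my @queue = ($twig->root);    # explicit work list, no recursion
    my $cut   = 0;

    while (defined(my $elt = shift @queue)) {
        my @children = $elt->children;
        if (@children > $max_children) {
            $_->cut for @children;    # detach each child from the tree
            $cut += @children;
        }
        # Keep walking, including into cut subtrees, so wide nodes
        # deeper down are broken up as well.
        push @queue, @children;
    }
    return $cut;
}

# Possible usage once the work on the tree is done (threshold is a guess):
#   detach_wide_nodes($twig, 10_000);
```

The work list can grow as large as the tree's element count, so this trades stack depth for heap usage during teardown.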

pdwilson12 avatar Jul 03 '13 15:07 pdwilson12