
Really bad performance

Open demanuel opened this issue 4 years ago • 10 comments

Hi!

I'm trying to parse a 52 MByte XML file and the performance is really bad.

I'm trying to follow the instructions and just doing:

use XML;  # provides from-xml-file

my $XML = 'ec_inventory_en.xml';
sub MAIN(){
    my $xml = from-xml-file($XML);
}

This code uses more than 5 GB of memory [1], only one core is used [2], and it takes more than 3m30s (in comparison, a Perl version takes around 15 seconds to parse the file).

[1] - Reported by cat /proc/$PID/smaps | grep -i pss | awk '{Total+=$2} END {print Total/1024" MB"}'

[2] - htop image

demanuel avatar Apr 04 '21 11:04 demanuel

Hi, this module is written in pure Raku and is likely to suffer in comparison to parsers in other languages that may use a C library to do the same thing.

If you are concerned about performance then you may want to consider the LibXML binding https://modules.raku.org/dist/LibXML:cpan:WARRINGD

jonathanstowe avatar May 29 '21 07:05 jonathanstowe

I understand that, but taking 3m30s and 5 GiB just to load a 50 MB file is not OK.

demanuel avatar May 29 '21 14:05 demanuel

Sure, my intuition is that the memory usage is due to a very large number of objects, and the speed is due to the allocation of those objects. Someone could profile the code to determine whether unnecessary objects are being created and retained.

Another line of attack would be to profile the performance of the Grammar on its own (that is, without the action that builds the object tree) and see if any improvements can be made there.
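A minimal sketch of that comparison, using a toy grammar rather than the XML module's real one. ToyItems, ToyActions and the generated input are all made up for illustration; the point is only that the same input can be timed once with the bare grammar and once with an actions class attached.

```raku
# Toy grammar (hypothetical, NOT the XML module's grammar):
# a flat sequence of <item>text</item> elements.
grammar ToyItems {
    token TOP     { <entry>* }
    token entry   { '<item>' <content> '</item>' }
    token content { <-[ \< ]>* }
}

# Hypothetical actions class that builds one value per element.
class ToyActions {
    method TOP($/)   { make $<entry>.map(*.made).List }
    method entry($/) { make ~$<content> }
}

my $input = '<item>hello</item>' x 10_000;

# Pass 1: grammar only, no tree building.
my $t0 = now;
ToyItems.parse($input) or die 'parse failed';
my $grammar-only = now - $t0;

# Pass 2: same grammar, with the actions attached.
$t0 = now;
my $tree = ToyItems.parse($input, :actions(ToyActions)).made;
my $with-actions = now - $t0;

printf "grammar only: %.3fs\n", $grammar-only;
printf "with actions: %.3fs\n", $with-actions;
say "items built:  {$tree.elems}";
```

If the two timings are close, the grammar itself dominates; if the second is much larger, object construction is the place to look.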

There are probably some micro-optimisations that could be made within the code but you probably wouldn't want to start on that until profiling has revealed the places that would be of benefit.

jonathanstowe avatar May 29 '21 16:05 jonathanstowe

It's good to see it's not just me. I wanted to use XML::Entity::HTML, which depends on this module. It turns out that out of 160 seconds spent rendering a 1 MB HTML file on 4 cores, 140 went straight to escaping tags, even though that HTML barely had any tags to escape in the first place!

2colours avatar Jul 22 '22 18:07 2colours

This happens because of the way this module was structured. On the one hand, it's a great example of some very cool Raku features but… they're also ones that haven't been very well optimized. There are lots of 'but' mixins, Proxy containers, runtime name lookups like ::('Foo'), etc., all of which slow things down substantially.
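For readers who haven't met these features, here is a small standalone sketch of each. Foo, Shouty and the variables are invented for illustration and do not come from the XML module.

```raku
class Foo { }

# Runtime lookup: the name is resolved while the program runs.
my $type = ::('Foo');
say $type.^name;

# `but`: returns a copy of the value with a role mixed in.
role Shouty { method shout { self.uc ~ '!' } }
my $greeting = 'hi' but Shouty;
say $greeting.shout;

# Proxy: every read and every write of the container runs code.
my $stored = 0;
my $counter := Proxy.new(
    FETCH => method ()     { $stored },
    STORE => method ($new) { $stored = $new },
);
$counter = 42;
say $counter;
```

Each of these is convenient, but each adds work at run time compared with a plain class reference, a plain value, or a plain variable.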

Won't help with memory, but should help with speed: most classes use method-call syntax for attributes ($.foo) when they could reference the attribute directly ($!foo). I don't think the optimizer is smart enough to optimize that away.
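A minimal sketch of the two forms side by side. ToyNode and its methods are hypothetical and not taken from the module.

```raku
class ToyNode {
    has Str $.name;
    has     @.children;

    # $.name and @.children are method calls on self (the public accessors).
    method label-via-accessor { $.name ~ ' (' ~ @.children.elems ~ ')' }

    # $!name and @!children read the attributes directly, with no dispatch.
    method label-direct       { $!name ~ ' (' ~ @!children.elems ~ ')' }
}

my $node = ToyNode.new(name => 'root', children => <a b c>);
say $node.label-via-accessor;
say $node.label-direct;
```

Both methods return the same string here. The difference that matters for compatibility is that the accessor form is virtual, so a subclass overriding name would be seen by label-via-accessor but not by label-direct.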

This can almost certainly be optimized by a LOT, but I'm not sure if it can be done while maintaining 100% backwards compatibility. Guess I'll give it a try.

alabamenhu avatar Feb 05 '23 01:02 alabamenhu

It's certainly mostly my fault the XML module is slow as molasses in January. When I first worked on this in 2010, I wanted to try using all of the cool language features that had drawn me to Raku in the first place, and had more emphasis on that than on performance. Subsequent updates only focused on trying even more cool new features.

I'd planned on eventually writing an add-on extension using LibXML2 bindings (I see there is at least one module doing that now) but using the simpler API this module provides.

All of the amazing developers who have worked on this since I abandoned my Raku libraries a decade ago have improved it substantially, and they are all saints for working on the convoluted codebase I left behind.

supernovus avatar Feb 21 '23 16:02 supernovus

@supernovus don't worry - it's really much better that you left a lot of stuff to work with/on than if you had just silently abandoned it. Also, the code isn't all that bad really... when I started looking into it, what struck me was the "builder pattern" everywhere. I thought that would be an immediate and straightforward place for improvements - but then @alabamenhu started actually making changes and reported that there aren't really easy gains with an eager system - in which case I also wouldn't say it's really your fault.

@alabamenhu are you planning to adopt this module, by the way? Not pressuring you in either direction, just curious. And if you don't, maybe your changes could be merged back into this repo, with a new version published perhaps.

2colours avatar Feb 21 '23 17:02 2colours

> @alabamenhu are you planning to adopt this module, by the way? Not pressuring you in either direction, just curious. And if you don't, maybe your changes could be merged back into this repo, with a new version published perhaps.

Not sure if I'll adopt it per se, but I'll see what I can do to work with it. One important thing to consider here: this is a pure Raku module, and that has major value even if a wrapper for LibXML would be faster. There's no guarantee that LibXML will be available on any given system, so a fully vanilla Raku module is a good thing.

One thing that MIGHT be faster is what I did for parsing number format strings in Intl::Format::Number, which is to integrate the actions into the grammar. XML is unambiguous as it moves forward (so it can be made entirely out of tokens), which provides some real opportunities for improvement. It will probably be a few months before I can work out something along those lines though.
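As a rough illustration of what "actions inside the grammar" can look like, here is a toy sketch. It is an assumption about the general technique, not the actual code of Intl::Format::Number or of this module, and InlineItems is a made-up grammar.

```raku
# Toy grammar (hypothetical): every rule is a token, so nothing backtracks,
# and each rule builds its own result in an embedded code block.
grammar InlineItems {
    token TOP     { <entry>* { make $<entry>.map(*.made).List } }
    token entry   { '<item>' <content> '</item>' { make ~$<content> } }
    token content { <-[ \< ]>* }
}

my $result = InlineItems.parse('<item>a</item><item>b</item>').made;
say $result.join(', ');
```

Because the result is built in the same pass, there is no separate actions class to dispatch to for each matched rule.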

alabamenhu avatar Feb 22 '23 12:02 alabamenhu

While looking into #68 I found that the <xmldecl>? and <doctypedecl>? allow catastrophic backtracking behaviour when the document in total is not well-formed. I'm not sure if changing that makes the performance better for well-formed documents, but tracing what the grammar does exactly might be a good first step to improving the performance.
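The rules in question aren't shown in this thread, so the following is only a general illustration of the mechanism: a backtracking pattern retries shorter matches when what follows fails, while a ratcheting pattern (which is what token uses) does not.

```raku
# Backtracking (the default for bare regexes): \w+ first takes 'abc',
# then gives one character back so the final . can match.
say so 'abc' ~~ / \w+ . /;

# Ratcheting (:r, implied by `token`): \w+ keeps what it took,
# so the final . has nothing left to match.
say so 'abc' ~~ / :r \w+ . /;
```

The first line prints True and the second False. On a large malformed input, the retrying behaviour in the first form is what can multiply into very long run times.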

timo avatar Feb 23 '25 00:02 timo

Worked on this over in #73 a little bit.

timo avatar Feb 26 '25 19:02 timo