xml-rs

Performance is not comparable to other XML parsing libraries

Open conradev opened this issue 8 years ago • 8 comments

I build and maintain a library for parsing property list files in Rust, plist-rs, and I created benchmarks to compare it to the other common plist parsing libraries:

$ rustup run nightly cargo bench --features libplist
     Running target/release/comparison-1b39fc719adbc926

running 6 tests
test foundation::bench_binary ... bench:   2,214,275 ns/iter (+/- 785,637)
test foundation::bench_xml    ... bench:   7,600,543 ns/iter (+/- 1,284,842)
test libplist::bench_binary   ... bench:   4,147,479 ns/iter (+/- 1,656,727)
test libplist::bench_xml      ... bench:  13,847,505 ns/iter (+/- 6,601,819)
test rust::bench_binary       ... bench:   2,303,294 ns/iter (+/- 1,778,663)
test rust::bench_xml          ... bench:  32,686,229 ns/iter (+/- 5,390,257)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured

The XML property list parser in plist-rs is based on xml-rs, and as you can see, it is more than twice as slow as libplist (which uses libxml) and more than four times as slow as NSPropertyListSerialization (Apple's implementation, which uses a custom XML parser).
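For reference, the slowdown factors can be checked directly from the mean `ns/iter` figures in the benchmark output. This is only a rough sketch: it assumes the `foundation::*` rows correspond to NSPropertyListSerialization, the `slowdown` helper and constant names are made up for illustration, and the reported error margins (for example +/- 6,601,819 on `libplist::bench_xml`) are large enough that the ratios should be read as approximate.

```rust
// Mean ns/iter values copied from the benchmark output in this issue.
const RUST_XML_NS: f64 = 32_686_229.0; // rust::bench_xml (plist-rs on top of xml-rs)
const LIBPLIST_XML_NS: f64 = 13_847_505.0; // libplist::bench_xml
// Assumption: the `foundation` rows measure NSPropertyListSerialization.
const FOUNDATION_XML_NS: f64 = 7_600_543.0; // foundation::bench_xml

/// Hypothetical helper: how many times slower `slow_ns` is than `fast_ns`.
fn slowdown(slow_ns: f64, fast_ns: f64) -> f64 {
    slow_ns / fast_ns
}
```

Calling `slowdown(RUST_XML_NS, LIBPLIST_XML_NS)` gives about 2.36, and `slowdown(RUST_XML_NS, FOUNDATION_XML_NS)` gives about 4.30, which matches the "twice" and "four times" description.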

Just wanted to open this as a tracking issue to investigate where the issues are.

conradev • Jun 09 '16 21:06

Thanks! Indeed, I have not invested heavily in performance improvements yet. My goal is to provide a standards-compliant parser first. However, I'm not really surprised that libxml2 and Apple's implementation are faster :) After all, they are written in C, and at least libxml2 is bound to be heavily optimized by hand.

netvl • Jun 10 '16 07:06

Thanks for making this crate, we've got a lot of miles out of it in Rusoto. 👍

Any thoughts on how best to contribute to improving performance of xml-rs?

matthewkmayer • Mar 14 '18 06:03

@matthewkmayer I see several directions for optimization.

  1. xml-rs was first created ages ago, long before the first stable version of Rust was available. Therefore some details of its API are not really up-to-date. In particular, xml-rs allocates a lot. Ideally, it should work like quick-xml does, i.e. reading data into its internal buffer and giving out references to it.
  2. The parser-lexer combination as it is done here is not ideal. I think it should be possible to optimize the state machine there quite a lot. Maybe it makes sense to get rid of the lexer entirely.
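To make the two points above concrete, here is a minimal sketch of the difference between events that own their data (one allocation per event) and events that borrow from the input buffer, produced by a single-pass tokenizer with no separate lexer stage. Everything in it (`OwnedEvent`, `BorrowedEvent`, `Tokenizer`, `to_owned_event`) is hypothetical and is not the API of xml-rs or quick-xml. It only handles `<name>`, `</name>` and text, which is far from a compliant XML parser.

```rust
/// Hypothetical event type that owns its data: every event allocates a String.
#[derive(Debug, PartialEq)]
enum OwnedEvent {
    StartTag(String),
    EndTag(String),
    Text(String),
}

/// Hypothetical event type that borrows slices of the input buffer: no allocation.
#[derive(Debug, PartialEq)]
enum BorrowedEvent<'a> {
    StartTag(&'a str),
    EndTag(&'a str),
    Text(&'a str),
}

impl BorrowedEvent<'_> {
    /// Copies the borrowed slice into a new String; this is the cost the
    /// borrowed design lets callers skip when they do not need ownership.
    fn to_owned_event(&self) -> OwnedEvent {
        match *self {
            BorrowedEvent::StartTag(s) => OwnedEvent::StartTag(s.to_string()),
            BorrowedEvent::EndTag(s) => OwnedEvent::EndTag(s.to_string()),
            BorrowedEvent::Text(s) => OwnedEvent::Text(s.to_string()),
        }
    }
}

/// Toy single-pass tokenizer over an in-memory buffer (no attributes,
/// entities, comments, or well-formedness checks).
struct Tokenizer<'a> {
    rest: &'a str,
}

impl<'a> Tokenizer<'a> {
    fn new(input: &'a str) -> Self {
        Tokenizer { rest: input }
    }
}

impl<'a> Iterator for Tokenizer<'a> {
    type Item = BorrowedEvent<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        let rest: &'a str = self.rest;
        if rest.is_empty() {
            return None;
        }
        if let Some(after_lt) = rest.strip_prefix('<') {
            // Tag: take everything up to the next '>' (or to the end if unterminated).
            let end = after_lt.find('>').unwrap_or(after_lt.len());
            let body = &after_lt[..end];
            self.rest = after_lt.get(end + 1..).unwrap_or("");
            Some(match body.strip_prefix('/') {
                Some(name) => BorrowedEvent::EndTag(name),
                None => BorrowedEvent::StartTag(body),
            })
        } else {
            // Text: take everything up to the next '<'.
            let end = rest.find('<').unwrap_or(rest.len());
            let (text, remaining) = rest.split_at(end);
            self.rest = remaining;
            Some(BorrowedEvent::Text(text))
        }
    }
}
```

Iterating `Tokenizer::new("<a>hi</a>")` yields a start tag, a text event and an end tag whose string slices all point into the original input. A caller that needs to keep an event past the lifetime of the buffer can opt into the allocation with `to_owned_event`.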

Other than that, unfortunately, I don't have any pointers. I really suck at performance optimizations :( For example, I suspect it may be possible to use SIMD operations, because I have heard that parsers for other formats such as JSON can get a lot out of them, but I don't know for sure, since I have never investigated this field.

Ideally, the first step should be revamping the API so it no longer allocates that much. Supporting encodings other than UTF-8 should likely be a part of it, because I think that integrating encoding handling could affect the design significantly. The beginnings of what I want to do are in the "parser-rearchitecture" branch, but I'm afraid I don't have enough time to work on it right now.

netvl • Mar 19 '18 05:03

> Ideally, it should work like quick-xml does, i.e. reading data into its internal buffer and giving out references to it.

I don't know how to ask these questions exactly, but I'll try anyway. Please forgive my bluntness.

What does or should xml-rs do better than quick-xml?

Is it worth redesigning xml-rs's interface, or is quick-xml a better starting point now? xml-rs was first, filling a gap in the Rust ecosystem, and I'm grateful for that. It's totally understandable that its pre-Rust 1.0 API isn't where you want it to be today. If another crate is much closer, I think there's no shame in deprecating xml-rs. On the other hand, maybe xml-rs has some other significant advantage over quick-xml (standards compliance?), and it'd aid my understanding to spell that out as I decide what crate to base my work on...

I want to emphasize that I have the utmost respect for your work, whatever the answer to the questions above.

scottlamb • Nov 12 '21 18:11

@scottlamb I've been out of the Rust ecosystem for a long time already, so unfortunately I don't know the state of other crates. Last time I checked, none of the other XML crates were focused on full compliance with the XML spec or even declared that to be their goal, and compliance was the primary "selling point" of this library. (xml-rs was intended to be as compliant as possible, but unfortunately I don't think it is there right now, primarily because I have not been able to invest effort in it for a long time, and still cannot.)

Unfortunately, unless someone else is willing to take over maintenance, xml-rs will become effectively deprecated anyway. As it stands now, its API is indeed quite obsolete, and there are performance issues and some bugs, so it is highly likely that you would want to look into other libraries first.

netvl • Nov 12 '21 19:11

Thanks! I'll look into quick-xml and show up here again if I find a deal-breaker with it...

scottlamb • Nov 12 '21 20:11

I'm back. :-/ I took a look at the other crates and didn't see anything more promising than xml-rs:

  • quick-xml doesn't look like it will ever be a standards-compliant parser. I asked the author about their goals here and haven't heard back (yet), but my sense from what I've seen is that they don't want to check well-formedness if it degrades performance even a little. There's probably a fair number of people who like that approach, but I personally am more interested in a standards-compliant library like xml-rs.
  • xmlparser is a (very nice but) limited library. Not only is it (in the author's words) a "low-level XML tokenizer that preserves the positions of the tokens and is not intended to be used directly" but it only parses a &str, so it can't do streaming or handle other encodings. So I don't think it's suitable as a replacement for xml-rs or even as a base for xml-rs.
  • xml5ever explicitly (in its author's words) is "alpha quality" and "trades well-formedness for error recovery"
  • sxd_document only operates on a DOM; there's no lower-level streaming API.
  • rapid-xml looks like something used internally within a company and open-sourced without much effort at community engagement, e.g. the only issue went unanswered. It has really interesting, sophisticated SIMD-based logic, but for small documents it appears to be slower than quick-xml despite all that, I think in large part because of the overhead of the syscalls involved in its slice-deque usage. And slice-deque is a (really cool but) unmaintained crate with known soundness problems.
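As an illustration of why "only parses a &str" rules out streaming, here is a rough sketch of what processing input incrementally from a reader involves: the parser sees arbitrary chunks and has to carry state across chunk boundaries. The function `count_start_tags` is hypothetical and deliberately crude (it counts any `<` not followed by `/`, `!` or `?`, and ignores CDATA, comments and encodings), so it is not how any of the crates above work.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: counts start (or self-closing) tags while reading the
/// document chunk by chunk, without ever holding the whole input in memory.
fn count_start_tags<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut count = 0usize;
    // State carried across chunks: the previous byte was '<', and we still
    // need the next byte to decide what kind of markup it starts.
    let mut pending_lt = false;
    loop {
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            break; // end of input
        }
        for &b in chunk {
            if pending_lt {
                if b != b'/' && b != b'!' && b != b'?' {
                    count += 1;
                }
                pending_lt = false;
            }
            if b == b'<' {
                pending_lt = true;
            }
        }
        let consumed = chunk.len();
        reader.consume(consumed);
    }
    Ok(count)
}
```

Because the `pending_lt` flag survives between `fill_buf` calls, the result is the same whether the reader hands over the document in one piece or one byte at a time. A tokenizer that takes a complete `&str` never has to deal with this, which keeps it simple but means the caller must buffer and decode the entire document first.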

For my current use case (configuring IP cameras via ONVIF), I don't really need high performance. This is control plane stuff, not data plane stuff. So I think I'm going to stick with xml-rs for now.

> Unfortunately, unless someone else is willing to take over maintenance, xml-rs will become effectively deprecated anyway.

fwiw, I'm willing to put in a little time if you'd find it helpful. You probably don't want to just hand the reins over to me given that (a) you don't know me, and (b) XML is sort of tangential to my projects rather than something I should be devoting a huge amount of time to improving the efficiency of. But I'd be happy to, say, author a PR when there's a severe bug, or take a first-pass look at other PRs, etc., if that helps.

scottlamb • Nov 16 '21 19:11

Since Rust already has some XML parsers that are fast, but incomplete and/or inflexible, I think this crate could focus on conformance and ease of use instead.

kornelski • May 10 '23 22:05