quick-xml icon indicating copy to clipboard operation
quick-xml copied to clipboard

New parser implementation

Open Mingun opened this issue 7 months ago • 5 comments

The draft of the new parser implementation, based on approach used in quickest XML parser implementation according to our tests -- maybe_xml.

This PR is created to get early feedback. The new parser comes from my attempt to implement correct parsing of DTD. When I wrote DTD parser I found that compiling quick-xml takes a little more time than I would like, so I implemented it in the separate crate. this has advantages and disadvantages:

  • 👍 There are no crates for correct DTD parsing so we could occupy the free niche;
  • 👍 More fast development;
  • 👎 Potentialy a slight performance regression because compiler could not be able to inline functions ccross crate boundaries, but enabling LTO should fix that problem;
  • 👎 Potentialy more hard to maintain two crates instead of one

The new quick-dtd crate currently included in quick-xml source tree just for easy prototyping. In the final PR it should stay completely independent.

Things that need to finish:

  • [ ] Decide if we want to make quick-dtd as a separate crate or implement DTD parsing in a dtd module of quick-xml. Of course, I plan to give access of all maintainers of quick-xml to the new crate so they does not lost control over this crate. Is it better to keep DTD parsing in quick-xml? What do you think?
  • [ ] Make tests pass. I think, that we should not follow current tests assumptions in all cases, some tests should be adjusted instead
  • [ ] Run benchmarks. I expect that performance should improve, but did not yet check that. It should be possible even in the current state. The failed tests are all about errors reporting (note that debug printing in the last commit slows down the parser A LOT, especially asynchronous).
  • [ ] Use memchr where possible, but maybe that would unnecessary. Need to check benchmarks
  • [ ] Check if this fixes some bugs from bugtracker

Mingun avatar Nov 29 '23 20:11 Mingun

Obviously this will take some time to review fully, I'll wait until it's ready unless there's something in particular you'd like me to look at.

FYI, I'm not sure if you've seen your email but there was a party that was very interested in resolving https://github.com/tafia/quick-xml/issues/285

dralley avatar Nov 29 '23 21:11 dralley

Main question now -- do you agree that DTD parsing can be moved to it's own crate, or you prefer to keep it as a module inside quick-xml? If use the separate crate, I'll extract it from this PR and publish separately and then add a dependency.

Yeah, I've seen the email, just not found time to answer yet... but I will definitely find.

Mingun avatar Nov 29 '23 21:11 Mingun

do you agree that DTD parsing can be moved to it's own crate, or you prefer to keep it as a module inside quick-xml?

The only reason I see for splitting functionality into a separate crate would be if it has significant utility as a library independent from the original library, or perhaps if it makes a significant difference in compile times.

It's not enough code for compile time to be a major concern and contains no additional dependencies, and I struggle to see a DTD library being useful independent of an XML library.

I suppose if it could be easily used with a different XML library, then that might be a sufficient reason? But otherwise I would just keep it as a module.

dralley avatar Nov 29 '23 21:11 dralley

It's not enough code for compile time to be a major concern and contains no additional dependencies, and I struggle to see a DTD library being useful independent of an XML library.

Replying to myself here: you mentioned both of these in your comment but I forgot to properly follow up on that.

You said you actually did notice a compile time difference - how much?

I saw a few crates such as https://crates.io/crates/dtd-parser which do DTD parsing, but I didn't look into how good or "correct" they are.

Honestly, I don't care that much either way whether it is separate crate or module in quick-xml. I have a general preference for doing the simple thing initially unless there's an compelling reason not to, though.

dralley avatar Dec 02 '23 02:12 dralley

You said you actually did notice a compile time difference - how much?

I was not clear enough. The point is that if create a new module inside quick-xml then I need to wait until the whole quick-xml will be compiled before I can see the difference about changes that I've made in a new module. quick-xml now takes about 10 seconds to compile on my machine. The new crate takes only 1-2 seconds. Therefore, it is simply more convenient to develop a very autonomous part of the parser in a separate crate.

I've seen https://crates.io/crates/dtd-parser and while it uses high-performant nom library I decided to not use it because:

  • it is not allocation-free. Read data is stored as Strings which is not right for us if we want to be fast
  • I think, it has too many dependencies for a such simple task. quick-xml right now has only one required dependency without transitive dependencies and I think this is great. I would not like to increase this number too much

Honestly, I don't care that much either way whether it is separate crate or module in quick-xml. I have a general preference for doing the simple thing initially unless there's an compelling reason not to, though.

Ok. Then I think I keep the current approach and implement DTD parser in it's own crate for now. Then it will be simpler to enhance it

Mingun avatar Dec 02 '23 18:12 Mingun