docx-rs
docx-rs copied to clipboard
quick_xml has better performance(around 50 times faster) than xml-rs
Is your feature request related to a problem? Please describe.
On a particular file, quick-xml is around 50 times faster than xml-rs crate. https://github.com/tafia/quick-xml#performance-1
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
Describe the solution you'd like
A clear and concise description of what you want to happen.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
I had a quick look, and it does not seem to be a simple drop-in replacement as both crates use a slightly different approach. Also XML-related code quickly becomes interdependent, as by nature there is a lot of nested calls.
It seems to represent a significant rework on the crate, which potential side-effect difficult to isolate. To give you an example, migrating comment_extended and comments_extended to quick-xml gave me some failing tests due to some issues with comment and comments.
I had say it would probably require writing a lot more tests, and maybe introduce a gate feature.
@bokuweb let me know if you think it is worth it.
Hello again, I had a closer look and it seems we could use the serde capabilities of quick-xml to simplify part of the code. It will however probably be a significant/big change that will impact the crate, as it means essentially overhauling the whole serialization/deserialization process.
@bokuweb please let me know if you are open to such a change.
@git-noise Hello. I am highly motivated to switch to quick-xml and to reduce the code, but I am concerned about breaking changes. If the changes can be minimized, I would definitely consider it.
What I can do is showcase the migration gains/trade-offs on a few elements on the reader-side. Some of them seem relatively "isolated" enough that they could be good candidates. I'll put that into a branch on my fork and you can then see if it goes in the right direction or not. We can then maybe decide on /design a more comprehensive action plan?
Would that work?
@bokuweb
Hello,
I migrated some of the code - here https://github.com/git-noise/docx-rs/tree/quick-xml - you can see some initial work done on:
- ExtendedComments
- WebSettings and Div
They were picked up a bit randomly because they're fairly independent and could easily be isolated.
For now the reader logic - so essentially deserialization - relies on one additional Trait FromXMLQuickXml which is the pendant of FromXML for quick-xml
==== TLDR:
- it significantly streamlines the code base and its complexity
- it would probably require a change in the JSON serialization formatting once we implement the writer-side of things. -> While the API could remain stable, it would break the current JSON representation.
==== In more details - please note that this is based on the current code changes, and it may not hold once we migrate more modules:
Pros:
- The main point, is that it significantly reduces the code base as there is almost no need for manual XML parsing:
- parse_xml() now relies on quick-xml serde features
- consequently,
ElementReadertrait would probably not be needed anymore. - some cases require some manual parsing adjustments, but they can be handled with simpler macro.
- Most of the public-API seems to remains the same, which should limit breaking changes. I added some more tests, but the existing ones still pass successfully, which hints that there should not be any regressions.
- Maybe performance, but until we actually migrate the code-base, it may be difficult to quantify.
Cons:
- It is a significant change/workload and more tests would need to be written to catch regressions.
- It may not be the biggest issue, but MSRV for quick-xml seems to be 1.61 - I see clippy currently runs with 1.56 on docx-rs.
- A bigger concern will be a discrepancy between the JSON-serialization and the XML-serialization patterns. While for now I concentrated on the reader-side, a benefit down the road would be to tackle the writer-side (so the XML-serialization). To give you an example, right now Extended Comments, serialize to extendedComments in JSON and to exComments in XML (as per OOXML convention). There are also differences on how internal/children fields are represented. -> I think this would represent a breaking change for people using the JSON representation for docx files
Hello, @bokuweb, just wanted to check if you had time to review this one