docx-rs icon indicating copy to clipboard operation
docx-rs copied to clipboard

quick_xml has better performance(around 50 times faster) than xml-rs

Open sxhxliang opened this issue 2 years ago • 6 comments
trafficstars

Is your feature request related to a problem? Please describe.

On a particular file, quick-xml is around 50 times faster than xml-rs crate. https://github.com/tafia/quick-xml#performance-1

A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like

A clear and concise description of what you want to happen.

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Add any other context or screenshots about the feature request here.

sxhxliang avatar Dec 27 '22 03:12 sxhxliang

I had a quick look, and it does not seem to be a simple drop-in replacement as both crates use a slightly different approach. Also XML-related code quickly becomes interdependent, as by nature there is a lot of nested calls.

It seems to represent a significant rework on the crate, which potential side-effect difficult to isolate. To give you an example, migrating comment_extended and comments_extended to quick-xml gave me some failing tests due to some issues with comment and comments.

I had say it would probably require writing a lot more tests, and maybe introduce a gate feature.

@bokuweb let me know if you think it is worth it.

git-noise avatar Jun 29 '24 12:06 git-noise

Hello again, I had a closer look and it seems we could use the serde capabilities of quick-xml to simplify part of the code. It will however probably be a significant/big change that will impact the crate, as it means essentially overhauling the whole serialization/deserialization process.

@bokuweb please let me know if you are open to such a change.

git-noise avatar Jul 12 '24 15:07 git-noise

@git-noise Hello. I am highly motivated to switch to quick-xml and to reduce the code, but I am concerned about breaking changes. If the changes can be minimized, I would definitely consider it.

bokuweb avatar Jul 13 '24 00:07 bokuweb

What I can do is showcase the migration gains/trade-offs on a few elements on the reader-side. Some of them seem relatively "isolated" enough that they could be good candidates. I'll put that into a branch on my fork and you can then see if it goes in the right direction or not. We can then maybe decide on /design a more comprehensive action plan?

Would that work?

git-noise avatar Jul 13 '24 14:07 git-noise

@bokuweb

Hello,

I migrated some of the code - here https://github.com/git-noise/docx-rs/tree/quick-xml - you can see some initial work done on:

  • ExtendedComments
  • WebSettings and Div

They were picked up a bit randomly because they're fairly independent and could easily be isolated. For now the reader logic - so essentially deserialization - relies on one additional Trait FromXMLQuickXml which is the pendant of FromXML for quick-xml

==== TLDR:

  • it significantly streamlines the code base and its complexity
  • it would probably require a change in the JSON serialization formatting once we implement the writer-side of things. -> While the API could remain stable, it would break the current JSON representation.

==== In more details - please note that this is based on the current code changes, and it may not hold once we migrate more modules:

Pros:

  1. The main point, is that it significantly reduces the code base as there is almost no need for manual XML parsing:
  • parse_xml() now relies on quick-xml serde features
  • consequently, ElementReader trait would probably not be needed anymore.
  • some cases require some manual parsing adjustments, but they can be handled with simpler macro.
  1. Most of the public-API seems to remains the same, which should limit breaking changes. I added some more tests, but the existing ones still pass successfully, which hints that there should not be any regressions.
  2. Maybe performance, but until we actually migrate the code-base, it may be difficult to quantify.

Cons:

  1. It is a significant change/workload and more tests would need to be written to catch regressions.
  2. It may not be the biggest issue, but MSRV for quick-xml seems to be 1.61 - I see clippy currently runs with 1.56 on docx-rs.
  3. A bigger concern will be a discrepancy between the JSON-serialization and the XML-serialization patterns. While for now I concentrated on the reader-side, a benefit down the road would be to tackle the writer-side (so the XML-serialization). To give you an example, right now Extended Comments, serialize to extendedComments in JSON and to exComments in XML (as per OOXML convention). There are also differences on how internal/children fields are represented. -> I think this would represent a breaking change for people using the JSON representation for docx files

git-noise avatar Jul 14 '24 23:07 git-noise

Hello, @bokuweb, just wanted to check if you had time to review this one

git-noise avatar Aug 10 '24 00:08 git-noise