quick-xml icon indicating copy to clipboard operation
quick-xml copied to clipboard

Rework handling general entity references (`&entity;`)

Open Mingun opened this issue 1 year ago • 7 comments

This is a big change in handling general entity references and character references. Open PR early to get feedback.

With this changes we can correctly parse document

<!DOCTYPE root [
  <!ENTITY root "<root/>">
]>
&root;

as equivalent normalized document

<root/>

The updated custom_entities example shows how it would be possible to implement requirement from the specification about parsed general entities. Serde deserializer did not updated yet, because this is not trivial part and probably that will be done in another PR.

Of course, such change probably makes the performance worse, I didn't measure impact yet.

Closes #667

Mingun avatar Jun 21 '24 16:06 Mingun

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 60.69869% with 180 lines in your changes missing coverage. Please review.

Project coverage is 60.64%. Comparing base (a9391f3) to head (69d0020). Report is 51 commits behind head on master.

Files with missing lines Patch % Lines
examples/custom_entities.rs 0.00% 117 Missing :warning:
src/events/mod.rs 38.98% 36 Missing :warning:
src/reader/buffered_reader.rs 83.09% 12 Missing :warning:
src/de/mod.rs 80.76% 5 Missing :warning:
benches/macrobenches.rs 0.00% 4 Missing :warning:
src/errors.rs 0.00% 3 Missing :warning:
benches/microbenches.rs 0.00% 1 Missing :warning:
src/reader/slice_reader.rs 97.56% 1 Missing :warning:
src/writer/async_tokio.rs 0.00% 1 Missing :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #766      +/-   ##
==========================================
+ Coverage   60.21%   60.64%   +0.43%     
==========================================
  Files          41       42       +1     
  Lines       16021    16470     +449     
==========================================
+ Hits         9647     9989     +342     
- Misses       6374     6481     +107     
Flag Coverage Δ
unittests 60.64% <60.69%> (+0.43%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov-commenter avatar Jun 30 '24 13:06 codecov-commenter

I finished work on the base part of the entities support. In this PR new Event::GeneralRef is added together with new BytesRef struct which represents the any &...; reference, including entity references and character references. Character references can be resolved by call BytesRef::resolve_char_ref(), entity references can be resolved by mapping from content of BytesRef to replacement text. Both usages are shown in the updated custom_entities example.

Mingun avatar Jun 30 '24 13:06 Mingun

I won't have a chance to fully review this for a couple of days. Quick question though, am I correct in thinking that this PR will mean that any time a text block contains one or more entity references, instead of the developer receiving one Event::Text containing everything between the opening and closing tags, they will receive a series of Event::Text and Event::GeneralRef which they will then need to merge back together themselves into the original text?

dralley avatar Jun 30 '24 17:06 dralley

Yes, you are correct. But that does not mean that he/she will needed to construct the complete text themselves. In the next PR I plan to rename Reader to the RawReader and add make it return borrow-only RawEvents, and in in the another PR introduce new Reader which will automatically merge all consequent Text, CData and GeneralRef events. This should be much more convenient for the average user. RawReader will only be needed for very fine control.

Because renames affects very many places, I want to do that in a separate PR to reduce noise in PR with new Reader.

Borrow-only reader-only events will be useful also in that sense that I plan to add offset member to them to track event position in the stream. When you construct event for writing you are obviously does not have position and I think it is better to not have a dummy value for it, in order to you couldn't mistakenly use writer event in the reading context and get the wrong position.

Because new Reader will have a stack of the RawReaders (in the same way as demonstrated in custom_entities example), it will be simple recreate readers when we will need to change encoding, so I think, that #158 is very close to resolving.

Mingun avatar Jun 30 '24 18:06 Mingun

@dralley, what do you think about this?

Mingun avatar Jul 21 '24 04:07 Mingun

Sorry for not reviewing this promptly. Once the current set of PRs are merged, can you rebase?

dralley avatar Oct 12 '24 18:10 dralley

@dralley, you can review this. Initially I thought not to merge it until I implement other changes (in the follow-up PRs) that reworks the parser to reduce amount of releases with breaking changes, but because we already have enough amount of breaking changes, I think, it can be merged and included in next release.

Mingun avatar Oct 19 '24 13:10 Mingun

Totally agree. Ok, then I'll merge it when follow-up PR will be ready

Mingun avatar Nov 10 '24 10:11 Mingun

@Mingun The Reader / RawReader split is coming soon, then?

dralley avatar May 09 '25 13:05 dralley

Yes, I think so.

Mingun avatar May 10 '25 13:05 Mingun

@Mingun It looks like 0.38.0 has been released with this breaking change but without the Reader / RawReader split. Can you estimate when the Reader / RawReader split is going to be released? I'm asking so I can better assess whether to spend time to update my code and stitch together events (text, entities, CDATA) myself now or just wait for a little bit and have the new Reader already cover this.

devrosch avatar Jul 30 '25 12:07 devrosch

Right now I estimate this at least to 1 month and version with this split is more likely 0.39 or 0.40, but it's not certain 😉

Mingun avatar Jul 30 '25 17:07 Mingun

Thank you for the quick clarification!

devrosch avatar Aug 01 '25 20:08 devrosch