quick-xml icon indicating copy to clipboard operation
quick-xml copied to clipboard

Proper normalize attribute value normalization

Open dralley opened this issue 3 years ago • 14 comments

closes #371

dralley avatar Apr 03 '22 01:04 dralley

Suggestions :

  • Move this functionality directly to Attribute
  • Provide a fast path to get the raw value

TODO:

  • Character reference & entity reference substitution with associated error handling
    • Figure out what the API needs to look like

dralley avatar Apr 03 '22 15:04 dralley

Codecov Report

Merging #379 (538e5cd) into master (e701c4d) will increase coverage by 0.10%. The diff coverage is 86.95%.

:exclamation: Current head 538e5cd differs from pull request most recent head ac7b67b. Consider uploading reports for the commit ac7b67b to get more accurate results

@@            Coverage Diff             @@
##           master     #379      +/-   ##
==========================================
+ Coverage   61.37%   61.48%   +0.10%     
==========================================
  Files          20       20              
  Lines       10157    10229      +72     
==========================================
+ Hits         6234     6289      +55     
- Misses       3923     3940      +17     
Flag Coverage Δ
unittests 61.48% <86.95%> (+0.10%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/errors.rs 9.52% <ø> (-2.85%) :arrow_down:
src/escapei.rs 13.90% <0.00%> (ø)
src/reader.rs 88.36% <ø> (+0.94%) :arrow_up:
src/events/attributes.rs 94.12% <88.88%> (+3.45%) :arrow_up:
src/lib.rs 21.09% <0.00%> (-4.92%) :arrow_down:
src/de/escape.rs 65.15% <0.00%> (-1.28%) :arrow_down:
src/de/seq.rs 91.83% <0.00%> (-0.76%) :arrow_down:
src/se/mod.rs 93.81% <0.00%> (-0.01%) :arrow_down:
src/writer.rs 90.36% <0.00%> (+0.02%) :arrow_up:
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e701c4d...ac7b67b. Read the comment docs.

codecov-commenter avatar Jun 20 '22 19:06 codecov-commenter

I plan to continue working on this over the next week or two

dralley avatar Jun 22 '22 17:06 dralley

@Mingun Questions:

Currently we have the functions unescaped_value, unescaped_value_with_custom_entities and their decode equivalents, that do the escaping part but don't implement the rest of the XML attribute-value-normalization spec. I'm not sure I see any reason for those to continue to exist as far as XML is concerned, but for HTML it makes some sense, as HTML only seems to do unescaping without any other normalization of the value.

  1. Does that sound accurate / align with your knowledge
  2. Do you think it would make more sense to stick with functional names, like normalized_value / unescaped_value and rely on the documentation to tell users what they ought to be using, or switch more descriptive names such as html_value / xml_value
  3. The behavior of Attributes depends on whether or not .html is set, should we look at doing something similar here, or would that not be worth the additional complication

dralley avatar Jun 23 '22 02:06 dralley

  1. I haven't studied a situation about HTML attributes, therefore, I rely on your understanding of the situation. Then, if you include references to relevant resources in the documentation, I'll be able to learn something about that
  2. I think that functional names are better
  3. Probably we should inverse things and introduce different types for XML / HTML attributes, then implement only relevant methods on each. This will also solve some unpleasant things in the current API -- we can change htmlity of the attributes in the middle of iteration that probably could to lead to sophisticated bugs. I would like to avoid such dangerous usage.

Mingun avatar Jun 23 '22 18:06 Mingun

I'm basically going off of the lack of any kind of discussion of attribute value normalization in the HTML living spec, and this discussion on stackoverflow

https://html.spec.whatwg.org/multipage/dom.html#attributes https://html.spec.whatwg.org/multipage/syntax.html#attributes-2 https://stackoverflow.com/questions/63906320/html5-attribute-value-normalization

I think that functional names are better

I agree

Probably we should inverse things and introduce different types for XML / HTML attributes, then implement only relevant methods on each. This will also solve some unpleasant things in the current API -- we can change htmlity of the attributes in the middle of iteration that probably could to lead to sophisticated bugs. I would like to avoid such dangerous usage.

Can they? It looks like all these fields are private. But, I feel like this is still the best option. There is already attributes() and html_attributes(), it's only a matter of changing the types.

The unfortunate thing will be, that the code between the two is almost the same, just enough so that it will be really annoying to duplicate.

dralley avatar Jun 25 '22 03:06 dralley

Apologies on the lack of forward progress, I haven't been able / willing to devote a lot of extra time to this (or the encoding work) in a while.

@Mingun The partially-but-not-completely recursive aspect of the algorithm in step 3 has ended up being a massive timesink, and I haven't yet gotten it to work. For the sake of making some forward progress, would you be OK with an implementation that handles all of the parts of the algorithm except for the recursive resolution of entity references, and a TODO item for the recursive aspect?

I have no issues with making the other adjustments (such as adding examples), though I'd rather handle the structural API changes separately from this PR.

dralley avatar Jan 24 '23 05:01 dralley

Well, it's still in draft for a reason :) I'll clean it up before asking for a review

dralley avatar Jan 25 '23 15:01 dralley

For your reference -- recently discovered an embryonic project for testing XML parser implementations to the standard's conformance: https://github.com/JOSEPHGILBY/xml-conformance-rs. Not looked at it yet, but seems encouraging.

Looks nice, it may even make sense to integrate the test code into our test suite.

dralley avatar Jan 27 '23 04:01 dralley

I did a cursory review of the changes and it looks fine. Would you prefer the commits squashed or kept separate?

I'll address everything else tomorrow.

dralley avatar Jan 30 '23 06:01 dralley

I prefer to kept separate. It is somehow psychologically uncomfortable for review when one commit changes more than ~200 lines in each (or just several) files, even if half of them -- new tests. ¯_(ツ)_/¯

I think, that at least separating normalization method to it's own commit would be a good idea. This is pretty isolated thing which, however, is big enough. That commit needed to be updated:

  • Add tests for non-ASCII input
  • Introduce new error kind and return it when depth become 0. That means that we reach recursion limit
  • (could be postponed) I would like to have limit configurable
  • (could be postponed) Add a way to explicitly detect recursion (i.e. track the resolved entities and report which entity was defined recursively)
  • (could be postponed) Add an ability to pass metainfo around entities. That way we can provide a way to report in error where the erroneous entity is defined, if resolver function provide that information

Mingun avatar Jan 30 '23 14:01 Mingun

Introduce new error kind and return it when depth become 0. That means that we reach recursion limit

(could be postponed) I would like to have limit configurable

How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider).

Should we consider keeping the state in Reader and doing something along the lines of

  • reader.normalize_attribute_value(attr), or
  • attr.normalize_value_with(resolve_entity, reader) or attr.normalize_value_with(reader) (moving the resolve_entity into Reader entirely)?

Also I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached. Should that be two separate errors in your opinion, or one error?

dralley avatar Nov 12 '23 16:11 dralley

@Mingun ^

dralley avatar Nov 15 '23 05:11 dralley