Proper normalize attribute value normalization
closes #371
Suggestions :
- Move this functionality directly to
Attribute - Provide a fast path to get the raw value
TODO:
- Character reference & entity reference substitution with associated error handling
- Figure out what the API needs to look like
Codecov Report
Merging #379 (538e5cd) into master (e701c4d) will increase coverage by
0.10%. The diff coverage is86.95%.
:exclamation: Current head 538e5cd differs from pull request most recent head ac7b67b. Consider uploading reports for the commit ac7b67b to get more accurate results
@@ Coverage Diff @@
## master #379 +/- ##
==========================================
+ Coverage 61.37% 61.48% +0.10%
==========================================
Files 20 20
Lines 10157 10229 +72
==========================================
+ Hits 6234 6289 +55
- Misses 3923 3940 +17
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 61.48% <86.95%> (+0.10%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/errors.rs | 9.52% <ø> (-2.85%) |
:arrow_down: |
| src/escapei.rs | 13.90% <0.00%> (ø) |
|
| src/reader.rs | 88.36% <ø> (+0.94%) |
:arrow_up: |
| src/events/attributes.rs | 94.12% <88.88%> (+3.45%) |
:arrow_up: |
| src/lib.rs | 21.09% <0.00%> (-4.92%) |
:arrow_down: |
| src/de/escape.rs | 65.15% <0.00%> (-1.28%) |
:arrow_down: |
| src/de/seq.rs | 91.83% <0.00%> (-0.76%) |
:arrow_down: |
| src/se/mod.rs | 93.81% <0.00%> (-0.01%) |
:arrow_down: |
| src/writer.rs | 90.36% <0.00%> (+0.02%) |
:arrow_up: |
| ... and 4 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update e701c4d...ac7b67b. Read the comment docs.
I plan to continue working on this over the next week or two
@Mingun Questions:
Currently we have the functions unescaped_value, unescaped_value_with_custom_entities and their decode equivalents, that do the escaping part but don't implement the rest of the XML attribute-value-normalization spec. I'm not sure I see any reason for those to continue to exist as far as XML is concerned, but for HTML it makes some sense, as HTML only seems to do unescaping without any other normalization of the value.
- Does that sound accurate / align with your knowledge
- Do you think it would make more sense to stick with functional names, like
normalized_value/unescaped_valueand rely on the documentation to tell users what they ought to be using, or switch more descriptive names such ashtml_value/xml_value - The behavior of
Attributesdepends on whether or not.htmlis set, should we look at doing something similar here, or would that not be worth the additional complication
- I haven't studied a situation about HTML attributes, therefore, I rely on your understanding of the situation. Then, if you include references to relevant resources in the documentation, I'll be able to learn something about that
- I think that functional names are better
- Probably we should inverse things and introduce different types for XML / HTML attributes, then implement only relevant methods on each. This will also solve some unpleasant things in the current API -- we can change
htmlity of the attributes in the middle of iteration that probably could to lead to sophisticated bugs. I would like to avoid such dangerous usage.
I'm basically going off of the lack of any kind of discussion of attribute value normalization in the HTML living spec, and this discussion on stackoverflow
https://html.spec.whatwg.org/multipage/dom.html#attributes https://html.spec.whatwg.org/multipage/syntax.html#attributes-2 https://stackoverflow.com/questions/63906320/html5-attribute-value-normalization
I think that functional names are better
I agree
Probably we should inverse things and introduce different types for XML / HTML attributes, then implement only relevant methods on each. This will also solve some unpleasant things in the current API -- we can change htmlity of the attributes in the middle of iteration that probably could to lead to sophisticated bugs. I would like to avoid such dangerous usage.
Can they? It looks like all these fields are private. But, I feel like this is still the best option. There is already attributes() and html_attributes(), it's only a matter of changing the types.
The unfortunate thing will be, that the code between the two is almost the same, just enough so that it will be really annoying to duplicate.
Apologies on the lack of forward progress, I haven't been able / willing to devote a lot of extra time to this (or the encoding work) in a while.
@Mingun The partially-but-not-completely recursive aspect of the algorithm in step 3 has ended up being a massive timesink, and I haven't yet gotten it to work. For the sake of making some forward progress, would you be OK with an implementation that handles all of the parts of the algorithm except for the recursive resolution of entity references, and a TODO item for the recursive aspect?
I have no issues with making the other adjustments (such as adding examples), though I'd rather handle the structural API changes separately from this PR.
Well, it's still in draft for a reason :) I'll clean it up before asking for a review
For your reference -- recently discovered an embryonic project for testing XML parser implementations to the standard's conformance: https://github.com/JOSEPHGILBY/xml-conformance-rs. Not looked at it yet, but seems encouraging.
Looks nice, it may even make sense to integrate the test code into our test suite.
I did a cursory review of the changes and it looks fine. Would you prefer the commits squashed or kept separate?
I'll address everything else tomorrow.
I prefer to kept separate. It is somehow psychologically uncomfortable for review when one commit changes more than ~200 lines in each (or just several) files, even if half of them -- new tests. ¯_(ツ)_/¯
I think, that at least separating normalization method to it's own commit would be a good idea. This is pretty isolated thing which, however, is big enough. That commit needed to be updated:
- Add tests for non-ASCII input
- Introduce new error kind and return it when
depthbecome0. That means that we reach recursion limit - (could be postponed) I would like to have limit configurable
- (could be postponed) Add a way to explicitly detect recursion (i.e. track the resolved entities and report which entity was defined recursively)
- (could be postponed) Add an ability to pass metainfo around entities. That way we can provide a way to report in error where the erroneous entity is defined, if resolver function provide that information
Introduce new error kind and return it when depth become 0. That means that we reach recursion limit
(could be postponed) I would like to have limit configurable
How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider).
Should we consider keeping the state in Reader and doing something along the lines of
reader.normalize_attribute_value(attr), orattr.normalize_value_with(resolve_entity, reader)orattr.normalize_value_with(reader)(moving theresolve_entityintoReaderentirely)?
Also I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached. Should that be two separate errors in your opinion, or one error?
@Mingun ^