quick-xml
quick-xml copied to clipboard
Allow `�` values
Do not return error on parsing preperly escaped zero values, which are perfectly valid. Add simple test to check it.
I'm not sure this is true
http://lists.xml.org/archives/xml-dev/201211/msg00014.html http://lists.xml.org/archives/xml-dev/201211/msg00012.html
Codecov Report
Merging #496 (aa932fd) into master (0c6eeb0) will decrease coverage by
0.03%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## master #496 +/- ##
==========================================
- Coverage 61.70% 61.67% -0.04%
==========================================
Files 33 33
Lines 15343 15337 -6
==========================================
- Hits 9468 9459 -9
- Misses 5875 5878 +3
Flag | Coverage Δ | |
---|---|---|
unittests | 61.67% <100.00%> (-0.04%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
Impacted Files | Coverage Δ | |
---|---|---|
src/escapei.rs | 12.40% <100.00%> (-0.14%) |
:arrow_down: |
src/de/mod.rs | 66.95% <0.00%> (-0.29%) |
:arrow_down: |
src/lib.rs | 13.17% <0.00%> (-0.08%) |
:arrow_down: |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
According to https://www.w3.org/TR/REC-xml/ the value of
Got it: in TR XML 1.1 there's a statement "#x0 is still forbidden both directly and as a character reference."�
has never stated invalid.
I'll close this PR and will vendor a fork for internal needs then.
Do your internal needs require handling a lot of documents with that issue? Can you speak about the source & purpose of said documents? The "API concerns" I have to imagine are related to C-strings (with null termination) and not strictly relevant to Rust. However, there would be nothing preventing Rust from reading such documents, even though we shouldn't allow them to be output.
(That is not to say that quick-xml
should or will support them, but it could be discussed. I don't know what other XML libraries are doing but I imagine Java and C# ones may not always implement the checks since they use UTF-16 encodings)
There are lots of huge CPU hardware-related documents (dozens of gigabytes), and there are properly escaped nulls. Minor example of the same kind is Apple's keyboard layout XML.
In both cases I have to read those XMLs exclusively by Rust, and I never write XML, so in my practice I would never face an error on attempt to write NULL.
On the other hand, I may put read-write for NULL under a feature flag to allow this use case when a user (like me or my engineers) sure they needs that.
@Mingun (and @tafia), what are your thoughts about allowing nominally specification-breaking, but widespread behavior like this?
Is this something that might be acceptable under a feature flag?
Yes, I actually does not understand, why XML forbid character references to some characters in the first place, and I'm glad to allow them. This ability under a feature flag would be welcome.
@sashka, can you add a properly documented feature flag (say, allow-restricted-chars
) and add tests for all restricted characters (as defined in the rule RestrictedChar
in https://www.w3.org/TR/xml11/#charsets) + �
? If we actually already allow some of that ranges (I looks like we allow), then I think we should forbid they usage without a feature flag to be standard compatible. Then rename EntityWithNull
to RestrictedChar
, add a doc reference to the standard. We use XML 1.1
Let's open an issue and collect a bit more information before we merge anything - for instance, what do other XML libraries do.
It looks like nokogiri may have at one point just stripped these characters out instead of raising an error. However, attempting to write XML with those characters will raise a fatal error.
On the other hand it looks like both Python's ElementTree and libxml2 raise a hard error even while parsing
In [24]: xml = """<?xml version="1.0"?>
...: <data>
...: <country>foobar �</country>
...: </data>
...: """
In [25]: libxml2.parseDoc(xml)
Entity: line 3: parser error : xmlParseCharRef: invalid xmlChar value 0
<country>foobar �</country>
^
In [3]: ET.fromstring(xml)
Traceback (most recent call last):
Input In [3] in <cell line: 1>
ET.fromstring(xml)
ParseError: reference to invalid character number: line 3, column 20
I believe Python's XML parsing is based on Expat, so between that and libxml2, that covers 2 of the most-used XML libraries.
Are the NUL characters in these documents "load-bearing"? Would they be unsafe to strip out?
The following behaviors are possible:
- emit hard error (= stop parsing / writing and return error)
- silently skip invalid characters (or at least log them, but we currently does not depend on any logger)
- accept invalid characters
Each behavior could be applied to a raw or an escaped form of characters.
I think, that behavior 2 is an anti-pattern and we should not allow it. I cannot imagine a situation where you would need to silently change the data that you operate. If you for whatever reason need that, then you can done that manually in Writer
API and we can add an API in serde Serializer
for that.
The behaviors 1 and 3 could be chosen either via feature flag, or via a run-time option. I'll agree with any of the options or even a combination of them.
I didn't have situations where I needed a dynamic choice of this option, so the feature flag should be enough in the first stage and easier to implement.
The choice between support only escaped form, or both raw and escaped form also could be leaved as implementation details in the first stage. Ah, and I suspect, that currently we forbid only escaped form of �
, but not raw \0
.
There is also a technical question for a feature flag: it should allow restricted chars, forbidden without them, or it should restrict chars, allowed without them? In both cases the flag is additive is some sense, but I think that the allowing behavior is better. Of course, we could have both flags, but does not allow to enable them simultaneously, but I think that is overkill for now.
Any thoughts?
Let's open an issue and collect a bit more information before we merge anything - for instance, what do other XML libraries do.
Actually, you've already open it -- #368.
Surprisingly, the only restricted character which is actually restricted is the escaped NULL:
fn test_restricted_chars_unescape() {
assert_eq!(unescape("\u{0}").unwrap(), "\u{0}");
// assert_eq!(unescape("�").unwrap(), "\u{0}"); // <-- this assert will fail with version 0.26.0
assert_eq!(unescape("\u{1}").unwrap(), "\u{1}");
assert_eq!(unescape("").unwrap(), "\u{1}");
assert_eq!(unescape("\u{2}").unwrap(), "\u{2}");
assert_eq!(unescape("").unwrap(), "\u{2}");
assert_eq!(unescape("\u{3}").unwrap(), "\u{3}");
assert_eq!(unescape("").unwrap(), "\u{3}");
assert_eq!(unescape("\u{4}").unwrap(), "\u{4}");
assert_eq!(unescape("").unwrap(), "\u{4}");
assert_eq!(unescape("\u{5}").unwrap(), "\u{5}");
assert_eq!(unescape("").unwrap(), "\u{5}");
assert_eq!(unescape("\u{6}").unwrap(), "\u{6}");
assert_eq!(unescape("").unwrap(), "\u{6}");
assert_eq!(unescape("\u{7}").unwrap(), "\u{7}");
assert_eq!(unescape("").unwrap(), "\u{7}");
assert_eq!(unescape("\u{8}").unwrap(), "\u{8}");
assert_eq!(unescape("").unwrap(), "\u{8}");
assert_eq!(unescape("\u{b}").unwrap(), "\u{b}");
assert_eq!(unescape("").unwrap(), "\u{b}");
assert_eq!(unescape("\u{c}").unwrap(), "\u{c}");
assert_eq!(unescape("").unwrap(), "\u{c}");
assert_eq!(unescape("\u{e}").unwrap(), "\u{e}");
assert_eq!(unescape("").unwrap(), "\u{e}");
assert_eq!(unescape("\u{f}").unwrap(), "\u{f}");
assert_eq!(unescape("").unwrap(), "\u{f}");
assert_eq!(unescape("\u{10}").unwrap(), "\u{10}");
assert_eq!(unescape("").unwrap(), "\u{10}");
assert_eq!(unescape("\u{11}").unwrap(), "\u{11}");
assert_eq!(unescape("").unwrap(), "\u{11}");
assert_eq!(unescape("\u{12}").unwrap(), "\u{12}");
assert_eq!(unescape("").unwrap(), "\u{12}");
assert_eq!(unescape("\u{13}").unwrap(), "\u{13}");
assert_eq!(unescape("").unwrap(), "\u{13}");
assert_eq!(unescape("\u{14}").unwrap(), "\u{14}");
assert_eq!(unescape("").unwrap(), "\u{14}");
assert_eq!(unescape("\u{15}").unwrap(), "\u{15}");
assert_eq!(unescape("").unwrap(), "\u{15}");
assert_eq!(unescape("\u{16}").unwrap(), "\u{16}");
assert_eq!(unescape("").unwrap(), "\u{16}");
assert_eq!(unescape("\u{17}").unwrap(), "\u{17}");
assert_eq!(unescape("").unwrap(), "\u{17}");
assert_eq!(unescape("\u{18}").unwrap(), "\u{18}");
assert_eq!(unescape("").unwrap(), "\u{18}");
assert_eq!(unescape("\u{19}").unwrap(), "\u{19}");
assert_eq!(unescape("").unwrap(), "\u{19}");
assert_eq!(unescape("\u{1a}").unwrap(), "\u{1a}");
assert_eq!(unescape("").unwrap(), "\u{1a}");
assert_eq!(unescape("\u{1b}").unwrap(), "\u{1b}");
assert_eq!(unescape("").unwrap(), "\u{1b}");
assert_eq!(unescape("\u{1c}").unwrap(), "\u{1c}");
assert_eq!(unescape("").unwrap(), "\u{1c}");
assert_eq!(unescape("\u{1d}").unwrap(), "\u{1d}");
assert_eq!(unescape("").unwrap(), "\u{1d}");
assert_eq!(unescape("\u{1e}").unwrap(), "\u{1e}");
assert_eq!(unescape("").unwrap(), "\u{1e}");
assert_eq!(unescape("\u{1f}").unwrap(), "\u{1f}");
assert_eq!(unescape("").unwrap(), "\u{1f}");
assert_eq!(unescape("\u{7f}").unwrap(), "\u{7f}");
assert_eq!(unescape("").unwrap(), "\u{7f}");
assert_eq!(unescape("\u{80}").unwrap(), "\u{80}");
assert_eq!(unescape("€").unwrap(), "\u{80}");
assert_eq!(unescape("\u{81}").unwrap(), "\u{81}");
assert_eq!(unescape("").unwrap(), "\u{81}");
assert_eq!(unescape("\u{82}").unwrap(), "\u{82}");
assert_eq!(unescape("‚").unwrap(), "\u{82}");
assert_eq!(unescape("\u{83}").unwrap(), "\u{83}");
assert_eq!(unescape("ƒ").unwrap(), "\u{83}");
assert_eq!(unescape("\u{84}").unwrap(), "\u{84}");
assert_eq!(unescape("„").unwrap(), "\u{84}");
assert_eq!(unescape("\u{86}").unwrap(), "\u{86}");
assert_eq!(unescape("†").unwrap(), "\u{86}");
assert_eq!(unescape("\u{87}").unwrap(), "\u{87}");
assert_eq!(unescape("‡").unwrap(), "\u{87}");
assert_eq!(unescape("\u{88}").unwrap(), "\u{88}");
assert_eq!(unescape("ˆ").unwrap(), "\u{88}");
assert_eq!(unescape("\u{89}").unwrap(), "\u{89}");
assert_eq!(unescape("‰").unwrap(), "\u{89}");
assert_eq!(unescape("\u{8a}").unwrap(), "\u{8a}");
assert_eq!(unescape("Š").unwrap(), "\u{8a}");
assert_eq!(unescape("\u{8b}").unwrap(), "\u{8b}");
assert_eq!(unescape("‹").unwrap(), "\u{8b}");
assert_eq!(unescape("\u{8c}").unwrap(), "\u{8c}");
assert_eq!(unescape("Œ").unwrap(), "\u{8c}");
assert_eq!(unescape("\u{8d}").unwrap(), "\u{8d}");
assert_eq!(unescape("").unwrap(), "\u{8d}");
assert_eq!(unescape("\u{8e}").unwrap(), "\u{8e}");
assert_eq!(unescape("Ž").unwrap(), "\u{8e}");
assert_eq!(unescape("\u{8f}").unwrap(), "\u{8f}");
assert_eq!(unescape("").unwrap(), "\u{8f}");
assert_eq!(unescape("\u{90}").unwrap(), "\u{90}");
assert_eq!(unescape("").unwrap(), "\u{90}");
assert_eq!(unescape("\u{91}").unwrap(), "\u{91}");
assert_eq!(unescape("‘").unwrap(), "\u{91}");
assert_eq!(unescape("\u{92}").unwrap(), "\u{92}");
assert_eq!(unescape("’").unwrap(), "\u{92}");
assert_eq!(unescape("\u{93}").unwrap(), "\u{93}");
assert_eq!(unescape("“").unwrap(), "\u{93}");
assert_eq!(unescape("\u{94}").unwrap(), "\u{94}");
assert_eq!(unescape("”").unwrap(), "\u{94}");
assert_eq!(unescape("\u{95}").unwrap(), "\u{95}");
assert_eq!(unescape("•").unwrap(), "\u{95}");
assert_eq!(unescape("\u{96}").unwrap(), "\u{96}");
assert_eq!(unescape("–").unwrap(), "\u{96}");
assert_eq!(unescape("\u{97}").unwrap(), "\u{97}");
assert_eq!(unescape("—").unwrap(), "\u{97}");
assert_eq!(unescape("\u{98}").unwrap(), "\u{98}");
assert_eq!(unescape("˜").unwrap(), "\u{98}");
assert_eq!(unescape("\u{99}").unwrap(), "\u{99}");
assert_eq!(unescape("™").unwrap(), "\u{99}");
assert_eq!(unescape("\u{9a}").unwrap(), "\u{9a}");
assert_eq!(unescape("š").unwrap(), "\u{9a}");
assert_eq!(unescape("\u{9b}").unwrap(), "\u{9b}");
assert_eq!(unescape("›").unwrap(), "\u{9b}");
assert_eq!(unescape("\u{9c}").unwrap(), "\u{9c}");
assert_eq!(unescape("œ").unwrap(), "\u{9c}");
assert_eq!(unescape("\u{9d}").unwrap(), "\u{9d}");
assert_eq!(unescape("").unwrap(), "\u{9d}");
assert_eq!(unescape("\u{9e}").unwrap(), "\u{9e}");
assert_eq!(unescape("ž").unwrap(), "\u{9e}");
assert_eq!(unescape("\u{9f}").unwrap(), "\u{9f}");
assert_eq!(unescape("Ÿ").unwrap(), "\u{9f}");
}
It appears that C#'s dotnet XML library has a flag for this: https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlreadersettings.checkcharacters?view=net-6.0
Same for writing: https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlwritersettings.checkcharacters?view=net-6.0
Surprisingly, the only restricted character which is actually restricted is the escaped NULL:
I'm not surprised given I filed https://github.com/tafia/quick-xml/issues/368. It's something we needed to improve anyways. Do you have any avenue to report the invalid XML? Obviously that's not an immediate solution, I am just curious.
I'll keep all the work for allow-restricted-chars
right there: https://github.com/sashka/quick-xml/tree/allow-restricted-chars
Do you have any avenue to report the invalid XML? Obviously that's not an immediate solution, I am just curious.
No, vendors don't care about general XML validity as soon as their XML files are not for public distribution, and their own tools process their own files perfectly.
The following behaviors are possible:
- emit hard error (= stop parsing / writing and return error)
- silently skip invalid characters (or at least log them, but we currently does not depend on any logger)
- accept invalid characters
Each behavior could be applied to a raw or an escaped form of characters.
I think, that behavior 2 is an anti-pattern and we should not allow it. I cannot imagine a situation where you would need to silently change the data that you operate. If you for whatever reason need that, then you can done that manually in Writer API and we can add an API in serde Serializer for that.
Agree that we shouldn't skip over invalid data. Seems like a terrible approach.
The behaviors 1 and 3 could be chosen either via feature flag, or via a run-time option. I'll agree with any of the options or even a combination of them.
I didn't have situations where I needed a dynamic choice of this option, so the feature flag should be enough in the first stage and easier to implement.
The choice between support only escaped form, or both raw and escaped form also could be leaved as implementation details in the first stage. Ah, and I suspect, that currently we forbid only escaped form of �, but not raw \0.
There is also a technical question for a feature flag: it should allow restricted chars, forbidden without them, or it should restrict chars, allowed without them? In both cases the flag is additive is some sense, but I think that the allowing behavior is better. Of course, we could have both flags, but does not allow to enable them simultaneously, but I think that is overkill for now.
Any thoughts?
An initial conservative approach could be to just provide an advanced version of the various escape / unescape functions which would allow you to configure the behavior either way at runtime. I think I would prefer that approach over a feature flag.
Agreed that we're still missing the checks for these characters in the raw form and that we ought to have them.
Reading through the MS docs again, it appears that CheckCharacters
applies only to character entities. Names and raw text still need to be completely valid and contain no illegal characters no matter what - and it's checked always. @sashka I assume all of the illegal characters are at least encoded as character references and not present in raw form? If that's the case then copying that approach would be sufficient. And as the configuration would only apply to (un)escaping, there wouldn't be issues with needing a pseudo-global configuration.
Does that sound right @sashka ?
@dralley, yes, it sounds right for me. I'm working on proper reading mechanics now.
@sashka Do you plan to continue with this?