Do not return error on parsing preperly escaped zero values, which are perfectly valid. Add simple test to check it.

Oct 26 '22 20:10 sashka

I'm not sure this is true

http://lists.xml.org/archives/xml-dev/201211/msg00014.html http://lists.xml.org/archives/xml-dev/201211/msg00012.html

Oct 26 '22 21:10 dralley

Codecov Report

Merging #496 (aa932fd) into master (0c6eeb0) will decrease coverage by 0.03%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #496      +/-   ##
==========================================
- Coverage   61.70%   61.67%   -0.04%     
==========================================
  Files          33       33              
  Lines       15343    15337       -6     
==========================================
- Hits         9468     9459       -9     
- Misses       5875     5878       +3

Flag	Coverage Δ
unittests	`61.67% <100.00%> (-0.04%)`	:arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/escapei.rs	`12.40% <100.00%> (-0.14%)`	:arrow_down:
src/de/mod.rs	`66.95% <0.00%> (-0.29%)`	:arrow_down:
src/lib.rs	`13.17% <0.00%> (-0.08%)`	:arrow_down:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Oct 26 '22 21:10 codecov-commenter

According to https://www.w3.org/TR/REC-xml/ the value of  has never stated invalid. Got it: in TR XML 1.1 there's a statement "#x0 is still forbidden both directly and as a character reference."

I'll close this PR and will vendor a fork for internal needs then.

Oct 26 '22 21:10 sashka

Do your internal needs require handling a lot of documents with that issue? Can you speak about the source & purpose of said documents? The "API concerns" I have to imagine are related to C-strings (with null termination) and not strictly relevant to Rust. However, there would be nothing preventing Rust from reading such documents, even though we shouldn't allow them to be output.

(That is not to say that quick-xml should or will support them, but it could be discussed. I don't know what other XML libraries are doing but I imagine Java and C# ones may not always implement the checks since they use UTF-16 encodings)

Oct 26 '22 21:10 dralley

There are lots of huge CPU hardware-related documents (dozens of gigabytes), and there are properly escaped nulls. Minor example of the same kind is Apple's keyboard layout XML.

In both cases I have to read those XMLs exclusively by Rust, and I never write XML, so in my practice I would never face an error on attempt to write NULL.

On the other hand, I may put read-write for NULL under a feature flag to allow this use case when a user (like me or my engineers) sure they needs that.

Oct 26 '22 22:10 sashka

@Mingun (and @tafia), what are your thoughts about allowing nominally specification-breaking, but widespread behavior like this?

Is this something that might be acceptable under a feature flag?

Oct 27 '22 03:10 dralley

Yes, I actually does not understand, why XML forbid character references to some characters in the first place, and I'm glad to allow them. This ability under a feature flag would be welcome.

@sashka, can you add a properly documented feature flag (say, allow-restricted-chars) and add tests for all restricted characters (as defined in the rule RestrictedChar in https://www.w3.org/TR/xml11/#charsets) + ? If we actually already allow some of that ranges (I looks like we allow), then I think we should forbid they usage without a feature flag to be standard compatible. Then rename EntityWithNull to RestrictedChar, add a doc reference to the standard. We use XML 1.1

Oct 27 '22 04:10 Mingun

Let's open an issue and collect a bit more information before we merge anything - for instance, what do other XML libraries do.

It looks like nokogiri may have at one point just stripped these characters out instead of raising an error. However, attempting to write XML with those characters will raise a fatal error.

On the other hand it looks like both Python's ElementTree and libxml2 raise a hard error even while parsing

In [24]: xml = """<?xml version="1.0"?>
    ...: <data>
    ...:     <country>foobar &#0;</country>
    ...: </data>
    ...: """

In [25]: libxml2.parseDoc(xml)
Entity: line 3: parser error : xmlParseCharRef: invalid xmlChar value 0
    <country>foobar &#0;</country>
                        ^

In [3]: ET.fromstring(xml)
Traceback (most recent call last):

  Input In [3] in <cell line: 1>
    ET.fromstring(xml)

ParseError: reference to invalid character number: line 3, column 20

I believe Python's XML parsing is based on Expat, so between that and libxml2, that covers 2 of the most-used XML libraries.

Are the NUL characters in these documents "load-bearing"? Would they be unsafe to strip out?

Oct 27 '22 04:10 dralley

The following behaviors are possible:

emit hard error (= stop parsing / writing and return error)
silently skip invalid characters (or at least log them, but we currently does not depend on any logger)
accept invalid characters

Each behavior could be applied to a raw or an escaped form of characters.

I think, that behavior 2 is an anti-pattern and we should not allow it. I cannot imagine a situation where you would need to silently change the data that you operate. If you for whatever reason need that, then you can done that manually in Writer API and we can add an API in serde Serializer for that.

The behaviors 1 and 3 could be chosen either via feature flag, or via a run-time option. I'll agree with any of the options or even a combination of them.

I didn't have situations where I needed a dynamic choice of this option, so the feature flag should be enough in the first stage and easier to implement.

The choice between support only escaped form, or both raw and escaped form also could be leaved as implementation details in the first stage. Ah, and I suspect, that currently we forbid only escaped form of , but not raw \0.

There is also a technical question for a feature flag: it should allow restricted chars, forbidden without them, or it should restrict chars, allowed without them? In both cases the flag is additive is some sense, but I think that the allowing behavior is better. Of course, we could have both flags, but does not allow to enable them simultaneously, but I think that is overkill for now.

Any thoughts?

Oct 27 '22 17:10 Mingun

Let's open an issue and collect a bit more information before we merge anything - for instance, what do other XML libraries do.

Actually, you've already open it -- #368.

Oct 27 '22 17:10 Mingun

Surprisingly, the only restricted character which is actually restricted is the escaped NULL:

fn test_restricted_chars_unescape() {
    assert_eq!(unescape("\u{0}").unwrap(), "\u{0}");
    // assert_eq!(unescape("&#x0;").unwrap(), "\u{0}"); // <-- this assert will fail with version 0.26.0

    assert_eq!(unescape("\u{1}").unwrap(), "\u{1}");
    assert_eq!(unescape("&#x1;").unwrap(), "\u{1}");
    assert_eq!(unescape("\u{2}").unwrap(), "\u{2}");
    assert_eq!(unescape("&#x2;").unwrap(), "\u{2}");
    assert_eq!(unescape("\u{3}").unwrap(), "\u{3}");
    assert_eq!(unescape("&#x3;").unwrap(), "\u{3}");
    assert_eq!(unescape("\u{4}").unwrap(), "\u{4}");
    assert_eq!(unescape("&#x4;").unwrap(), "\u{4}");
    assert_eq!(unescape("\u{5}").unwrap(), "\u{5}");
    assert_eq!(unescape("&#x5;").unwrap(), "\u{5}");
    assert_eq!(unescape("\u{6}").unwrap(), "\u{6}");
    assert_eq!(unescape("&#x6;").unwrap(), "\u{6}");
    assert_eq!(unescape("\u{7}").unwrap(), "\u{7}");
    assert_eq!(unescape("&#x7;").unwrap(), "\u{7}");
    assert_eq!(unescape("\u{8}").unwrap(), "\u{8}");
    assert_eq!(unescape("&#x8;").unwrap(), "\u{8}");

    assert_eq!(unescape("\u{b}").unwrap(), "\u{b}");
    assert_eq!(unescape("&#xB;").unwrap(), "\u{b}");
    assert_eq!(unescape("\u{c}").unwrap(), "\u{c}");
    assert_eq!(unescape("&#xC;").unwrap(), "\u{c}");

    assert_eq!(unescape("\u{e}").unwrap(), "\u{e}");
    assert_eq!(unescape("&#xE;").unwrap(), "\u{e}");
    assert_eq!(unescape("\u{f}").unwrap(), "\u{f}");
    assert_eq!(unescape("&#xF;").unwrap(), "\u{f}");
    assert_eq!(unescape("\u{10}").unwrap(), "\u{10}");
    assert_eq!(unescape("&#x10;").unwrap(), "\u{10}");
    assert_eq!(unescape("\u{11}").unwrap(), "\u{11}");
    assert_eq!(unescape("&#x11;").unwrap(), "\u{11}");
    assert_eq!(unescape("\u{12}").unwrap(), "\u{12}");
    assert_eq!(unescape("&#x12;").unwrap(), "\u{12}");
    assert_eq!(unescape("\u{13}").unwrap(), "\u{13}");
    assert_eq!(unescape("&#x13;").unwrap(), "\u{13}");
    assert_eq!(unescape("\u{14}").unwrap(), "\u{14}");
    assert_eq!(unescape("&#x14;").unwrap(), "\u{14}");
    assert_eq!(unescape("\u{15}").unwrap(), "\u{15}");
    assert_eq!(unescape("&#x15;").unwrap(), "\u{15}");
    assert_eq!(unescape("\u{16}").unwrap(), "\u{16}");
    assert_eq!(unescape("&#x16;").unwrap(), "\u{16}");
    assert_eq!(unescape("\u{17}").unwrap(), "\u{17}");
    assert_eq!(unescape("&#x17;").unwrap(), "\u{17}");
    assert_eq!(unescape("\u{18}").unwrap(), "\u{18}");
    assert_eq!(unescape("&#x18;").unwrap(), "\u{18}");
    assert_eq!(unescape("\u{19}").unwrap(), "\u{19}");
    assert_eq!(unescape("&#x19;").unwrap(), "\u{19}");
    assert_eq!(unescape("\u{1a}").unwrap(), "\u{1a}");
    assert_eq!(unescape("&#x1A;").unwrap(), "\u{1a}");
    assert_eq!(unescape("\u{1b}").unwrap(), "\u{1b}");
    assert_eq!(unescape("&#x1B;").unwrap(), "\u{1b}");
    assert_eq!(unescape("\u{1c}").unwrap(), "\u{1c}");
    assert_eq!(unescape("&#x1C;").unwrap(), "\u{1c}");
    assert_eq!(unescape("\u{1d}").unwrap(), "\u{1d}");
    assert_eq!(unescape("&#x1D;").unwrap(), "\u{1d}");
    assert_eq!(unescape("\u{1e}").unwrap(), "\u{1e}");
    assert_eq!(unescape("&#x1E;").unwrap(), "\u{1e}");
    assert_eq!(unescape("\u{1f}").unwrap(), "\u{1f}");
    assert_eq!(unescape("&#x1F;").unwrap(), "\u{1f}");

    assert_eq!(unescape("\u{7f}").unwrap(), "\u{7f}");
    assert_eq!(unescape("&#x7F;").unwrap(), "\u{7f}");
    assert_eq!(unescape("\u{80}").unwrap(), "\u{80}");
    assert_eq!(unescape("&#x80;").unwrap(), "\u{80}");
    assert_eq!(unescape("\u{81}").unwrap(), "\u{81}");
    assert_eq!(unescape("&#x81;").unwrap(), "\u{81}");
    assert_eq!(unescape("\u{82}").unwrap(), "\u{82}");
    assert_eq!(unescape("&#x82;").unwrap(), "\u{82}");
    assert_eq!(unescape("\u{83}").unwrap(), "\u{83}");
    assert_eq!(unescape("&#x83;").unwrap(), "\u{83}");
    assert_eq!(unescape("\u{84}").unwrap(), "\u{84}");
    assert_eq!(unescape("&#x84;").unwrap(), "\u{84}");

    assert_eq!(unescape("\u{86}").unwrap(), "\u{86}");
    assert_eq!(unescape("&#x86;").unwrap(), "\u{86}");
    assert_eq!(unescape("\u{87}").unwrap(), "\u{87}");
    assert_eq!(unescape("&#x87;").unwrap(), "\u{87}");
    assert_eq!(unescape("\u{88}").unwrap(), "\u{88}");
    assert_eq!(unescape("&#x88;").unwrap(), "\u{88}");
    assert_eq!(unescape("\u{89}").unwrap(), "\u{89}");
    assert_eq!(unescape("&#x89;").unwrap(), "\u{89}");
    assert_eq!(unescape("\u{8a}").unwrap(), "\u{8a}");
    assert_eq!(unescape("&#x8A;").unwrap(), "\u{8a}");
    assert_eq!(unescape("\u{8b}").unwrap(), "\u{8b}");
    assert_eq!(unescape("&#x8B;").unwrap(), "\u{8b}");
    assert_eq!(unescape("\u{8c}").unwrap(), "\u{8c}");
    assert_eq!(unescape("&#x8C;").unwrap(), "\u{8c}");
    assert_eq!(unescape("\u{8d}").unwrap(), "\u{8d}");
    assert_eq!(unescape("&#x8D;").unwrap(), "\u{8d}");
    assert_eq!(unescape("\u{8e}").unwrap(), "\u{8e}");
    assert_eq!(unescape("&#x8E;").unwrap(), "\u{8e}");
    assert_eq!(unescape("\u{8f}").unwrap(), "\u{8f}");
    assert_eq!(unescape("&#x8F;").unwrap(), "\u{8f}");
    assert_eq!(unescape("\u{90}").unwrap(), "\u{90}");
    assert_eq!(unescape("&#x90;").unwrap(), "\u{90}");
    assert_eq!(unescape("\u{91}").unwrap(), "\u{91}");
    assert_eq!(unescape("&#x91;").unwrap(), "\u{91}");
    assert_eq!(unescape("\u{92}").unwrap(), "\u{92}");
    assert_eq!(unescape("&#x92;").unwrap(), "\u{92}");
    assert_eq!(unescape("\u{93}").unwrap(), "\u{93}");
    assert_eq!(unescape("&#x93;").unwrap(), "\u{93}");
    assert_eq!(unescape("\u{94}").unwrap(), "\u{94}");
    assert_eq!(unescape("&#x94;").unwrap(), "\u{94}");
    assert_eq!(unescape("\u{95}").unwrap(), "\u{95}");
    assert_eq!(unescape("&#x95;").unwrap(), "\u{95}");
    assert_eq!(unescape("\u{96}").unwrap(), "\u{96}");
    assert_eq!(unescape("&#x96;").unwrap(), "\u{96}");
    assert_eq!(unescape("\u{97}").unwrap(), "\u{97}");
    assert_eq!(unescape("&#x97;").unwrap(), "\u{97}");
    assert_eq!(unescape("\u{98}").unwrap(), "\u{98}");
    assert_eq!(unescape("&#x98;").unwrap(), "\u{98}");
    assert_eq!(unescape("\u{99}").unwrap(), "\u{99}");
    assert_eq!(unescape("&#x99;").unwrap(), "\u{99}");
    assert_eq!(unescape("\u{9a}").unwrap(), "\u{9a}");
    assert_eq!(unescape("&#x9A;").unwrap(), "\u{9a}");
    assert_eq!(unescape("\u{9b}").unwrap(), "\u{9b}");
    assert_eq!(unescape("&#x9B;").unwrap(), "\u{9b}");
    assert_eq!(unescape("\u{9c}").unwrap(), "\u{9c}");
    assert_eq!(unescape("&#x9C;").unwrap(), "\u{9c}");
    assert_eq!(unescape("\u{9d}").unwrap(), "\u{9d}");
    assert_eq!(unescape("&#x9D;").unwrap(), "\u{9d}");
    assert_eq!(unescape("\u{9e}").unwrap(), "\u{9e}");
    assert_eq!(unescape("&#x9E;").unwrap(), "\u{9e}");
    assert_eq!(unescape("\u{9f}").unwrap(), "\u{9f}");
    assert_eq!(unescape("&#x9F;").unwrap(), "\u{9f}");
}

Oct 27 '22 17:10 sashka

It appears that C#'s dotnet XML library has a flag for this: https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlreadersettings.checkcharacters?view=net-6.0

Same for writing: https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlwritersettings.checkcharacters?view=net-6.0

Surprisingly, the only restricted character which is actually restricted is the escaped NULL:

I'm not surprised given I filed https://github.com/tafia/quick-xml/issues/368. It's something we needed to improve anyways. Do you have any avenue to report the invalid XML? Obviously that's not an immediate solution, I am just curious.

Oct 27 '22 18:10 dralley

I'll keep all the work for allow-restricted-chars right there: https://github.com/sashka/quick-xml/tree/allow-restricted-chars

Do you have any avenue to report the invalid XML? Obviously that's not an immediate solution, I am just curious.

No, vendors don't care about general XML validity as soon as their XML files are not for public distribution, and their own tools process their own files perfectly.

Oct 27 '22 22:10 sashka

The following behaviors are possible:

emit hard error (= stop parsing / writing and return error)

silently skip invalid characters (or at least log them, but we currently does not depend on any logger)

accept invalid characters

Each behavior could be applied to a raw or an escaped form of characters.

I think, that behavior 2 is an anti-pattern and we should not allow it. I cannot imagine a situation where you would need to silently change the data that you operate. If you for whatever reason need that, then you can done that manually in Writer API and we can add an API in serde Serializer for that.

Agree that we shouldn't skip over invalid data. Seems like a terrible approach.

The behaviors 1 and 3 could be chosen either via feature flag, or via a run-time option. I'll agree with any of the options or even a combination of them.

I didn't have situations where I needed a dynamic choice of this option, so the feature flag should be enough in the first stage and easier to implement.

The choice between support only escaped form, or both raw and escaped form also could be leaved as implementation details in the first stage. Ah, and I suspect, that currently we forbid only escaped form of �, but not raw \0.

There is also a technical question for a feature flag: it should allow restricted chars, forbidden without them, or it should restrict chars, allowed without them? In both cases the flag is additive is some sense, but I think that the allowing behavior is better. Of course, we could have both flags, but does not allow to enable them simultaneously, but I think that is overkill for now.

Any thoughts?

An initial conservative approach could be to just provide an advanced version of the various escape / unescape functions which would allow you to configure the behavior either way at runtime. I think I would prefer that approach over a feature flag.

Agreed that we're still missing the checks for these characters in the raw form and that we ought to have them.

Reading through the MS docs again, it appears that CheckCharacters applies only to character entities. Names and raw text still need to be completely valid and contain no illegal characters no matter what - and it's checked always. @sashka I assume all of the illegal characters are at least encoded as character references and not present in raw form? If that's the case then copying that approach would be sufficient. And as the configuration would only apply to (un)escaping, there wouldn't be issues with needing a pseudo-global configuration.

Oct 28 '22 05:10 dralley

Does that sound right @sashka ?

Nov 09 '22 00:11 dralley

@dralley, yes, it sounds right for me. I'm working on proper reading mechanics now.

Nov 10 '22 16:11 sashka

@sashka Do you plan to continue with this?

Jul 10 '23 20:07 dralley

Allow `&#0;` values

Codecov Report

Allow `` values