guava: Null character should be omitted by XmlEscapers

Per the XML specification (https://www.w3.org/TR/REC-xml/#charsets), control characters below #x20 (except #x9, #xA, and #xD) are not allowed in XML documents and should be omitted. However, XmlEscapers.xmlContentEscaper() does not account for this.

Steps to reproduce:

```
CharEscaper xmlContentEscaper = (CharEscaper) XmlEscapers.xmlContentEscaper();
assertEquals("XML 1.1 should omit #x0", "ab", xmlContentEscaper.escape("a\u0000b")); // fails; should omit the control char
```

Dec 24 '18 09:12 devikasondhi

The issue also holds for characters in the surrogate range #xD800-#xDFFF.

Dec 24 '18 09:12 devikasondhi

Thanks. I thought we had a bug for this, but I can't find it.

As you note, null is forbidden under XML 1.1 as well as 1.0 (though 1.1 does permit some characters that 1.0 does not, IIRC). We link to the XML 1.0 spec but should be clearer about the fact that that's what we use. And of course we should also document the broken behavior, even if we can't fix it yet.

Jan 30 '19 19:01 cpovirk

We've come across at least a couple places in our codebase that produce XML with non-Guava APIs (e.g., DOM APIs), which also need to know if a character is valid, separate from the escaping that is performed by the underlying APIs if necessary.

Even with the Guava API, users might be interested in the ability to scan a string for invalid characters so as to replace them with ? or something before escaping them, rather than silently discarding them. Of course, such functionality could also be provided by having our escapers accept an optional CodingErrorAction-like strategy parameter (with the existing methods presumably also changed to at least implement one of the reasonable strategies).
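To make the idea concrete, here is a minimal sketch of what such a pre-escaping scan could look like. It is not a Guava API: `XmlChars`, `InvalidCharAction`, `isValidXml10`, and `sanitize` are hypothetical names, and the strategy enum is only loosely modeled on `CodingErrorAction`. The validity check follows the `Char` production of the XML 1.0 spec, and the scan walks code points so that a well-formed surrogate pair is accepted while an unpaired surrogate is treated as invalid.

```java
// Hypothetical helper, not part of Guava: a sketch of scanning for characters
// that XML 1.0 does not permit, with a caller-chosen strategy for handling them.
final class XmlChars {
  private XmlChars() {}

  /** Hypothetical strategy enum, loosely modeled on java.nio.charset.CodingErrorAction. */
  enum InvalidCharAction { IGNORE, REPLACE, REPORT }

  /** Returns true if the code point matches the XML 1.0 Char production. */
  static boolean isValidXml10(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  /** Drops, replaces with '?', or rejects every code point that XML 1.0 forbids. */
  static String sanitize(String s, InvalidCharAction action) {
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ) {
      // codePointAt returns the surrogate value itself when the surrogate is unpaired.
      int cp = s.codePointAt(i);
      if (isValidXml10(cp)) {
        out.appendCodePoint(cp);
      } else {
        switch (action) {
          case IGNORE:
            break; // silently discard
          case REPLACE:
            out.append('?');
            break;
          case REPORT:
            throw new IllegalArgumentException(
                "Invalid XML 1.0 character U+" + Integer.toHexString(cp).toUpperCase()
                    + " at index " + i);
        }
      }
      i += Character.charCount(cp);
    }
    return out.toString();
  }
}
```

Under these assumptions, a caller could sanitize first and escape second, e.g. `XmlEscapers.xmlContentEscaper().escape(XmlChars.sanitize(input, XmlChars.InvalidCharAction.REPLACE))`, and code that writes XML through DOM APIs could call `isValidXml10` on its own.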

All that said, I don't think we're any more likely to provide such functionality this year than we were in 2018, 2019, 2020, 2021, 2022, or 2023 :(

Jun 14 '24 20:06 cpovirk