xml-rs
Encoding
- fix some basic warnings like `try!()` -> `?`
- parse the encoding WHATWG label
- convert text in `XmlEvent::Characters` to UTF-8 if a different encoding was detected
Basic idea was discussed here: https://github.com/feed-rs/feed-rs/issues/58
This looks like it's implemented purely as a level above the existing parser. In particular, if I'm reading it correctly:
- It operates as a level above the existing pull parser, which assumes UTF-8. I believe that parser won't successfully read the start event (`<?xml version="1.0" encoding="..."?>`) if those characters aren't represented as they are in ASCII, and I believe it will fail if it encounters any invalid UTF-8 sequences.
- It only translates `XmlEvent::Characters`, not processing instructions, tag names, attribute names/values, or comments.
I think if you add tests, you'll find that it doesn't work. In particular:
- For UTF-16, it will fail, because it can't detect the encoding.
- Likewise single-byte encodings that aren't ASCII-compatible, like EBCDIC.
- For single-byte encodings that are ASCII-compatible (e.g. ISO-8859-1), it will work only in the ASCII range. Bytes with the high bit set will fail because they aren't valid UTF-8 sequences. So the only change is that it produces errors more often (correctly or not).
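The last point can be checked with the standard library alone. This is a minimal sketch, not code from the PR; the helper name `is_valid_utf8` and the two constants are made up for illustration.

```rust
/// Hypothetical helper: reports whether `bytes` would be accepted by a
/// reader that insists on UTF-8 input.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// "café" encoded as ISO-8859-1: 'é' is the single byte 0xE9.
const LATIN1_CAFE: &[u8] = b"caf\xE9";

/// The same word encoded as UTF-8: 'é' is the two bytes 0xC3 0xA9.
const UTF8_CAFE: &[u8] = "café".as_bytes();
```

Here `is_valid_utf8(LATIN1_CAFE)` is false while `is_valid_utf8(UTF8_CAFE)` is true, so a UTF-8-only parser rejects the Latin-1 input before any later conversion step gets to see it.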
I think the right approach is:
- When an external encoding is supplied, try that.
- Otherwise, try to detect the encoding from the first bytes of the file, as in the specification.
- Then decode all input bytes using that encoding, including the XML delimiters.
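The first two steps could look roughly like the sketch below. It follows the byte patterns listed in Appendix F of the XML 1.0 specification, leaves out the UTF-32/UCS-4 cases for brevity, and uses a hypothetical `Detected` enum and `detect_encoding` function that are not part of xml-rs.

```rust
/// Encoding families that can be told apart from the first bytes of an XML
/// document. Hypothetical type for this sketch; not an xml-rs API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Detected {
    Utf8,
    Utf16Be,
    Utf16Le,
    Ebcdic,
    /// Starts with ASCII `<?xm`: the `encoding` pseudo-attribute of the XML
    /// declaration must be read to find the actual encoding.
    AsciiCompatible,
}

/// Pick an encoding: an externally supplied one wins, otherwise sniff the
/// first bytes of the input. UTF-32/UCS-4 patterns are omitted here.
fn detect_encoding(external: Option<Detected>, head: &[u8]) -> Detected {
    if let Some(enc) = external {
        return enc;
    }
    match head {
        // Byte order marks.
        [0xEF, 0xBB, 0xBF, ..] => Detected::Utf8,
        [0xFE, 0xFF, ..] => Detected::Utf16Be,
        [0xFF, 0xFE, ..] => Detected::Utf16Le,
        // No BOM: look at how `<?` or `<?xm` is laid out.
        [0x00, 0x3C, 0x00, 0x3F, ..] => Detected::Utf16Be,
        [0x3C, 0x00, 0x3F, 0x00, ..] => Detected::Utf16Le,
        [0x4C, 0x6F, 0xA7, 0x94, ..] => Detected::Ebcdic,
        [0x3C, 0x3F, 0x78, 0x6D, ..] => Detected::AsciiCompatible,
        // No BOM and no XML declaration: the spec default is UTF-8.
        _ => Detected::Utf8,
    }
}
```

The third step would then feed every input byte through a decoder for the chosen encoding before the tokenizer sees it, so that delimiters, names, attributes, and comments are all handled the same way as character data.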
Yes, I'm afraid this approach does not follow the XML spec. I have some initial attempts to tackle the encoding problem (including a streaming wrapper around a `BufRead` which performs decoding into Rust strings) in the parser-rearchitecture branch, but it is unlikely I'll be able to finish this work :(
The `encoding` crate is obsolete. Instead of using it, I've added support for latin1 and ASCII.
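For reference, both encodings are simple enough to decode without a dependency. The sketch below is an illustration under that assumption, not the code that was actually added to the crate; `decode_latin1` and `decode_ascii` are hypothetical names.

```rust
/// Decode ISO-8859-1 (latin1): every byte maps to the Unicode code point
/// with the same numeric value, so decoding cannot fail.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Decode strict ASCII. On failure, returns the index of the first byte
/// that has the high bit set.
fn decode_ascii(bytes: &[u8]) -> Result<String, usize> {
    match bytes.iter().position(|b| !b.is_ascii()) {
        Some(index) => Err(index),
        None => Ok(bytes.iter().map(|&b| char::from(b)).collect()),
    }
}
```

With these, `decode_latin1(b"caf\xE9")` yields `"café"`, while `decode_ascii` on the same bytes reports the offending byte at index 3.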