xml-rs

Encoding

Open jangernert opened this issue 4 years ago • 2 comments

  • fix some basic warnings like try!() -> ?
  • parse the encoding as a WHATWG label
  • convert text in XmlEvent::Characters to utf-8 if a different encoding was detected

Basic idea was discussed here: https://github.com/feed-rs/feed-rs/issues/58

jangernert, Jun 24 '20 08:06

This looks like it's implemented purely as a level above the existing parser. In particular, if I'm reading it correctly:

  • It operates as a level above the existing pull parser, which assumes UTF-8. I believe that parser won't successfully read the start event (<?xml version="1.0" encoding="..."?>) if those characters aren't represented the same way as in ASCII, and I believe it will fail if it encounters any invalid UTF-8 sequences.
  • It only translates XmlEvent::Characters, not processing instructions, tag names, attribute names/values, or comments.

I think if you add tests, you'll find that it doesn't work. In particular:

  • For UTF-16, it will fail, because it can't detect the encoding.
  • Likewise single-byte encodings that aren't ASCII-compatible, like EBCDIC.
  • For single-byte encodings that are ASCII-compatible (e.g. ISO-8859-1), it will work only in the ASCII range. Bytes with the high bit set will fail because they aren't valid UTF-8 sequences. So the only change is that it produces errors more often (correctly or not).
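
To illustrate the ISO-8859-1 case, here is a minimal sketch using only the Rust standard library. The helper names `is_valid_utf8` and `decode_latin1` are made up for this example and are not part of xml-rs. It shows that a Latin-1 byte above 0x7F is rejected by a UTF-8 check, while decoding it as Latin-1 first gives the expected text.

```rust
use std::str;

/// Hypothetical helper: true if a parser that assumes UTF-8 could read these bytes as-is.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

/// Hypothetical helper: decode ISO-8859-1 (Latin-1) bytes.
/// Every Latin-1 byte maps to the Unicode code point with the same numeric value,
/// so a plain `u8 as char` cast is enough.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}
```

With the Latin-1 bytes `[0x63, 0x61, 0x66, 0xE9]` ("caf" followed by 0xE9), `is_valid_utf8` returns `false`, because 0xE9 is a UTF-8 lead byte with no continuation bytes after it, while `decode_latin1` returns "café".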

I think the right approach is:

  1. When an external encoding is supplied, try that.
  2. Otherwise, try to detect the encoding from the first bytes of the file, as in the specification.
  3. Then decode all input bytes using that encoding, including the XML delimiters.
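
The first two steps could be sketched roughly as follows. This is an assumption-laden illustration, not code from xml-rs or from the PR: the `Guess` enum and `sniff_encoding` function are hypothetical names, the byte patterns follow the autodetection table in Appendix F of the XML 1.0 specification, and the UTF-32/UCS-4 patterns are left out for brevity (so a UTF-32 LE byte order mark would be misread as UTF-16 LE here).

```rust
/// Hypothetical result of looking at the first bytes of a document.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Guess {
    Utf8,
    Utf16Be,
    Utf16Le,
    /// "<?xm" in an ASCII-compatible encoding: the encoding declaration
    /// must be read to find the actual encoding.
    AsciiCompatible,
    /// "<?xm" in some EBCDIC code page.
    Ebcdic,
}

/// Step 1: prefer an externally supplied encoding (e.g. from an HTTP header).
/// Step 2: otherwise guess from the first bytes of the input.
fn sniff_encoding(external: Option<Guess>, head: &[u8]) -> Guess {
    if let Some(enc) = external {
        return enc;
    }
    match head {
        // Byte order marks.
        [0xEF, 0xBB, 0xBF, ..] => Guess::Utf8,
        [0xFE, 0xFF, ..] => Guess::Utf16Be,
        [0xFF, 0xFE, ..] => Guess::Utf16Le,
        // No byte order mark: look for "<?" or "<?xm" in each encoding family.
        [0x00, 0x3C, 0x00, 0x3F, ..] => Guess::Utf16Be,
        [0x3C, 0x00, 0x3F, 0x00, ..] => Guess::Utf16Le,
        [0x3C, 0x3F, 0x78, 0x6D, ..] => Guess::AsciiCompatible,
        [0x4C, 0x6F, 0xA7, 0x94, ..] => Guess::Ebcdic,
        // No byte order mark and no declaration: the spec's default is UTF-8.
        _ => Guess::Utf8,
    }
}
```

Step 3 would then run every input byte through a decoder for the chosen encoding before the tokenizer sees it, so that delimiters such as `<` and `?>` are recognized in UTF-16 or EBCDIC input as well.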

scottlamb, Nov 16 '21 18:11

Yes, I'm afraid this approach does not follow the XML spec. I have some initial attempts to tackle the encoding problem (including a streaming wrapper around a BufRead which performs decoding into Rust strings) in the parser-rearchitecture branch, but it is unlikely I'll be able to finish this work :(

netvl, Nov 18 '21 20:11

The encoding crate is obsolete. Instead of using it, I've added support for latin1 and ASCII.

kornelski, May 11 '23 00:05