[RFE] User-provided encoding detection function
As an example, the XML specification recommends a special encoding detection scheme in cases where the BOM doesn't exist: https://www.w3.org/TR/xml11/#sec-guessing
In the event that a single-byte, ASCII-compatible encoding is being used, you're supposed to inspect the XML declaration to determine which specific encoding to use.
My initial thought about implementing this: the code could read a full buffer of data (instead of using `BomPeeker`) and pass a reference to that buffer directly to the encoding detection functions (`Encoding::for_bom(&[u8])` and/or a user-provided one), adjusting `self.pos` if necessary to skip past the BOM.
That would also simplify the code, with the caveat that the user must ensure the buffer is large enough for the detection schemes (e.g., at least 3 bytes for BOM detection), but that feels like a reasonable restriction?
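For context on the "at least 3 bytes" caveat, here is a dependency-free sketch of what BOM sniffing does with such a buffer. `sniff_bom` is a made-up stand-in for encoding_rs's `Encoding::for_bom`, which returns `Option<(&'static Encoding, usize)>`; plain string names are used here so the sketch compiles without the crate:

```rust
/// Sketch of BOM sniffing: return the detected encoding's name and the number
/// of BOM bytes the caller should skip before decoding. (Stand-in for
/// `Encoding::for_bom`; not the real encoding_rs implementation.)
fn sniff_bom(buf: &[u8]) -> Option<(&'static str, usize)> {
    if buf.starts_with(&[0xEF, 0xBB, 0xBF]) {
        // The 3-byte UTF-8 BOM is why the buffer must hold at least 3 bytes.
        Some(("UTF-8", 3))
    } else if buf.starts_with(&[0xFF, 0xFE]) {
        Some(("UTF-16LE", 2))
    } else if buf.starts_with(&[0xFE, 0xFF]) {
        Some(("UTF-16BE", 2))
    } else {
        None
    }
}
```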
It could look something like this:
```rust
pub fn xml_detect_encoding(bytes: &[u8]) -> Option<&'static Encoding> {
    match bytes {
        // Some BE encoding, for example, UTF-16BE or ISO-10646-UCS-2
        _ if bytes.starts_with(&[0x00, b'<', 0x00, b'?']) => Some(UTF_16BE),
        // Some LE encoding, for example, UTF-16LE or ISO-10646-UCS-2
        _ if bytes.starts_with(&[b'<', 0x00, b'?', 0x00]) => Some(UTF_16LE),
        // Some ASCII-compatible encoding
        _ if bytes.starts_with(&[b'<', b'?', b'x', b'm']) => {
            unimplemented!(r#"parse the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8" standalone="no"?>"#)
        }
        _ => None,
    }
}
```
```rust
let f = File::open("inputdata.xml")?;
let mut rdr = DecodeReaderBytes::new(f);
rdr.detect_encoding_with(xml_detect_encoding)?;
assert_eq!(rdr.encoding(), UTF_16LE);
```
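The `unimplemented!` arm above would need to pull the encoding name out of the declaration. A minimal, deliberately non-conforming sketch of that step (the function name is made up, and a real implementation should follow the XML spec's grammar, e.g. it shouldn't match `encoding` inside some other attribute value):

```rust
/// Hypothetical helper: extract the `encoding` attribute value from an
/// ASCII-compatible XML declaration such as
/// `<?xml version="1.0" encoding="UTF-8"?>`. Sketch only, not a real parser.
fn parse_declared_encoding(bytes: &[u8]) -> Option<&str> {
    // The declaration itself is ASCII, so locate the closing `?>` at the
    // byte level before attempting UTF-8 conversion of the prefix.
    let end = bytes.windows(2).position(|w| w == b"?>")?;
    let decl = std::str::from_utf8(&bytes[..end]).ok()?;
    // Find `encoding`, then skip `=` and the opening quote.
    let idx = decl.find("encoding")?;
    let rest = decl[idx + "encoding".len()..].trim_start();
    let rest = rest.strip_prefix('=')?.trim_start();
    let quote = rest.chars().next()?;
    if quote != '"' && quote != '\'' {
        return None;
    }
    let rest = &rest[1..];
    let close = rest.find(quote)?;
    Some(&rest[..close])
}
```

The returned name would then still have to be mapped to a `&'static Encoding`, e.g. via `Encoding::for_label`.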
It could be a `DecodeReaderBytesBuilder` option instead of an explicit function call, but it feels like a slightly different category from the other options.
I'm unsure what the precise implementation should be (I don't have the code paged into context), but I think this is maybe something in scope?
Thoughts:
- Having it as a method on the reader feels wrong to me, because then you have to call that method before using the reader. This seems like it should definitively be a builder option.
- I'm worried that things aren't actually as simple as your API sketch suggests. The `Read` trait provides no guarantees about how much data is read per call, so the actual encoding detection algorithm and API need to be incremental. That greatly increases complexity.
- How does this new option integrate with all of the rest of the options?
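To illustrate the incremental point: a detector that only ever sees whatever prefix happens to be buffered needs a three-way result, not just `Option`. A sketch of one possible shape (all names here are hypothetical; nothing like this exists in encoding_rs_io today, and encoding names stand in for `&'static Encoding`):

```rust
/// Hypothetical result type for an incremental detector. The reader would call
/// the detector with the prefix buffered so far, plus an EOF flag.
#[derive(Debug, PartialEq)]
enum Detection {
    /// Decided; skip `skip` prefix bytes (e.g. a BOM) before decoding.
    Found { encoding: &'static str, skip: usize },
    /// Not enough buffered data to decide yet; the reader must read more.
    NeedMore,
    /// Detection gave up; fall back to the configured default.
    Undetermined,
}

/// Toy XML-flavored detector showing why `NeedMore` is necessary: with fewer
/// than 4 bytes and no EOF, no branch can safely fire yet.
fn detect_xml(prefix: &[u8], eof: bool) -> Detection {
    if prefix.len() < 4 && !eof {
        return Detection::NeedMore;
    }
    if prefix.starts_with(&[0x00, b'<', 0x00, b'?']) {
        Detection::Found { encoding: "UTF-16BE", skip: 0 }
    } else if prefix.starts_with(&[b'<', 0x00, b'?', 0x00]) {
        Detection::Found { encoding: "UTF-16LE", skip: 0 }
    } else {
        Detection::Undetermined
    }
}
```

A builder option could then accept something like `fn(&[u8], bool) -> Detection` and drive it in a loop until the detector decides or the buffer fills, which is where the interaction with the other options would have to be pinned down.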