encoding_rs icon indicating copy to clipboard operation
encoding_rs copied to clipboard

Non-streaming decode() appears to remove the BOM?

Open dralley opened this issue 3 years ago • 3 comments

https://github.com/hsivonen/encoding_rs/blob/d4d7d2a99aac266ecf6938c3832aefaaf8c1e52b/src/lib.rs#L2974-L2980

Functionally, decode() and decode_with_bom_removal() seem pretty much the same? That doesn't seem correct? If there's a variant called "decode_with_bom_removal" then I would expect the standard variant not to remove the BOM.

Compare to:

https://github.com/hsivonen/encoding_rs/blob/d4d7d2a99aac266ecf6938c3832aefaaf8c1e52b/src/lib.rs#L3019-L3030

It's totally valid to decode the BOM, the BOM is a unicode character like any other character. Decoding a UTF-16 document with a BOM should yield a UTF-8 document with a BOM. Otherwise, you would just use the BOM-removing version...

use encoding_rs::*;

fn main() {
    // Two characters, '1' and then BOM character
    println!("{:?}", UTF_16LE.decode(&[0x31, 0x00, 0xFF, 0xFE]).0.as_bytes()); 
    // Nothing - BOM removed
    println!("{:?}", UTF_16LE.decode(&[0xFF, 0xFE]).0.as_bytes());
}
[49, 239, 187, 191]
[]

dralley avatar Sep 05 '22 22:09 dralley

The same thing happens with the streaming API. Am I missing something? Why does the decoder produced by new_decoder() strip the BOM when there is a separate method named new_decoder_with_bom_removal() that I am explicitly not using?

use encoding_rs::*; 

fn main() {
  
    let mut buf = [0; 100];
    
    dbg!(UTF_16LE.new_decoder().decode_to_utf8(
        &[0x31, 0x00, 0xFF, 0xFE],
        &mut buf,
        true,
    ));
    
    println!("{:?}", &buf);
    
    let mut buf = [0; 100];
    
    dbg!(UTF_16LE.new_decoder().decode_to_utf8(
        &[0xFF, 0xFE],
        &mut buf,
        true,
    ));
    
    println!("{:?}", &buf);
}
[src/main.rs:11] UTF_16LE.new_decoder().decode_to_utf8(&[0x31, 0x00, 0xFF, 0xFE], &mut buf,
    true) = (
    InputEmpty,
    4,
    4,
    false,
)
[src/main.rs:21] UTF_16LE.new_decoder().decode_to_utf8(&[0xFF, 0xFE], &mut buf, true) = (
    InputEmpty,
    2,
    0,
    false,
)

[49, 239, 187, 191, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

dralley avatar Sep 05 '22 22:09 dralley

@hsivonen Is there a detail that I'm missing?

dralley avatar Sep 09 '22 15:09 dralley

I could try to put up a PR, I just want to verify that there's not a reason behind this

dralley avatar Sep 14 '22 19:09 dralley

  • decode changes the encoding that's used according to the BOM and removes the BOM.
  • decode_with_bom_removal removes the (leading) BOM if it matches the BOM for the encoding but doesn't change the encoding.
  • decode_without_bom_handling gives no special treatment to the BOM.

AFAICT, these are working as documented.

hsivonen avatar Sep 15 '22 13:09 hsivonen

@hsivonen If what you say is true then the name of decode_with_bom_removal() doesn't indicate the actual difference between decode_with_bom_removal() and decode(). At minimum that is highly confusing.

It sounds like there is no way to respect the BOM (use the BOM-indicated encoding) without removing it, which is problematic - it's valid unicode, and sometimes you do want to keep it. For instance, if the data came from UTF-16 LE, and you want to perform some processing on it as UTF-8, before re-encoding it as UTF-16 LE. You could write the BOM separately, but that is sometimes less convenient than simply leaving it there to begin with.

dralley avatar Sep 15 '22 13:09 dralley

Furthermore I don't see any documentation indicating that decode() removes the BOM, could you point out where it is documented?

dralley avatar Sep 15 '22 13:09 dralley

So ignoring the documentation issue, I can understand why this is the case due to the way the WHATWG spec defines how decode should be done (though the name "_with_bom_removal()" is still misleading and doesn't quite match what the spec does).

Would you be open to merging function which did not do this?

dralley avatar Sep 16 '22 16:09 dralley

Furthermore I don't see any documentation indicating that decode() removes the BOM, could you point out where it is documented?

Good point. It says "BOM sniffing" without saying explicitly that BOM sniffing means BOM removal.

What's the use case for deciding the encoding from BOM but still letting the BOM show through to output?

hsivonen avatar Sep 20 '22 05:09 hsivonen

I'm working on improving the encoding support of quick-xml. One idea I had is that a user may want to get either:

  • a normalized series of XML events, which strips any non-meaningful whitespace including anything prior to the XML delcaration (including the BOM), or
  • a series of XML events, which would include whitespace between elements (and the BOM), which if passed back into a writer would exactly reproduce the document

The latter would potentially simplify making edits to existing documents, given that some software wants to see a UTF-8 BOM even though the spec recommends against it.

In order to pull that off, the BOM would need to not be stripped automatically. Whether it gets removed or not (alongside other things) would be handled by the Reader.

I'm open to a counterargument that it's not a good idea, though. One potential issue with the use case I just described is that there would be no way to automatically handle indentation for new additions into the document, because automatic indentation would need to be turned off to "exactly reproduce" everything else. But I think we would still need to know whether the BOM was present, at least.

dralley avatar Sep 20 '22 12:09 dralley

The BOM isn't part of the XML information set, so an XML use case hasn't arisen before. Does you exact reproduction mode also retain whitespace on either side of the equals sign for attributes, etc.?

Since this issue is filed against the non-streaming API, you can easily find out if there was a BOM by inspecting the buffer with Encoding::for_bom().

hsivonen avatar Sep 20 '22 12:09 hsivonen

Well, the streaming API does the same thing as mentioned in the first comment after the OP, I just first encountered this while doing some experimentation. The streaming API is what we would be using in practice.

The BOM isn't part of the XML information set

It's not, but neither is inter-element whitespace or any data prior to the XML declaration, hence why I figured it might make sense to treat them similarly. The BOM is relevant to some (mostly Windows) software even though, for UTF-8, it shouldn't be.

Does you exact reproduction mode also retain whitespace on either side of the equals sign for attributes, etc.?

At the moment this is just a concept, the actual encoding support is more important and is the main focus. You're right that it's non-trivial and would have to encompass a lot more than just whitespace between events. But I think that would be possible. Whether it's actually worthwhile, I'm not sure yet.

dralley avatar Sep 20 '22 12:09 dralley

Well, the streaming API does the same thing as mentioned in the first comment after the OP, I just first encountered this while doing some experimentation. The streaming API is what we would be using in practice.

I see. Still, I think at least for now, I'm going to continue to treat this as out of scope, since the BOM with BOM semantics isn't supposed to be part of the logical content of the stream.

hsivonen avatar Sep 20 '22 13:09 hsivonen

the BOM with BOM semantics isn't supposed to be part of the logical content of the stream.

This is the part I'm unsure on. Nothing about the BOM is particularly special, it's just a unicode character, etc. Isn't it just part of the document, then? Unicode says it should present as an invisible zero-width non-breaking space, which implies that it is expected to make it through to the presentation layer sometimes, and not just be stripped off in all cases.

Various documents I've read don't spell it out explicitly but they make it seem like part of the data stream itself, more akin to the #! shebang line at the beginning of a Python/Perl/Bash script than magic bytes at the beginning of a zip file. In one sense, the shebang line isn't "part of the script", but in another sense, it's just as much a part of the script as any other comment in the script.

https://www.w3.org/International/questions/qa-byte-order-mark https://unicode.org/faq/utf_bom.html#bom1

dralley avatar Sep 20 '22 14:09 dralley

See The Unicode Standard 15.0 D98 second bullet last sentence on page 131 (PDF page 157). See also the note in The Encoding Standard that explains that the naming doesn't match the naming in The Unicode Standard.

Note that the XML spec says of the BOM: "This is an encoding signature, not part of either the markup or the character data of the XML document." Also, the BNF in the XML spec doesn't consider the BOM part of the textual syntax.

hsivonen avatar Sep 21 '22 06:09 hsivonen

That appears to be correct for UTF-16. Regarding UTF-8, the bullet points under D95 on the previous page are frustratingly unclear - it has no such language and actually seems to imply the opposite.

While there is obviously no need for a byte order signature when using UTF-8, there are occasions when processes convert UTF-16 or UTF-32 data contain- ing a byte order mark into UTF-8. When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Stan- dard, but its presence does not affect conformance to the UTF-8 encoding scheme. Identification of the <EF BB BF> byte sequence at the beginning of a data stream can, however, be taken as a near-certain indication that the data stream is using the UTF-8 encoding scheme.

But considering I really don't feel like lawyering the spec to that degree, point taken, we should probably just accept that the BOM will be stripped and force the user to write it themselves on the output side, if they want it.

I will file a new issue about documentation.

dralley avatar Sep 23 '22 02:09 dralley