encoding_rs_io
Always transcode Utf8
I'm using encoding_rs_io to produce a stream that is always valid UTF-8, because invalid UTF-8 is not handled upstream. The way the options are laid out at present, there seems to be no way to force transcoding when the file has no BOM. I think I found a way to do it by layering multiple DecodeReaderBytes over each other, but I'm unsure that it works in all cases, and I'm a little dismayed that it requires multiple layers instead of a single option to force transcoding.
Here's the code I have today:
use std::io::{Cursor, Read};

use encoding_rs::UTF_8;
use encoding_rs_io::DecodeReaderBytesBuilder;

pub fn new_utf8_reader(data: &[u8]) -> impl Read + '_ {
    let cursor = Cursor::new(data);
    // The first layer has UTF-8 passthrough, and the second has
    // no passthrough but an explicit encoding. This unexpected
    // chain was concocted to handle the case where the file has
    // no BOM and is encoded with something other than UTF-8 or
    // contains invalid UTF-8 characters. Basically, this
    // forces transcoding.
    // When there is a non-UTF-8 BOM, the first layer will transcode
    // to UTF-8 (so will the second, redundantly).
    // When there is no BOM or a UTF-8 BOM, the second layer will
    // transcode to UTF-8.
    let uncorrected = DecodeReaderBytesBuilder::new()
        .utf8_passthru(true)
        .build(cursor);
    DecodeReaderBytesBuilder::new()
        .encoding(Some(UTF_8))
        .strip_bom(true)
        .build(uncorrected)
}
Is this the best way to force transcoding to utf8 in the presence of unknown data (which may or may not contain a BOM, and may or may not be valid) given the API today?
I don't think I'm the only one with this problem. It took some time to figure out an answer. Would it be worth it to do one of the following...
- Add this usage to the documentation as a recipe?
- Introduce a new 'force transcoding' option?
- Add a factory function that does this?
Of course, as soon as I open an issue, I find another way. Is this equivalent to the previous version?
pub fn new_utf8_reader(data: &[u8]) -> impl Read + '_ {
    let cursor = Cursor::new(data);
    // Force transcoding to UTF-8 in all cases,
    // whether there is a BOM or not, and whether
    // the input is valid or not.
    DecodeReaderBytesBuilder::new()
        .encoding(Some(UTF_8))
        .bom_override(true)
        .build(cursor)
}
Yes, your second comment is correct.
The parameter space for behavior here is unfortunately very large, so it can be tricky to discover the right set of parameters. I wouldn't be opposed to a FAQ or something similarish.
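Since the right combination of options is hard to discover, a small check can help confirm that a given configuration really does emit only valid UTF-8. The sketch below is one possible way to do that using only the standard library. The helper name `yields_valid_utf8` is hypothetical and not part of encoding_rs_io; it accepts anything that implements `Read`, so it does not depend on how the reader was built.

```rust
use std::io::{self, Read};

/// Drains `reader` and reports whether the bytes it produced are
/// well-formed UTF-8.
///
/// Hypothetical helper for illustration only; it is not part of
/// encoding_rs_io and makes no assumption about how `reader` was built.
pub fn yields_valid_utf8<R: Read>(mut reader: R) -> io::Result<bool> {
    let mut bytes = Vec::new();
    // Collect the whole stream so a multi-byte sequence split across
    // two reads is still validated as one unit.
    reader.read_to_end(&mut bytes)?;
    Ok(std::str::from_utf8(&bytes).is_ok())
}
```

Assuming `new_utf8_reader` is defined as in the snippets above, calling `yields_valid_utf8(new_utf8_reader(data))` on samples with a UTF-16 BOM, a UTF-8 BOM, no BOM, and deliberately invalid bytes would be expected to return `Ok(true)` every time if transcoding is really forced. Passing the raw bytes directly (a `&[u8]` also implements `Read`) gives a baseline showing which samples were invalid to begin with.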