rust-encoding

Readers?

Open marcusklaas opened this issue 9 years ago • 10 comments

It would be convenient to have an object that implements Read, so one could for example easily and efficiently read from a file in an encoding other than utf-8.

marcusklaas avatar Oct 27 '15 10:10 marcusklaas

It sounds like you want something that implements not std::io::Read (which is a stream of bytes) but another trait for a Unicode stream. But as discussed in this RFC: https://github.com/rust-lang/rfcs/pull/57, doing it for reading is tricky. The byte-oriented Read::read takes a &mut [u8] argument, writes to it, and returns the number of bytes written. Doing the same with &mut str is harder: the contents of a str must always be well-formed UTF-8, so the buffer might need zeroing or some other fix-up after a partial write.

I’m experimenting with things that could help here. I’ll post again when there’s something more fully formed to show.

SimonSapin avatar Oct 27 '15 10:10 SimonSapin
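
To illustrate the point about byte reads versus str: a single Read::read call may stop in the middle of a multi-byte UTF-8 sequence, so the bytes it returns cannot always be viewed as a str on their own. The following is a minimal sketch using only the standard library; the helper name chunk_is_valid_utf8 is hypothetical and is not part of rust-encoding.

```rust
use std::io::Read;

/// Hypothetical helper: performs one `Read::read` call into a buffer of
/// `chunk_len` bytes and reports whether the bytes returned by that single
/// call are well-formed UTF-8 on their own.
pub fn chunk_is_valid_utf8<R: Read>(source: &mut R, chunk_len: usize) -> std::io::Result<bool> {
    let mut buf = vec![0u8; chunk_len];
    // `read` fills a prefix of `buf` and returns how many bytes it wrote.
    let n = source.read(&mut buf)?;
    // A chunk that ends inside a multi-byte sequence fails validation.
    Ok(std::str::from_utf8(&buf[..n]).is_ok())
}
```

With the two-byte input "é", a one-byte chunk is not valid UTF-8 while a two-byte chunk is, which is why a str-based read would need extra care at buffer boundaries.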

Sorry for my vague description. I meant some kind of adapter between a stream of bytes in, for example, Windows-1252 and a stream of bytes in UTF-8. A Unicode stream would be very nice, but there's a lot of code that already works with std::io::Read.

marcusklaas avatar Oct 27 '15 10:10 marcusklaas

That sounds like it could be built on top of "raw" decoders.

SimonSapin avatar Oct 27 '15 11:10 SimonSapin

… probably with an impl of encoding::types::StringWriter for &mut [u8], to be used with the argument to Read::read.

SimonSapin avatar Oct 27 '15 11:10 SimonSapin

Any progress? Anything changed since last time that would make it easier?

bbigras avatar Aug 13 '16 23:08 bbigras

I just came across the same myself. Would this be something that is in the scope of the crate?

mitsuhiko avatar Dec 11 '16 18:12 mitsuhiko

I have to write these impls for a project of mine and would also like to hear whether @lifthrasiir thinks they might be in scope for this crate.

I've also started a conversation on the encoding_rs crate: https://github.com/hsivonen/encoding_rs/issues/8

BurntSushi avatar Mar 08 '17 13:03 BurntSushi

To cross-pollinate a bit here from the encoding_rs crate... @SimonSapin and I worked on our own versions of Read trait implementations (except @SimonSapin did quite a bit more!). @SimonSapin's work is in this PR: https://github.com/hsivonen/encoding_rs/pull/9. My work is here: https://github.com/BurntSushi/ripgrep/blob/75f1855a91ca00b5d0e62740595b1b91bc5142a2/src/decoder.rs

The big idea here is that implementing these traits is quite tricky, and neither of our implementations is fully correct. Mine gets most of the way there, but doesn't support single-byte reads, which means the Read::bytes adapter method doesn't work at all. It's possible to make this work, but it requires a bit more book-keeping.

BurntSushi avatar Mar 13 '17 11:03 BurntSushi
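
As a rough sketch of the book-keeping that single-byte reads require, here is a toy Read adapter written against the standard library only. It is not the code from either linked implementation. To stay self-contained it assumes ISO-8859-1 input (each byte maps to the code point of the same value) as a stand-in for Windows-1252, which would additionally need a lookup table for 0x80 to 0x9F. The type name Latin1ToUtf8 and its pending field are hypothetical.

```rust
use std::io::{self, Read};

/// Hypothetical adapter: reads ISO-8859-1 bytes from `inner` and yields the
/// same text as UTF-8 bytes through `Read`.
pub struct Latin1ToUtf8<R> {
    inner: R,
    /// Second UTF-8 byte of a character whose first byte was already handed
    /// out because the caller's buffer was full.
    pending: Option<u8>,
}

impl<R: Read> Latin1ToUtf8<R> {
    pub fn new(inner: R) -> Self {
        Latin1ToUtf8 { inner, pending: None }
    }
}

impl<R: Read> Read for Latin1ToUtf8<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if out.is_empty() {
            return Ok(0);
        }
        // Hand out a leftover byte first; a short read is always permitted.
        if let Some(byte) = self.pending.take() {
            out[0] = byte;
            return Ok(1);
        }
        // Each input byte becomes at most two output bytes, so reading
        // ceil(out.len() / 2) input bytes overflows `out` by at most one byte.
        let mut input = [0u8; 1024];
        let want = ((out.len() + 1) / 2).min(input.len());
        let n = self.inner.read(&mut input[..want])?;
        let mut written = 0;
        for &byte in &input[..n] {
            let mut utf8 = [0u8; 2];
            let encoded = char::from(byte).encode_utf8(&mut utf8);
            for &b in encoded.as_bytes() {
                if written < out.len() {
                    out[written] = b;
                    written += 1;
                } else {
                    // Only the very last output byte can land here.
                    self.pending = Some(b);
                }
            }
        }
        Ok(written)
    }
}
```

Wrapping a source in Latin1ToUtf8::new and calling read_to_string, or iterating with bytes(), should then produce the UTF-8 form of the input even when the caller's buffer is a single byte. A multi-byte source encoding would need more state than one Option<u8>, since incomplete input sequences would also have to be carried between calls.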

I wonder if the traits are misdesigned for non-UTF-8 usage. It's weird that they work with both strings and bytes.

mitsuhiko avatar Mar 13 '17 12:03 mitsuhiko

In my case, I very much wanted to avoid ever materializing a &str and the costs associated with it. So operating on &[u8] is perfect.

BurntSushi avatar Mar 13 '17 12:03 BurntSushi