[work-in-progress] Decoding BufReader implementation
Very, very work-in-progress. Doesn't compile yet. Only posting it for transparency.
Did you consider using encoding_rs_io?
Never came across it, no. I'll take a look.
This library looks nice. There are a few limitations we'll need to think about working around (or not), but it does enough of the right things that I'll use it for now and we can figure out the rest later.
- getting a raw BOM back from the reader - you can ignore BOM handling entirely, in which case the BOM is returned but not used for sniffing, or you can use it for sniffing, in which case it is removed from the stream (because otherwise the output wouldn't be valid)
- manually setting the decoder to a particular encoding variant mid-stream based on the `encoding` attribute - it does its own sniffing at runtime but doesn't provide a way to change the encoding manually after building the decoder
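To make the sniffing tradeoff concrete, here is a minimal, dependency-free sketch (not encoding_rs_io's actual API, just an illustration): detect which encoding a BOM implies and report how many bytes to skip so the BOM never reaches the parser.

```rust
/// Hypothetical helper: inspect the first bytes of a stream, report the
/// encoding a BOM implies (if any), and the number of bytes to skip so
/// the BOM is not passed along as document content.
fn sniff_bom(bytes: &[u8]) -> (Option<&'static str>, usize) {
    match bytes {
        [0xEF, 0xBB, 0xBF, ..] => (Some("UTF-8"), 3),
        [0xFF, 0xFE, ..] => (Some("UTF-16LE"), 2),
        [0xFE, 0xFF, ..] => (Some("UTF-16BE"), 2),
        _ => (None, 0),
    }
}

fn main() {
    let doc = b"\xEF\xBB\xBF<?xml version=\"1.0\"?><root/>";
    let (encoding, skip) = sniff_bom(doc);
    assert_eq!(encoding, Some("UTF-8"));
    // If the BOM is used for sniffing, only the bytes after it may be
    // decoded, otherwise the output would start with a stray U+FEFF.
    assert!(doc[skip..].starts_with(b"<?xml"));
}
```

This is essentially the choice the library forces: either the BOM participates in detection and is consumed, or it stays in the stream as data.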
However I'll repeat that I don't see any great ways to accomplish this without having separate structs as in the other PR. Reader is generic over R, and R needs to be decoded and buffered, which pushes the burden of doing so onto the user, because Reader can't wrap the inner R in a decoder or buffer without adding overhead when working with slices.
`Reader` is generic over `R`, and `R` needs to be decoded and buffered
But after this PR we should perform decoding only for the buffered case (because the borrowed case will read from &str, right?), and we already require the user to provide its own buffer for event data:
https://github.com/tafia/quick-xml/blob/ad57bc29e8d9b10309748ce362fffc3580322cf5/src/reader/buffered_reader.rs#L62
We should just change the type of that buffer to a String and add a small internal buffer for holding bytes if we stop in the middle of a char. Maybe that buffer is even already managed by the encoding_rs crate.
Even if we allowed reading from &[u8] using the borrowed methods, we could just return an error if we couldn't read an event without allocation (for decoding or unescaping), as serde does in similar cases.
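The split-char case mentioned above can be sketched with just the standard library (function names are hypothetical): decoded bytes go into the `String` event buffer, while any trailing bytes of an incomplete UTF-8 character wait in a small carry-over buffer for the next read.

```rust
use std::str;

/// Append `chunk` to the event buffer, holding back a trailing incomplete
/// UTF-8 sequence in `carry` until later reads complete it. (A real
/// implementation would also report genuinely invalid byte sequences.)
fn decode_into(buf: &mut String, carry: &mut Vec<u8>, chunk: &[u8]) {
    carry.extend_from_slice(chunk);
    match str::from_utf8(&carry[..]) {
        Ok(s) => {
            buf.push_str(s);
            carry.clear();
        }
        Err(e) => {
            // Keep only the bytes after the last complete character.
            let valid = e.valid_up_to();
            buf.push_str(str::from_utf8(&carry[..valid]).unwrap());
            carry.drain(..valid);
        }
    }
}

fn main() {
    let (mut buf, mut carry) = (String::new(), Vec::new());
    // "é" is 0xC3 0xA9 in UTF-8; split it across two reads.
    decode_into(&mut buf, &mut carry, b"caf\xC3");
    assert_eq!(buf, "caf");
    assert_eq!(carry, [0xC3]);
    decode_into(&mut buf, &mut carry, b"\xA9");
    assert_eq!(buf, "caf\u{e9}");
    assert!(carry.is_empty());
}
```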
But after this PR we should perform decoding only for the buffered case (because the borrowed case will read from &str, right?)
Yes, but that's not really the issue I'm describing. Basically, when you have `Reader<&str>` on one hand but also `Reader<BufReader<DecodeReaderBytes<File>>>` on the other, constructors that work for `Reader<R>` become a mess because they're no longer creating `Reader<R>` but rather `Reader<BufReader<DecodeReaderBytes<R>>>`. And you don't want to change the actual struct, because there's only one of them, and adding `BufReader<DecodeReaderBytes<R>>` to the struct would introduce overhead for slices, where it doesn't need to exist.
`Reader::from_reader(reader: R)` and `NsReader::from_reader(reader: R)` are where it is most painful. I'm still trying to work out how to make everything line up, if that's possible. edit: I'm going to sleep now, but feel free to take a crack at it if you want; it may help to understand the motivations behind the separate struct change.
and we already require the user to provide its own buffer for event data: We should just change the type of that buffer to a String
That's the first step, yes, although I'm not sure if it's the ideal end point - the buffer could just as well be internal rather than user-provided. I'm not sure of the lifetime implications of that yet but it would be nice to just use read_event() everywhere, have it borrow from the buffer inside the reader, and avoid forcing the user to manage the buffer by calling .clear() manually in the right place. For memory use purposes though, we might want to provide some way for the user to shrink the buffer if they desire.
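The lifetime shape in question looks roughly like this (all types here are simplified stand-ins, not the real quick-xml API): if the event borrows from a buffer stored inside the reader, the borrow checker itself enforces that the previous event is gone before the next read, which is exactly what makes a manual `.clear()` unnecessary.

```rust
/// Simplified stand-in for a reader owning its event buffer.
struct Reader {
    buf: String,
}

/// Simplified stand-in for an event borrowing decoded text.
struct Event<'a> {
    text: &'a str,
}

impl Reader {
    /// The returned event borrows `self`, so the caller must drop it
    /// before calling `read_event()` again; clearing happens internally.
    fn read_event(&mut self) -> Event<'_> {
        self.buf.clear();
        self.buf.push_str("text"); // stand-in for actual parsing
        Event { text: &self.buf }
    }

    /// Escape hatch for memory-conscious users to shrink the buffer.
    fn shrink_buffer(&mut self) {
        self.buf.shrink_to_fit();
    }
}

fn main() {
    let mut reader = Reader { buf: String::new() };
    {
        let event = reader.read_event();
        assert_eq!(event.text, "text");
    } // `event` dropped here, releasing the borrow on `reader`
    reader.shrink_buffer();
}
```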
and add a small internal buffer for holding bytes if we stop in the middle of a char. Maybe that buffer is even already managed by the encoding_rs crate.
It's managed by the encoding_rs_io crate; we don't need to be concerned about it if we're using that. `read()` will only ever return UTF-8.
Codecov Report
Merging #441 (10038ca) into master (90be3ee) will increase coverage by 0.69%. The diff coverage is 83.63%.
```diff
@@            Coverage Diff             @@
##           master     #441      +/-   ##
==========================================
+ Coverage   52.12%   52.81%   +0.69%
==========================================
  Files          28       28
  Lines       13480    13392      -88
==========================================
+ Hits         7026     7073      +47
+ Misses       6454     6319     -135
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 52.81% <83.63%> (+0.69%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| benches/microbenches.rs | 0.00% <0.00%> (ø) | |
| examples/custom_entities.rs | 0.00% <0.00%> (ø) | |
| examples/nested_readers.rs | 0.00% <0.00%> (ø) | |
| examples/read_buffered.rs | 0.00% <0.00%> (ø) | |
| examples/read_texts.rs | 0.00% <0.00%> (ø) | |
| src/errors.rs | 2.08% <0.00%> (+0.01%) | :arrow_up: |
| src/reader/buffered_reader.rs | 69.89% <0.00%> (-6.25%) | :arrow_down: |
| src/writer.rs | 48.30% <0.00%> (-0.28%) | :arrow_down: |
| src/de/var.rs | 75.00% <11.11%> (-10.11%) | :arrow_down: |
| src/de/escape.rs | 19.84% <40.00%> (-1.21%) | :arrow_down: |
| ... and 26 more | | |
constructors that work for `Reader<R>` become a mess because they're no longer creating `Reader<R>` but rather `Reader<BufReader<DecodeReaderBytes<R>>>`.
Hmm, why should they? If you implement a constructor for Reader<BufReader<DecodeReaderBytes<R>>>, it will not exist for Reader<&str>. If the generic constructors in Reader<R> become useless, then make them private:
```rust
impl<R> Reader<R> {
    // former from_reader()
    fn new(reader: R) -> Self {...}
}
impl<'i> Reader<&'i str> {
    pub fn from_str(string: &'i str) -> Self {...}
}
impl<R: BufRead> Reader<BufReader<DecodeReaderBytes<R>>> {
    pub fn from_reader(reader: R) -> Self {...}
}
```
I'm not sure of the lifetime implications of that yet but it would be nice to just use `read_event()` everywhere, have it borrow from the buffer inside the reader, and avoid forcing the user to manage the buffer by calling `.clear()` manually in the right place.
But @tafia designed that interface specifically to not store the buffer inside and to allow the end user to manage the buffer in the way they need. I think we should preserve this behavior. This design is still good because, in the end, the event is not tied to the Reader that produced it. That design should probably also help with solving some usage problems, such as #360, which could become painful if we start borrowing from the reader.
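For contrast, the current user-managed design can be reduced to this shape (again with stand-in types, not the real API): because the event borrows the caller's buffer rather than the reader, an event can stay alive across subsequent reads into other buffers.

```rust
/// Simplified stand-in for the reader; it owns no event buffer.
struct Reader;

/// Simplified stand-in for an event borrowing decoded text.
struct Event<'a> {
    text: &'a str,
}

impl Reader {
    /// The returned event is tied to `buf`, not to `self`, so the reader
    /// is free again as soon as this call returns.
    fn read_event_into<'b>(&mut self, buf: &'b mut String) -> Event<'b> {
        buf.push_str("text"); // stand-in for actual parsing
        Event { text: buf }
    }
}

fn main() {
    let mut reader = Reader;
    let (mut buf1, mut buf2) = (String::new(), String::new());
    let first = reader.read_event_into(&mut buf1);
    // `first` borrows `buf1`, not `reader`, so the reader can go on to
    // produce another event into a different buffer while `first` lives.
    let second = reader.read_event_into(&mut buf2);
    assert_eq!(first.text, "text");
    assert_eq!(second.text, "text");
}
```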
Regarding the encoding_rs_io problems, I see several ways we could handle them:

- do not use encoding_rs_io, but write our own implementation
- make PR(s) in encoding_rs_io that would help to accomplish our goals
- determine the encoding without encoding_rs_io, and when we read the Decl event, switch to encoding_rs_io for reading

The second choice seems the most promising to me.
I'm not sure that we should remove the StartText event, because it can still be useful even if it will not contain the BOM. The text before the declaration is not considered part of the XML model, so it should be treated differently; for example, unescaping that text has no meaning.
Even in that case, though, one potential mitigation path would be to ignore such text by default (as in, skip emitting the event), which would be pretty easy to do since we already treat the first event differently. And if we decided to make it configurable, the user in that case would have to know what they're doing and would likely accept a few minor usage quirks (because they would be explicitly working with non-standard documents).
I'm not sure it's worth being too concerned about since the number of documents with this problem - and not just whitespace or newlines but significant data - is probably quite small. Right now, the only complaint has been about the BOM causing trouble, and that will be resolved without needing StartText anymore. Maybe StartText could make sense in the future, but I feel like we should have a user making a solid case for it before we add the complexity.
Does that sound reasonable?
Because almost all supported encodings are one-byte encodings, we could create XML files with all possible symbols (probably except those explicitly forbidden by the XML standard). The national symbols should be present in all possible places (tag names, namespace names, attribute names and values, text and CDATA content, comments, processing instructions).
Only after that, I believe, should we make any improvements in encoding support.
We should probably do that before we declare perfect support for all these encodings, but I disagree that it ought to block the ergonomic and architectural improvements that are the focus of this PR. I'm not claiming that all encodings will be 100% supported immediately after this PR (although the situation will likely be better than it is currently).
I don't know that putting every possible symbol in every possible place in the XML is actually going to be worthwhile, since the whole goal here is to completely separate parsing from decoding and make sure the parser only ever sees valid UTF-8. Fuzzing the parser with crazy (but valid) combinations of UTF-8 makes sense, and fuzzing the decoder while validating that the output is valid UTF-8 makes sense, but those are separate concerns.
I agree that encoding detection should be a focus though, especially because unlike the decoding itself it does have tie-ins with the reader / parser.
detecting encoding (I found that we do not do that correctly; I plan to look at this before release. I have already made some investigations and improvements, but have not yet proved them with tests)
Details?
The basic functionality is pretty much working. TODO:

- [ ] encoding detection regression for UTF-16 (failing tests)
- [ ] async support in `Utf8BytesReader`
- [ ] UTF-8 validation inside `Utf8BytesReader` when the `encoding` feature is turned off
- [ ] some other cleanup
Async support is the real pain unfortunately.
Status update: I'm still working on this, but making PRs against https://github.com/BurntSushi/encoding_rs_io/
And also https://github.com/hsivonen/encoding_rs/issues/88
I'm not sure how long I'm going to be blocked on that, so if there is a portion of the work you think could be split off and merged, I'd appreciate that. Otherwise I'll probably work on the attribute normalization PR.
So, as a practical matter I now think we need to strip the BOM in all situations, no matter what.
Not because there's anything invalid about it at a purely technical level, but because, buried deep on page 157 of 1054 of the Unicode standard, it says that for UTF-16 the BOM "should not be considered part of the text". It actually says nothing to that effect for UTF-8, and in fact implies the opposite, but that line appears to have been the basis for encoding_rs removing the BOM from the stream for both UTF-16 and UTF-8.
Realistically, I don't think we should cut against the grain of what encoding_rs is doing; it's probably a sane decision for consistency purposes anyway, and we need the same output in both the encoding and non-encoding cases - so that means the BOM should never be present.
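Concretely, once both paths are required to produce BOM-free output, the non-decoding path only needs an unconditional strip of the UTF-8 BOM before parsing (a sketch, not the actual implementation):

```rust
/// Remove a leading UTF-8 BOM, if present, so the parser sees identical
/// input whether or not the stream went through a decoder (which strips
/// the BOM itself).
fn strip_utf8_bom(input: &[u8]) -> &[u8] {
    if input.starts_with(b"\xEF\xBB\xBF") {
        &input[3..]
    } else {
        input
    }
}

fn main() {
    assert_eq!(strip_utf8_bom(b"\xEF\xBB\xBF<root/>"), b"<root/>");
    assert_eq!(strip_utf8_bom(b"<root/>"), b"<root/>");
}
```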
Do what you think is right; just explain the decisions in comments in the code
@mingun I'm just going to rip out any changes to the serde code and do that absolutely last, so I don't need to keep redoing it
It's been a month without any further progress on getting https://github.com/BurntSushi/encoding_rs_io/pull/11 merged, so I'm just going to vendor the code internally for now. It's not that much code and he was very skeptical about merging async support anyway, which we're kind of obligated to support now.