text icon indicating copy to clipboard operation
text copied to clipboard

`streamDecodeUtf8` and `streamDecodeUtf8With` defeat the purpose of the `ByteString` field in `Some Text ByteString (ByteString -> Decoding)`

Open michaelt opened this issue 10 years ago • 7 comments

This should perhaps be a comment under the heading of https://github.com/bos/text/pull/55 . The difficulties are outlined a bit laboriously beginning around here https://github.com/michaelt/text-pipes/issues/1#issuecomment-28099431 .

The difficulty can I think be summed up this way: the user cannot detect from Some txt bs f whether bs is the beginning of what might end happily as a decent bit of Text -- and is thus a fragment already represented in the continuation f -- or whether instead bs begins with some genuinely uninterpretable bytes. For this reason the field is actually useless.

Because of this we cannot write a (seemingly sensible) function decodeUtf8 :: Producer ByteString m r -> Producer Text m (Producer ByteString m r) which would convert a stream of ByteString chunks into a stream of Text chunks -- but stop when it hits any unsalvageable non-Text and return the remaining ByteString producer for further scrutiny. In the end one is still reduced to a construction like the one in Conduit.Text http://hackage.haskell.org/package/conduit-1.0.9.3/docs/src/Data-Conduit-Text.html#byteSplits

While trying to figure it out I made an uninspired variant of the Decode type -- data Decode = Some Text ByteString (ByteString -> Decode) | Other Text ByteString and rang corresponding changes in the definition of streamDecodeWith. https://github.com/michaelt/text/commit/942e4c894fbb6cafd8aafdc5c5f148e85ac66dbf#diff-1c42d239accd16adcae29f591c519ac6 I was able to convince myself that all the requisite information could be recovered by a corresponding property in the test suite.

I hope this is more or less lucid.

michaelt avatar Dec 08 '13 21:12 michaelt

The behvaiour in the case of uninterpretable bytes (i.e. invalid UTF-8) is the same for streamDecodeUtf8 as for decodeUtf8: a UnicodeException is thrown, either by the function itself or – if the bad data is in a followup chunk – by its continuation.

Is this the behaviour you are taking exception (pardon the pun) to? I gather that you'd prefer to have another function (streamDecodeUtf8 has escaped into the wild now, so we don't get to change its API) that gives you more information, perhaps Either DecodeError Decoding.

bos avatar Jan 02 '14 22:01 bos

As I say at end of the following, it might be that this should be relabeled as a request for a function decodeSomeUtf8 :: ByteString -> (Text,ByteString) defined with something like c_decode_utf8_with_state and without recourse to the exceptions machinery. But the following is the remark that led me to that thought:

Right, that the behavior is the same as with decodeUtf8 is exactly the point. If I want to find the initial segment of a ByteString that is intelligible as utf8 encoded Text and distinguish it from the rest (whether or not the latter is given in some sort of exception), I must do what the stream libraries typically do, aptly called 'splitSlowly' in the enumerator package http://hackage.haskell.org/package/enumerator-0.4.20/docs/src/Data-Enumerator-Text.html#splitSlowly -- i.e. apply decodeUtf8 to every initial segment of the bytestring, until I find one that works. The same operation is necessary with the new streamDecodeUtf8. (That it is necessary to break at the last Word8 where things were intelligible, rather than the last intelligible chunk, follows from the idea of streaming text where everything must be independent of the way things are chunked.)

To judge from the discussion of https://github.com/bos/text/pull/55, though, the purpose of the distinction of fields Text and ByteString in the Decoding type was in part to evade something like splitSlowly. I think the distinction is not quite adequate for that, though it is easy to do with the materials (almost) present in Data.Text.Encoding. --Or maybe I'm wrong and that wasn't the purpose and this is a request for further functionality.

The more informative function would need a type more like ByteString -> Either Decoding Decoding or ByteString -> Either (Text,ByteString) Decoding or ByteString -> (Bool,Decoding) . The trouble is that Some already has ByteString -> Decoding as one of its fields. I will try to think what would work. I think you would know immediately if you were convinced of the desirability of avoiding something like splitSlowly .

Here is the rearranged variant of Data.Text.Encoding I was considering using for Pipes.Text.Internal https://gist.github.com/michaelt/8243401 . (It is very straightforward since it drops the exceptions machinery.) It is followed by the relevant pipes use of it. With the real streamDecodeUtf8 the definition would be similar, but would involve a detour using decodeUtf8 and splitSlowly. If there were also a decodeSomeUtf8 function this could be done more sensibly. I guess I will try to write one.

michaelt avatar Jan 03 '14 20:01 michaelt

I made a decodeSomeUtf8 :: ByteString -> (Text,ByteString) and a quickcheck test, for what it's worth https://github.com/michaelt/text/commit/e8e818d5c571aee7bcd6afd62b6459f1d11540aa . It uses the existing model, which is presumably unnecessarily complicated; it only has need of the fact that the imported C function 'backs up' to the point of breakdown.

The resulting pipes decodeUtf8 definition of course uses unsafePerformIO , since it has to be pure, or rather monad-indifferent. Here's what I came up with, not that I suppose you are familiar with the pipes business, but you can see the difference from the previous gist: https://gist.github.com/michaelt/8243401#file-decodeutf8ii-hs

michaelt avatar Jan 04 '14 00:01 michaelt

I've run into a number of bugs in conduit's textual decoding due to mismatches in behavior to the homegrown functions @michaelt linked to and text's behavior. Today, I decided to finally bite the bullet and rewrite this using the text codebase. The result is at:

https://github.com/snoyberg/conduit/blob/f15e5b8706a47d8aa419922fb45b12d06e22ad91/conduit/Data/Text/StreamDecoding.hs

I'm no expert on the internals of text, so I'm a little hesitant to start using this code. It passes all of the tests I've thrown at it, but given recent blog posts, I don't think that's sufficient. Said another way: @bos: I'd really appreciate a code review of this.

The other question is where to go from here. I think ideally, this code (or some form of it) could be included in text. It would be generally usable by all streaming data frameworks.

If @bos doesn't want this included, then either (1) it could continue to live in conduit as it is right now, and be used only by conduit, or (2) if there's interest in other packages using it, I can separate out the code to a separate package. One advantage to doing this as a separate package rather than inside text is that the separate package can maintain compatibility with older versions of text, especially those used by the existing Haskell Platform.

snoyberg avatar Feb 12 '14 11:02 snoyberg

This is awesome. I made some quickcheck tests for streamUtf8 using the genUnicode I got from somewhere in the text test material. It may not be too efficient at rooting out disaster; I haven't thought about it much yet. It's based on things I had trouble with making the existing draft of pipes-text.

The main test adds random bytes at the end of a utf8-encoded text; the result, X, gets chunked at random, then empty bytestrings are sometimes interpolated; then we see if streamUtf8 followed by encodeUf8 can give us back X.

The three properties, for what they may be worth, have passed 10,000,000 tests, which is pleasing. I will try to think how to make it more rational, and how to extend it to the other encodings if you like.

michaelt avatar Feb 13 '14 17:02 michaelt

Sorry, of course I forgot to link the tests https://gist.github.com/michaelt/8979818

michaelt avatar Feb 13 '14 17:02 michaelt

I'm glad you commented, I forgot to update this issue.

I've factored the code out into its own library: https://github.com/fpco/text-stream-decode. I'd be very happy to include additional tests, like the ones you provided. If you want, I can give you commit access to that repo, and you can augment the existing test suite. You may also have some ideas on improvements to the existing codebase.

snoyberg avatar Feb 13 '14 17:02 snoyberg

I believe the original problem is largely fixed in #448 by adding a more powerful streaming interface (decodeUtf8Chunk and friends) and can be closed. @Lysxia what do you think?

Bodigrim avatar Feb 21 '24 00:02 Bodigrim