`streamDecodeUtf8` and `streamDecodeUtf8With` defeat the purpose of the `ByteString` field in `Some Text ByteString (ByteString -> Decoding)`
This should perhaps be a comment under the heading of https://github.com/bos/text/pull/55 . The difficulties are outlined a bit laboriously beginning around here https://github.com/michaelt/text-pipes/issues/1#issuecomment-28099431 .
The difficulty can, I think, be summed up this way: the user cannot detect from `Some txt bs f` whether `bs` is the beginning of what might end happily as a decent bit of Text -- and is thus a fragment already represented in the continuation `f` -- or whether instead `bs` begins with some genuinely uninterpretable bytes. For this reason the field is actually useless.
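To make the ambiguity concrete, here is a minimal session using `streamDecodeUtf8` from `Data.Text.Encoding` (the input bytes are the example from the library's own documentation). The leftover field holds `"\226\152"`, which here happens to be a salvageable prefix of U+2603 -- but nothing in the `Some` value itself tells you that, as opposed to its being the start of garbage:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text.Encoding (streamDecodeUtf8, Decoding (..))

main :: IO ()
main =
  -- "\226\152\131" is U+2603 (snowman); the trailing "\226\152" is a
  -- valid-so-far but unfinished three-byte sequence.
  case streamDecodeUtf8 "\226\152\131\226\152" of
    Some decoded leftover _continue ->
      print (decoded, leftover)  -- ("\9731","\226\152")
```

Inspecting `leftover` alone cannot distinguish "unfinished prefix, keep feeding the continuation" from "invalid bytes, stop here"; that distinction is exactly what the discussion below is about.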
Because of this we cannot write a (seemingly sensible) function `decodeUtf8 :: Producer ByteString m r -> Producer Text m (Producer ByteString m r)` which would convert a stream of `ByteString` chunks into a stream of `Text` chunks -- but stop when it hits any unsalvageable non-Text and return the remaining `ByteString` producer for further scrutiny. In the end one is still reduced to a construction like the one in Conduit.Text http://hackage.haskell.org/package/conduit-1.0.9.3/docs/src/Data-Conduit-Text.html#byteSplits
While trying to figure it out I made an uninspired variant of the `Decoding` type -- `data Decode = Some Text ByteString (ByteString -> Decode) | Other Text ByteString` -- and rang the corresponding changes in the definition of `streamDecodeUtf8With`: https://github.com/michaelt/text/commit/942e4c894fbb6cafd8aafdc5c5f148e85ac66dbf#diff-1c42d239accd16adcae29f591c519ac6 I was able to convince myself that all the requisite information could be recovered, by a corresponding property in the test suite.
I hope this is more or less lucid.
The behaviour in the case of uninterpretable bytes (i.e. invalid UTF-8) is the same for `streamDecodeUtf8` as for `decodeUtf8`: a `UnicodeException` is thrown, either by the function itself or -- if the bad data is in a follow-up chunk -- by its continuation.
Is this the behaviour you are taking exception (pardon the pun) to? I gather that you'd prefer to have another function (`streamDecodeUtf8` has escaped into the wild now, so we don't get to change its API) that gives you more information, perhaps `Either DecodeError Decoding`.
As I say at the end of the following, it might be that this should be relabeled as a request for a function `decodeSomeUtf8 :: ByteString -> (Text, ByteString)` defined with something like `c_decode_utf8_with_state` and without recourse to the exceptions machinery. But the following is the remark that led me to that thought:
Right, that the behavior is the same as with `decodeUtf8` is exactly the point. If I want to find the initial segment of a `ByteString` that is intelligible as UTF-8-encoded `Text`, and distinguish it from the rest (whether or not the latter is given in some sort of exception), I must do what the stream libraries typically do, aptly called `splitSlowly` in the enumerator package http://hackage.haskell.org/package/enumerator-0.4.20/docs/src/Data-Enumerator-Text.html#splitSlowly -- i.e. apply `decodeUtf8` to every initial segment of the bytestring until I find one that works. The same operation is necessary with the new `streamDecodeUtf8`. (That it is necessary to break at the last `Word8` where things were intelligible, rather than the last intelligible chunk, follows from the idea of streaming text, where everything must be independent of the way things are chunked.)
To judge from the discussion of https://github.com/bos/text/pull/55, though, the purpose of distinguishing the `Text` and `ByteString` fields in the `Decoding` type was in part to evade something like `splitSlowly`. I think the distinction is not quite adequate for that, though it is easy to do with the materials (almost) present in `Data.Text.Encoding`. -- Or maybe I'm wrong, and that wasn't the purpose, and this is a request for further functionality.
The more informative function would need a type more like `ByteString -> Either Decoding Decoding` or `ByteString -> Either (Text, ByteString) Decoding` or `ByteString -> (Bool, Decoding)`. The trouble is that `Some` already has `ByteString -> Decoding` as one of its fields. I will try to think what would work. I think you would know immediately if you were convinced of the desirability of avoiding something like `splitSlowly`.
Here is the rearranged variant of `Data.Text.Encoding` I was considering using for `Pipes.Text.Internal`: https://gist.github.com/michaelt/8243401 . (It is very straightforward since it drops the exceptions machinery.) It is followed by the relevant pipes use of it. With the real `streamDecodeUtf8` the definition would be similar, but would involve a detour using `decodeUtf8` and `splitSlowly`. If there were also a `decodeSomeUtf8` function this could be done more sensibly. I guess I will try to write one.
I made a `decodeSomeUtf8 :: ByteString -> (Text, ByteString)` and a QuickCheck test, for what it's worth: https://github.com/michaelt/text/commit/e8e818d5c571aee7bcd6afd62b6459f1d11540aa . It uses the existing model, which is presumably unnecessarily complicated; it only has need of the fact that the imported C function 'backs up' to the point of breakdown.
The resulting pipes `decodeUtf8` definition of course uses `unsafePerformIO`, since it has to be pure, or rather monad-indifferent. Here's what I came up with, not that I suppose you are familiar with the pipes business, but you can see the difference from the previous gist: https://gist.github.com/michaelt/8243401#file-decodeutf8ii-hs
I've run into a number of bugs in conduit's textual decoding due to mismatches between the behavior of the homegrown functions @michaelt linked to and `text`'s behavior. Today, I decided to finally bite the bullet and rewrite this using the `text` codebase. The result is at:
https://github.com/snoyberg/conduit/blob/f15e5b8706a47d8aa419922fb45b12d06e22ad91/conduit/Data/Text/StreamDecoding.hs
I'm no expert on the internals of `text`, so I'm a little hesitant to start using this code. It passes all of the tests I've thrown at it, but given recent blog posts, I don't think that's sufficient. Said another way: @bos: I'd really appreciate a code review of this.
The other question is where to go from here. I think ideally this code (or some form of it) could be included in `text`. It would be generally usable by all streaming data frameworks.
If @bos doesn't want this included, then either (1) it could continue to live in `conduit` as it is right now, and be used only by `conduit`, or (2) if there's interest from other packages, I can separate the code out into its own package. One advantage of a separate package rather than inclusion in `text` is that the separate package can maintain compatibility with older versions of `text`, especially those used by the existing Haskell Platform.
This is awesome. I made some QuickCheck tests for `streamUtf8` using the `genUnicode` I got from somewhere in the `text` test material. It may not be too efficient at rooting out disaster; I haven't thought about it much yet. It's based on things I had trouble with while making the existing draft of pipes-text.
The main test adds random bytes at the end of a UTF-8-encoded text; the result, X, gets chunked at random, then empty bytestrings are sometimes interpolated; then we see if `streamUtf8` followed by `encodeUtf8` can give us back X.
The three properties, for what they may be worth, have passed 10,000,000 tests, which is pleasing. I will try to think how to make them more rational, and how to extend them to the other encodings if you like.
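The linked gist has the actual tests; as a runnable, much-simplified cousin, here is a property checking only the chunking-invariance half of the idea (valid input, no junk suffix, no empty-chunk interpolation), written against `text`'s own `streamDecodeUtf8` rather than conduit's `streamUtf8`:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, streamDecodeUtf8, Decoding (..))
import Test.QuickCheck (quickCheck)

-- Cut a ByteString at a sequence of arbitrary positions; concatenating
-- the pieces always gives back the original.
chunksAt :: [Int] -> B.ByteString -> [B.ByteString]
chunksAt []     bs = [bs]
chunksAt (n:ns) bs =
  let (a, b) = B.splitAt (abs n `mod` (B.length bs + 1)) bs
  in a : chunksAt ns b

-- Feed the chunks through the stream decoder, threading continuations,
-- and collect all the Text emitted along the way.
feed :: [B.ByteString] -> T.Text
feed = go streamDecodeUtf8
  where
    go _ []     = T.empty
    go k (c:cs) = case k c of
      Some t _leftover k' -> t <> go k' cs

-- The decoded output must not depend on how the bytes were chunked.
prop_chunkInvariant :: String -> [Int] -> Bool
prop_chunkInvariant s ns =
  let t = T.pack s
  in feed (chunksAt ns (encodeUtf8 t)) == t

main :: IO ()
main = quickCheck prop_chunkInvariant
```

This deliberately avoids the invalid-bytes case (where `streamDecodeUtf8` throws), which is exactly the part the gist's junk-suffix tests exercise.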
Sorry, of course I forgot to link the tests https://gist.github.com/michaelt/8979818
I'm glad you commented, I forgot to update this issue.
I've factored the code out into its own library: https://github.com/fpco/text-stream-decode. I'd be very happy to include additional tests, like the ones you provided. If you want, I can give you commit access to that repo, and you can augment the existing test suite. You may also have some ideas on improvements to the existing codebase.
I believe the original problem is largely fixed in #448 by the addition of a more powerful streaming interface (`decodeUtf8Chunk` and friends) and can be closed. @Lysxia, what do you think?