text Allow caller more control in stream decoding

I've been wanting this feature for a while, so I decided to prototype it and create a pull request.

Jun 27 '22 05:06 david-sledge

Rather than STT, safer alternatives would be to specialize this to one specific monad (was state the original motivation?) or to use a different algorithm that does not involve STT.

Jun 27 '22 10:06 Lysxia

Rather than STT, safer alternatives would be to specialize this to one specific monad (was state the original motivation?) or to use a different algorithm that does not involve STT.

State (along with reader) was secondary. My primary motivation was to multiple monads in a stack (I use IO and continuation frequently). Specializing is not ideal, since there'd have to be a different one not just for each monad, but each combination of monads.

Jun 27 '22 13:06 david-sledge

STT is notably unsafe with continuations.

Jun 27 '22 17:06 Lysxia

text is a boot library, it cannot depend on STMonadTrans package, unfortunately.

I'm not very happy with the API of Data.Text.Encoding. The fact that the only way to abort decoding is to raise an error (and catch it elsewhere) is a joke. Ideally it should be possible to drive decoding from without, e. g.,

decodeUtf8With2 :: ByteString -> ByteString -> Either Int (Text, ByteString)

Then clients are free to interpret Int (the position, where decoding failed) as they like, in whatever monad.

Jun 27 '22 17:06 Bodigrim

text is a boot library, it cannot depend on STMonadTrans package, unfortunately.

Removed.

I'm not very happy with the API of Data.Text.Encoding. The fact that the only way to abort decoding is to raise an error (and catch it elsewhere) is a joke. Ideally it should be possible to drive decoding from without, e. g.,
decodeUtf8With2 :: ByteString -> ByteString -> Either Int (Text, ByteString)

Done (though not in that exact manner). Check out the test case t_decode_withM_error5 in tests/Tests/Properties/Transcoding.hs. There other examples, too, for Cont and Maybe. What's more is that the error handlers of the additional functions (the ones ending with M) take the byte position where the error occurred.

Jul 01 '22 05:07 david-sledge

This still has the same unsafety as STT, since it's just an inlining of the STT logic. You can cause a segfault by specializing to ContT and by using a continuation twice. Taking a callback is fundamentally flawed, so that the current implementation cannot simply be generalized to be in an arbitrary monad. Changing the API as bodigrim suggested seems like a better option (you may also remember the part that was decoded so far).

Jul 01 '22 06:07 Lysxia

Ahhh... I see what you mean, and it'd be a problem with the list monad, too. However, I noticed that streamDecodeUtf8With returns a continuation in the Decoding data type. I looked how the risk of calling that continuation multiple times is mitigated. Essentially, the continuation a wholly separate invocation of a function. It calls runST in which the text array is created, populated, frozen, embedded in the Text data type, and returned without being modified any further.

So I applied the same concept around the call to the error handler. Before the error handler is called, it runs through the processes just before exiting; the text array is frozen, no longer modified, and the reference to last state value is discarded. Then after the error handler returns, it simulates runST with a call to runRW#, makes a copy of the text array where it left off, and continues its operations on the copy.

Jul 02 '22 01:07 david-sledge

Changing the API to not take a callback at all and instead return an Either that tells you exactly where it stopped would still be much simpler, safer, and faster. I would prefer to consider that first.

Passing the State# token explicitly like this, while in an arbitrary monad, makes this quite difficult to review with confidence.

Jul 02 '22 07:07 Lysxia

While I understand the desire to use Either, that doesn't really get me closer to my goal, which is why I was going for arbitrary monad which would meet my goal and allow for Either. Since insert-your-favorite-monads-here option has been removed from the table, I found myself at an impasse. So I took a step back, looked at the issue again, and reevaluated the problem.

With the requirements and constraints of

Allow a means to abort decoding without raising an error.
No STT.
No arbitrary monad implementation.
No manually passing the State# token.
Indicate position in source ByteStream where the error occurs.

I think I found a solution that'll fit the bill.

Jul 03 '22 05:07 david-sledge

This is actually close to what we were suggesting, and indeed much more satisfactory.

Jul 03 '22 08:07 Lysxia

Excellent! A couple of questions:

The byte position does not currently track across continuation invocations and instead resets back to zero. What's the opinion on whether it should track or the caller be responsible for tracking the cumulative byte position/total bytes read?
What version should the @since annotation be for streamDecodeUtf8With'? 2.0.1, 2.1.0, or do I not need to worry about it yet?

Jul 03 '22 12:07 david-sledge

For the first question, I went ahead and let the library track the absolute position for the caller.

Jul 03 '22 19:07 david-sledge

Stream decoders added for utf-16 and utf-32, and an Either decoder for ASCII. I'll work on the lazy options next.

Jul 06 '22 20:07 david-sledge

We are getting closer to the final cut off before GHC 9.4. @Lysxia could you please complete your review on this? I don't have much bandwidth at the moment.

Jul 19 '22 19:07 Bodigrim

@Bodigrim How much time is left for the cut off? I'll allocate more cycles to shepherd this PR (sorry for dragging this @david-sledge !) but I feel like this is going to take a couple more rounds.

Jul 19 '22 21:07 Lysxia

I'm not sure exactly. Given that @bgamari has not chased us yet, it would probably take at least a week or more.

Jul 19 '22 21:07 Bodigrim

Given https://github.com/haskell/text/pull/453#issuecomment-1191279987, I'll cut text-2.0.1 from the current master, and this PR will be delayed until text-2.0.2. I'm happy to release text-2.0.2 fairly soon, in time for GHC 9.4.2. Sorry @david-sledge, I hope it's not too much of delay for your purposes.

Jul 21 '22 19:07 Bodigrim

This branch no longer uses simdutf anymore. Should all of it and references to it be removed, or is it worth modifying simdutf to have a determine_utf8_prefix_length function?

Jul 29 '22 02:07 david-sledge

simdutf is there for performance, so if it is no longer used that needs to be justified by benchmarks. I suspect it's going to be difficult to compete with a vectorized implementation.

Jul 29 '22 06:07 Lysxia

I suspect with the issue you pointed out of where the next to last Word8 is invalid, this reimplementation will perform better, because the current implementation of decodeUtf8With2 has the same issue. However, in instances where the entire ByteString is valid, the current implementation will perform better. So one option is to only call c_is_valid_utf8 once at the beginning of each ByteString evaluation. Another option is what I mentioned previously of adding to simdutf a function that counts the number of contiguous valid UTF-8 Word8s at the beginning of a ByteString.

Jul 29 '22 15:07 david-sledge

So one option is to only call c_is_valid_utf8 once at the beginning of each ByteString evaluation.

As I understand it this is what we would accept currently, we want the "valid" case to be as fast as possible (@Bodigrim is that right?), and slowing down in other cases is fine. Modifying the C++ function to also count how far the valid bytes extend so we don't have to loop in Haskell would be a nice bonus, but it's not a requirement.

Jul 29 '22 16:07 Lysxia

As I understand it this is what we would accept currently, we want the "valid" case to be as fast as possible (@Bodigrim is that right?), and slowing down in other cases is fine.

Correct.

Another option is what I mentioned previously of adding to simdutf a function that counts the number of contiguous valid UTF-8 Word8s at the beginning of a ByteString.

That would be awesome, but I don't know how feasible it is. Maybe someone can check it with simdutf developers?

Jul 29 '22 22:07 Bodigrim

The error in the workflow seems to be due to an unrelated test. I've submitted an issue regarding it.

Aug 10 '22 00:08 david-sledge

@david-sledge could you please rebase?

Aug 19 '22 23:08 Bodigrim

Another option is what I mentioned previously of adding to simdutf a function that counts the number of contiguous valid UTF-8 Word8s at the beginning of a ByteString.

That would be awesome, but I don't know how feasible it is. Maybe someone can check it with simdutf developers?

Following up on this, it looks like the simdutf developers had a similar idea. The next version will have validate_utf8_with_errors which returns a struct containing the byte length, and an error code.

Aug 25 '22 16:08 david-sledge

@Lysxia is it good to go?

Aug 25 '22 21:08 Bodigrim

Sorry this is taking so long @david-sledge. Tell me if you don't want to spend more time on this and I can take over.

In DecodeResult, the Maybe w is not worth the complexity of an extra type argument (which is also extra awkward to use). Even if you encode the invariant that the remaining suffix is not empty that way, the types don't enforce that the suffix is in fact invalid, so the calling code will still end up having to handle the vacuous case where it can't find a reason for the error.

Constructing the suffix by appending in the case where the error happens in the first chunk. Just return the index and let the caller decide what to do.

The handling of chunks gets convoluted in trying to deduplicate the logic between the two chunks. In practice, you're unlikely to just have to decode two arbitrary chunks. Ideally, you'd like to handle one chunk at a time, possibly with some residual state from a previous chunk, which contains just a single incomplete code point. In fact, the first chunk could rather be the decoder state at the end of the previous chunk, which also lets you throw the previous chunk away completely before getting to the next chunk.

Stylistic nit: it's easier to read case expressions with disjoint patterns (Just _ | Nothing) rather than wildcards (Just _ | _), because I don't have to look up previous clauses to find out what constructor is being handled.

Ok, I've a design that'll address your points including the quadratic time issue. It'll be a bit of a rewrite, so I'm going to convert this to a draft.

Sep 14 '22 23:09 david-sledge

Sorry this is taking so long @david-sledge. Tell me if you don't want to spend more time on this and I can take over.

I'm actually rather enjoying this. Though I'm finding I have less time to work on it than I did before.

In DecodeResult, the Maybe w is not worth the complexity of an extra type argument (which is also extra awkward to use). Even if you encode the invariant that the remaining suffix is not empty that way, the types don't enforce that the suffix is in fact invalid, so the calling code will still end up having to handle the vacuous case where it can't find a reason for the error.

Removed.

Constructing the suffix by appending in the case where the error happens in the first chunk. Just return the index and let the caller decide what to do.

Errors are now returned as an Int which represents the start position of the erroneous code point.

The handling of chunks gets convoluted in trying to deduplicate the logic between the two chunks. In practice, you're unlikely to just have to decode two arbitrary chunks. Ideally, you'd like to handle one chunk at a time, possibly with some residual state from a previous chunk, which contains just a single incomplete code point. In fact, the first chunk could rather be the decoder state at the end of the previous chunk, which also lets you throw the previous chunk away completely before getting to the next chunk.

Can't throw it completely away. The values themselves need to be preserved in case the next ByteString proves them to be valid. And can't copy it yet, because it could turn out to be invalid. However, I did change the logic so that the code point decoder state at the end of the previous chunk is that starting state of the next. So it doesn't have to go back and look at it again, except to (a) copy it into the Text when it's valid and (b) when it's invalid start evaluating the Word8 at error position + 1 as the start of a code point. Also, the new functions don't construct any new ByteStrings from the input (with one very small exception).

Stylistic nit: it's easier to read case expressions with disjoint patterns (Just _ | Nothing) rather than wildcards (Just _ | _), because I don't have to look up previous clauses to find out what constructor is being handled.

Made the default (_ -> ...) case patterns explicit.

And lastly...

All of these append are going take quadratic time if you try to leniently decode a completely invalid string.

I took into consideration this issue with streamDecodeUtf8With in the redesign.

I'm not yet ready to take out of draft status yet, but there's enough to see the approach I'm taking. Also, I rolled back the changes to UTF-16 and 32 decoders for the moment, and just focused on UTF-8.

Oct 02 '22 15:10 david-sledge

Thanks, I'll have a look this weekend.

Oct 02 '22 18:10 Lysxia

Okay, now it's ready for review.

Oct 06 '22 04:10 david-sledge

text text copied to clipboard

Allow caller more control in stream decoding

text
text copied to clipboard