text
                                
                                
                                
                                    text copied to clipboard
                            
                            
                            
                        Allow caller more control in stream decoding
I've been wanting this feature for a while, so I decided to prototype it and create a pull request.
Rather than STT, safer alternatives would be to specialize this to one specific monad (was state the original motivation?) or to use a different algorithm that does not involve STT.
Rather than STT, safer alternatives would be to specialize this to one specific monad (was state the original motivation?) or to use a different algorithm that does not involve STT.
State (along with reader) was secondary. My primary motivation was to multiple monads in a stack (I use IO and continuation frequently). Specializing is not ideal, since there'd have to be a different one not just for each monad, but each combination of monads.
STT is notably unsafe with continuations.
text is a boot library, it cannot depend on STMonadTrans package, unfortunately.
I'm not very happy with the API of Data.Text.Encoding. The fact that the only way to abort decoding is to raise an error (and catch it elsewhere) is a joke. Ideally it should be possible to drive decoding from without, e. g.,
decodeUtf8With2 :: ByteString -> ByteString -> Either Int (Text, ByteString)
Then clients are free to interpret Int (the position, where decoding failed) as they like, in whatever monad.
textis a boot library, it cannot depend onSTMonadTranspackage, unfortunately.
Removed.
I'm not very happy with the API of
Data.Text.Encoding. The fact that the only way to abort decoding is to raise anerror(andcatchit elsewhere) is a joke. Ideally it should be possible to drive decoding from without, e. g.,decodeUtf8With2 :: ByteString -> ByteString -> Either Int (Text, ByteString)
Done (though not in that exact manner). Check out the test case t_decode_withM_error5 in tests/Tests/Properties/Transcoding.hs. There other examples, too, for Cont and Maybe. What's more is that the error handlers of the additional functions (the ones ending with M) take the byte position where the error occurred.
This still has the same unsafety as STT, since it's just an inlining of the STT logic. You can cause a segfault by specializing to ContT and by using a continuation twice.
Taking a callback is fundamentally flawed, so that the current implementation cannot simply be generalized to be in an arbitrary monad. Changing the API as bodigrim suggested seems like a better option (you may also remember the part that was decoded so far).
Ahhh... I see what you mean, and it'd be a problem with the list monad, too. However, I noticed that streamDecodeUtf8With returns a continuation in the Decoding data type. I looked how the risk of calling that continuation multiple times is mitigated. Essentially, the continuation a wholly separate invocation of a function. It calls runST in which the text array is created, populated, frozen, embedded in the Text data type, and returned without being modified any further.
So I applied the same concept around the call to the error handler. Before the error handler is called, it runs through the processes just before exiting; the text array is frozen, no longer modified, and the reference to last state value is discarded. Then after the error handler returns, it simulates runST with a call to runRW#, makes a copy of the text array where it left off, and continues its operations on the copy.
Changing the API to not take a callback at all and instead return an Either that tells you exactly where it stopped would still be much simpler, safer, and faster. I would prefer to consider that first.
Passing the State# token explicitly like this, while in an arbitrary monad, makes this quite difficult to review with confidence.
While I understand the desire to use Either, that doesn't really get me closer to my goal, which is why I was going for arbitrary monad which would meet my goal and allow for Either. Since insert-your-favorite-monads-here option has been removed from the table, I found myself at an impasse. So I took a step back, looked at the issue again, and reevaluated the problem.
With the requirements and constraints of
- Allow a means to abort decoding without raising an 
error. - No STT.
 - No arbitrary monad implementation.
 - No manually passing the 
State#token. - Indicate position in source 
ByteStreamwhere the error occurs. 
I think I found a solution that'll fit the bill.
This is actually close to what we were suggesting, and indeed much more satisfactory.
Excellent! A couple of questions:
- The byte position does not currently track across continuation invocations and instead resets back to zero. What's the opinion on whether it should track or the caller be responsible for tracking the cumulative byte position/total bytes read?
 - What version should the 
@sinceannotation be forstreamDecodeUtf8With'? 2.0.1, 2.1.0, or do I not need to worry about it yet? 
For the first question, I went ahead and let the library track the absolute position for the caller.
Stream decoders added for utf-16 and utf-32, and an Either decoder for ASCII. I'll work on the lazy options next.
We are getting closer to the final cut off before GHC 9.4. @Lysxia could you please complete your review on this? I don't have much bandwidth at the moment.
@Bodigrim How much time is left for the cut off? I'll allocate more cycles to shepherd this PR (sorry for dragging this @david-sledge !) but I feel like this is going to take a couple more rounds.
I'm not sure exactly. Given that @bgamari has not chased us yet, it would probably take at least a week or more.
Given https://github.com/haskell/text/pull/453#issuecomment-1191279987, I'll cut text-2.0.1 from the current master, and this PR will be delayed until text-2.0.2. I'm happy to release text-2.0.2 fairly soon, in time for GHC 9.4.2. Sorry @david-sledge, I hope it's not too much of delay for your purposes.
This branch no longer uses simdutf anymore. Should all of it and references to it be removed, or is it worth modifying simdutf to have a determine_utf8_prefix_length function?
simdutf is there for performance, so if it is no longer used that needs to be justified by benchmarks. I suspect it's going to be difficult to compete with a vectorized implementation.
I suspect with the issue you pointed out of where the next to last Word8 is invalid, this reimplementation will perform better, because the current implementation of decodeUtf8With2 has the same issue. However, in instances where the entire ByteString is valid, the current implementation will perform better. So one option is to only call c_is_valid_utf8 once at the beginning of each ByteString evaluation. Another option is what I mentioned previously of adding to simdutf a function that counts the number of contiguous valid UTF-8 Word8s at the beginning of a ByteString.
So one option is to only call
c_is_valid_utf8once at the beginning of eachByteStringevaluation.
As I understand it this is what we would accept currently, we want the "valid" case to be as fast as possible (@Bodigrim is that right?), and slowing down in other cases is fine. Modifying the C++ function to also count how far the valid bytes extend so we don't have to loop in Haskell would be a nice bonus, but it's not a requirement.
As I understand it this is what we would accept currently, we want the "valid" case to be as fast as possible (@Bodigrim is that right?), and slowing down in other cases is fine.
Correct.
Another option is what I mentioned previously of adding to simdutf a function that counts the number of contiguous valid UTF-8 Word8s at the beginning of a ByteString.
That would be awesome, but I don't know how feasible it is. Maybe someone can check it with simdutf developers?
The error in the workflow seems to be due to an unrelated test. I've submitted an issue regarding it.
@david-sledge could you please rebase?
Another option is what I mentioned previously of adding to simdutf a function that counts the number of contiguous valid UTF-8 Word8s at the beginning of a ByteString.
That would be awesome, but I don't know how feasible it is. Maybe someone can check it with simdutf developers?
Following up on this, it looks like the simdutf developers had a similar idea. The next version will have validate_utf8_with_errors which returns a struct containing the byte length, and an error code.
@Lysxia is it good to go?
Sorry this is taking so long @david-sledge. Tell me if you don't want to spend more time on this and I can take over.
- In
 DecodeResult, theMaybe wis not worth the complexity of an extra type argument (which is also extra awkward to use). Even if you encode the invariant that the remaining suffix is not empty that way, the types don't enforce that the suffix is in fact invalid, so the calling code will still end up having to handle the vacuous case where it can't find a reason for the error.- Constructing the suffix by appending in the case where the error happens in the first chunk. Just return the index and let the caller decide what to do.
 - The handling of chunks gets convoluted in trying to deduplicate the logic between the two chunks. In practice, you're unlikely to just have to decode two arbitrary chunks. Ideally, you'd like to handle one chunk at a time, possibly with some residual state from a previous chunk, which contains just a single incomplete code point. In fact, the first chunk could rather be the decoder state at the end of the previous chunk, which also lets you throw the previous chunk away completely before getting to the next chunk.
 - Stylistic nit: it's easier to read
 caseexpressions with disjoint patterns (Just _ | Nothing) rather than wildcards (Just _ | _), because I don't have to look up previous clauses to find out what constructor is being handled.
Ok, I've a design that'll address your points including the quadratic time issue. It'll be a bit of a rewrite, so I'm going to convert this to a draft.
Sorry this is taking so long @david-sledge. Tell me if you don't want to spend more time on this and I can take over.
I'm actually rather enjoying this. Though I'm finding I have less time to work on it than I did before.
- In
 DecodeResult, theMaybe wis not worth the complexity of an extra type argument (which is also extra awkward to use). Even if you encode the invariant that the remaining suffix is not empty that way, the types don't enforce that the suffix is in fact invalid, so the calling code will still end up having to handle the vacuous case where it can't find a reason for the error.
- Removed.
 
- Constructing the suffix by appending in the case where the error happens in the first chunk. Just return the index and let the caller decide what to do.
 
- Errors are now returned as an 
Intwhich represents the start position of the erroneous code point. 
- The handling of chunks gets convoluted in trying to deduplicate the logic between the two chunks. In practice, you're unlikely to just have to decode two arbitrary chunks. Ideally, you'd like to handle one chunk at a time, possibly with some residual state from a previous chunk, which contains just a single incomplete code point. In fact, the first chunk could rather be the decoder state at the end of the previous chunk, which also lets you throw the previous chunk away completely before getting to the next chunk.
 
- Can't throw it completely away. The values themselves need to be preserved in case the next 
ByteStringproves them to be valid. And can't copy it yet, because it could turn out to be invalid. However, I did change the logic so that the code point decoder state at the end of the previous chunk is that starting state of the next. So it doesn't have to go back and look at it again, except to (a) copy it into theTextwhen it's valid and (b) when it's invalid start evaluating theWord8at error position + 1 as the start of a code point. Also, the new functions don't construct any newByteStrings from the input (with one very small exception). 
- Stylistic nit: it's easier to read
 caseexpressions with disjoint patterns (Just _ | Nothing) rather than wildcards (Just _ | _), because I don't have to look up previous clauses to find out what constructor is being handled.
- Made the default (
_ -> ...) case patterns explicit. 
And lastly...
All of these
appendare going take quadratic time if you try to leniently decode a completely invalid string.
I took into consideration this issue with streamDecodeUtf8With in the redesign.
I'm not yet ready to take out of draft status yet, but there's enough to see the approach I'm taking. Also, I rolled back the changes to UTF-16 and 32 decoders for the moment, and just focused on UTF-8.
Thanks, I'll have a look this weekend.
Okay, now it's ready for review.