Strict `Text` wrapper unnecessarily decodes to `Char` for `text-2.*`
Looking at the code:
https://github.com/haskell/alex/blob/e65958c688ba04f8b14c888d7b55f16532510f51/data/AlexWrappers.hs#L88-L90
This can be replaced by:
```haskell
type AlexInput = (Int,            -- current offset
                  Data.Text.Text) -- input string
```
https://github.com/haskell/alex/blob/e65958c688ba04f8b14c888d7b55f16532510f51/data/AlexWrappers.hs#L98-L105
This can be replaced by:
```haskell
-- Needs Data.Text.Internal (Text(..)) and Data.Text.Array (unsafeIndex)
-- from text >= 2.0.
alexGetByte :: AlexInput -> Maybe (Byte, AlexInput)
alexGetByte (cur, input) = case input of
  Text arr off len
    | cur < len -> Just (unsafeIndex arr (off + cur), (cur + 1, input))
    | otherwise -> Nothing
```
The only thing "complicated" is alexInputPrevChar, basically you go back from cur one byte at a time until you have seen two character boundaries and then do a character decode at that position.
alexScanTokens also needs to be adjusted:
```haskell
alexScanTokens input@(Text arr off _) = go (0, input)
  where
    go cur@(start, _) =
      case alexScan cur 0 of
        AlexEOF -> []
        AlexError _ -> error "lexical error"
        AlexSkip new _len -> go new
        AlexToken new@(end, _) _len act ->
          act (Text arr (off + start) (end - start)) : go new
```
Also note that the current code
https://github.com/haskell/alex/blob/e65958c688ba04f8b14c888d7b55f16532510f51/data/AlexWrappers.hs#L526
is O(len), while the new code is O(1).
I modified https://gitlab.com/FinnBender/haskell-parsing-benchmarks to gauge the impact, and it improved performance by roughly 25%.
This figure includes reading the file from disk and parsing; if we looked at lexing alone, the difference should be even more significant.
I noticed that I get performance close to strict-bytestring, but still somewhat worse, whereas I expected them to be on par. I'm not sure where the additional cycles are spent; possibly it's in the IO part (IIRC, there is currently no way to read from disk directly into a byte array, so I think creating the `Text` value copies the input one additional time; I'm not sure whether that is enough to explain the discrepancy).
Did you try https://hackage-content.haskell.org/package/text-2.1.3/docs/Data-Text-IO-Utf8.html ?
Yes, of course. Otherwise I think the discrepancy would be more pronounced.
Taking a quick look at the code now:
1. It does a sequential scan over the input for UTF-8 validation.
2. It converts the `ByteString` to a `StrictTextBuilder` and then to `Text`, which basically copies the input one more time.

I imagine that (1) + (2) may very well explain the discrepancy I saw between `Text` and `ByteString`.
Somehow yesterday I forgot about (1). Depending on what Alex does internally, the validation may or may not be necessary. If it's not necessary, then I guess you could read the input into a `ShortByteString` and then do an (unsafe) O(1) conversion to `Text`.
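To illustrate, such an unsafe conversion could look roughly like this (untested; it goes through the internal representations of `bytestring` and `text >= 2.0`, both of which wrap an unpinned `ByteArray#`, and it is only sound if the bytes are already known to be valid UTF-8):

```haskell
{-# LANGUAGE MagicHash #-}

import qualified Data.ByteString.Short as SBS
import Data.ByteString.Short.Internal (ShortByteString (..))
import qualified Data.Text.Array as A
import Data.Text.Internal (Text (..))

-- UNSAFE: the caller must guarantee that the bytes are valid UTF-8.
-- O(1): both types wrap the same kind of ByteArray#, so no copy is made.
unsafeShortByteStringToText :: ShortByteString -> Text
unsafeShortByteStringToText sbs@(SBS ba#) =
  Text (A.ByteArray ba#) 0 (SBS.length sbs)
```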
I guess if you are trying to squeeze out every last bit of performance, and if your input regularly comes from a file, then you could do the lexing on `ByteString` and construct `Text` values (and do UTF-8 validation, if necessary) in `alexScanTokens` on `AlexToken` (see the sketch below).
- With this approach, the `input` is not copied one additional time, since no `Text` value is constructed up front.
- Only the parts of the `input` that you retain in the AST are copied.
One nice property you get for free with this approach is that no retained value references the original input anymore, so the input can be garbage collected.
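A rough, untested sketch of that variant, mirroring the `alexScanTokens` above (the offset-based `AlexInput` and `alexGetByte` for `ByteString` are analogous to the `Text` versions and omitted; `decodeUtf8` validates and copies only the matched slice, and `decodeUtf8Lenient` or an unsafe decode could be substituted if validation is not wanted here):

```haskell
import qualified Data.ByteString as B
import Data.Text.Encoding (decodeUtf8)

alexScanTokens input = go (0, input)
  where
    go cur@(start, _) =
      case alexScan cur 0 of
        AlexEOF -> []
        AlexError _ -> error "lexical error"
        AlexSkip new _len -> go new
        AlexToken new@(end, _) _len act ->
          -- decode (and validate) only the slice that is actually retained,
          -- so tokens never reference the original ByteString
          act (decodeUtf8 (B.take (end - start) (B.drop start input))) : go new
```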