alex icon indicating copy to clipboard operation
alex copied to clipboard

Strict `Text` wrapper unnecessarily decodes to `Char` for `text-2.*`

Open sol opened this issue 3 months ago • 6 comments

Looking at the code:

https://github.com/haskell/alex/blob/e65958c688ba04f8b14c888d7b55f16532510f51/data/AlexWrappers.hs#L88-L90

This can be replaced by:

type AlexInput = (Int,            -- current offset
                  Data.Text.Text) -- input string

https://github.com/haskell/alex/blob/e65958c688ba04f8b14c888d7b55f16532510f51/data/AlexWrappers.hs#L98-L105

This can be replaced by:

alexGetByte :: AlexInput -> Maybe (Byte,AlexInput)
alexGetByte (cur, input) = case input of
  Text arr off len
    | cur < len = Just (unsafeIndex arr (off + curr) (cur + 1, input))
    | otherwise = Nothing

The only thing "complicated" is alexInputPrevChar, basically you go back from cur one byte at a time until you have seen two character boundaries and then do a character decode at that position.

sol avatar Sep 19 '25 12:09 sol

alexScanTokens also needs to be adjusted:

alexScanTokens input@(Text arr off _) = go (0, input)
  where go cur@(start, _) =
          case alexScan cur 0 of
                AlexEOF -> []
                AlexError _ -> error "lexical error"
                AlexSkip  new _len  -> go new
                AlexToken new _len act -> act match : go new
                   where
                     match = Text arr (off + start) (end - start)
                     (end, _) = new

sol avatar Sep 19 '25 13:09 sol

Also note that the current code

https://github.com/haskell/alex/blob/e65958c688ba04f8b14c888d7b55f16532510f51/data/AlexWrappers.hs#L526

is O(len), while the the new code is O(1).

sol avatar Sep 19 '25 13:09 sol

I modified https://gitlab.com/FinnBender/haskell-parsing-benchmarks to gauge the impact and it roughly improved performance by 25%.

This includes reading the file from disk and parsing. If we would only look at lexing, then the performance difference should be more significant.

I noticed that I get performance close to strict-bytestring, but still somewhat worse. I expected them to be on par. I'm not sure where the additional cycles are spent, possibly it's in the IO part (IIRC, there is currently no way to read from disk directly into a byte array, so I think creating the Text value copies the input one additional time, not sure if this is enough to explain the discrepancy).

sol avatar Sep 19 '25 18:09 sol

I'm not sure where the additional cycles are spent, possibly it's in the IO part (IIRC, there is currently no way to read from disk directly into a byte array, so I think creating the Text value copies the input one additional time, not sure if this is enough to explain the discrepancy).

Did you try https://hackage-content.haskell.org/package/text-2.1.3/docs/Data-Text-IO-Utf8.html ?

Bodigrim avatar Sep 19 '25 19:09 Bodigrim

Did you try https://hackage-content.haskell.org/package/text-2.1.3/docs/Data-Text-IO-Utf8.html ?

Yes, of course. Otherwise I think the discrepancy would be more pronounced.

Taking a quick look at the code now:

  1. It does a sequential scan over the input for UTF-8 validation.
  2. It converts ByteString to StrictTextBuilder and then to Text which is basically copying the input one time.

I imagine that (1) + (2) may very well explain the discrepancy I saw between Text and ByteString.

Somehow yesterday I forgot about (1). Depending on what Alex does internally this may or may not be necessary. If it's not necessary then I guess you could read the input into a ShortByteString and then do an (unsafe) O(1) conversion to Text.

sol avatar Sep 20 '25 00:09 sol

Somehow yesterday I forgot about (1). Depending on what Alex does internally this may or may not be necessary. If it's not necessary then I guess you could read the input into a ShortByteString and then do an (unsafe) O(1) conversion to Text.

I guess if you are trying squeeze out every inch of performance and if your input is regularly coming from a file, then you could do the lexing on ByteString and construct Text values (and do UTF-8 validation, if necessary) in alexScanTokens on AlexToken.

  • With this approach, the input is not copied one additional time as no Text value is constructed initially.
  • Only the parts of the input that you retain in the AST are copied.

One nice property you're getting for free with this approach is that all retained values don't reference the original input anymore so that the input can be garbage collected.

sol avatar Sep 20 '25 01:09 sol