feature request: support incremental/streaming lexing

Open cartazio opened this issue 10 years ago • 11 comments

In a number of application domains I need to handle streaming inputs incrementally, and having a streaming lexer/tokenization layer helps immensely when writing the layers on top.

If adding such capabilities to Alex is viable, I'd be very interested in trying to help add them (rather than having to reinvent a lot of the tooling that Alex provides).

Would this be a feature you'd be open to having added, @simonmar?

cartazio avatar Jun 29 '15 04:06 cartazio

Even better would be if Alex already tacitly supports this and I'm simply not understanding it yet :)

cartazio avatar Jun 29 '15 04:06 cartazio

I'd happily accept a patch, provided it doesn't compromise the speed of non-streaming lexers.

simonmar avatar Jun 29 '15 09:06 simonmar

@cartazio In many cases this can already be made to work, though it requires knowing something about the maximum token length. For example, we have implemented a streaming JSON lexer using Alex. This relies on the fact that there is a largest possible token length (around 6 bytes for JSON, iirc), so when the lexer returns an error at the end of a chunk we can tell whether it ran out of input or hit a real failure: if it fails within 6 bytes of the end, we need to supply more input and try again, but if more than that remains after the failure point, it's a real lex error.

dcoutts avatar Sep 15 '16 14:09 dcoutts
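
To make the chunk-boundary trick concrete, here is a minimal sketch of the driving loop. `scanChunk`, `Token`, and `maxTokenLen` are hypothetical stand-ins, not Alex's actual API: assume the generated scanner reports the tokens it managed to lex plus the byte offset of any failure.

```haskell
import qualified Data.ByteString as BS

data Token = Token          -- placeholder for the real token type

-- Longest possible token in the grammar (around 6 bytes for JSON,
-- per the comment above).
maxTokenLen :: Int
maxTokenLen = 6

-- Hypothetical stand-in for the Alex-generated scanner: returns the
-- tokens lexed from the chunk, plus (Just off) if it failed at byte
-- offset off, or Nothing if it consumed the chunk cleanly.
scanChunk :: BS.ByteString -> ([Token], Maybe Int)
scanChunk = undefined

-- Drive the scanner over chunks, distinguishing a token cut off at
-- the chunk boundary from a genuine lex error.
lexStream :: Monad m
          => m BS.ByteString             -- action supplying more input
          -> BS.ByteString               -- current chunk
          -> m (Either String [Token])
lexStream more chunk =
  case scanChunk chunk of
    -- Clean end of chunk: try to fetch more; an empty result is EOF.
    (toks, Nothing) -> do
      next <- more
      if BS.null next
        then pure (Right toks)
        else fmap (toks ++) <$> lexStream more next
    (toks, Just off)
      -- Failed within maxTokenLen of the end: the token may simply be
      -- split across chunks, so fetch more input and rescan from the
      -- failure point.
      | BS.length chunk - off < maxTokenLen -> do
          next <- more
          if BS.null next
            then pure (Left ("lex error at end of input, offset " ++ show off))
            else fmap (toks ++) <$> lexStream more (BS.drop off chunk <> next)
      -- Failed with more than maxTokenLen bytes remaining: a real error.
      | otherwise ->
          pure (Left ("lex error at offset " ++ show off))
```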

Interesting. I have many questions :) Where is your Alex lexer for JSON? Do you have a parser too? Is it faster than aeson?

simonmar avatar Sep 16 '16 13:09 simonmar

I have a properly streaming one I wrote at work a year ago that has way better memory behavior and incremental ingestion.

cartazio avatar Sep 16 '16 14:09 cartazio

I am very happy for you.

simonmar avatar Sep 16 '16 14:09 simonmar

I can see about cleaning it up and getting that onto Hackage if you want :)

cartazio avatar Sep 16 '16 15:09 cartazio

I got something working that is pull-based, and I'd be happy to try and get it cleaned up and merged.

You supply a monadic action that can be used to get additional data, and a maximum token length.

The lexer treats an empty result from the action as EOF. If there is a lex error, it checks for additional data and rescans if the remaining data is shorter than the user-supplied maximum token length. It also attempts to get more data at EOF.

There is probably room for improvement in distinguishing errors caused by EOF from other errors, but this is a rough first cut.

It currently works only for bytestrings, with code borrowed from the monad template. It could accommodate user state fairly readily, but I didn't need that, so it isn't written.

iteratee avatar Dec 20 '22 23:12 iteratee
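
A sketch of what the caller might supply to such a pull-based wrapper, under the assumptions above; `StreamLexer`, `streamTokens`, and `Token` are hypothetical names, not an existing Alex API:

```haskell
import qualified Data.ByteString as BS
import System.IO (IOMode (ReadMode), withFile)

data Token = Token        -- placeholder for the generated token type

-- What the caller supplies to the pull-based lexer described above.
data StreamLexer m = StreamLexer
  { refill      :: m BS.ByteString  -- empty ByteString means EOF
  , maxTokenLen :: Int              -- longest token in the grammar
  }

-- Hypothetical driver standing in for the proposed wrapper; it would
-- run the generated scanner, calling 'refill' on EOF and rescanning
-- when an error falls within 'maxTokenLen' bytes of the chunk end.
streamTokens :: Monad m => StreamLexer m -> m (Either String [Token])
streamTokens = undefined

-- Example: lex a file in 32 KiB chunks. BS.hGetSome returns an empty
-- ByteString at end of file, which the lexer treats as EOF.
lexFile :: FilePath -> IO (Either String [Token])
lexFile path = withFile path ReadMode $ \h ->
  streamTokens StreamLexer
    { refill      = BS.hGetSome h 32768
    , maxTokenLen = 6
    }
```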

Ooo, this sounds amazing!

cartazio avatar Dec 21 '22 03:12 cartazio

This repo has the parser I mentioned: https://github.com/cartazio/streaming-machine-json

cartazio avatar Dec 21 '22 03:12 cartazio

@iteratee If this is fully backwards-compatible and does not affect performance of what we have now, a PR would be welcome!

@simonmar wrote:

I'd happily accept a patch, provided it doesn't compromise the speed of non-streaming lexers.

andreasabel avatar Apr 14 '23 19:04 andreasabel