megaparsack
parse-tokens with streams
It could be useful to have parse-tokens accept a stream/c in addition to a listof, for performance reasons (e.g. so errors near the beginning of the input can be reported without first producing all the tokens)
I would also find this useful -- or is there already a way of lazily parsing a file?
I agree with this idea in principle, but in practice, Racket streams seem to be relatively slow. I’ve been wanting to figure out how to create a more efficient streaming abstraction that would help with this issue, but I haven’t gotten around to it.
hmm, that makes sense.. thanks for the explanation!
so normally I'd just read the whole file into a string and pass it to parse-string? or does the library offer another way of parsing files? (do you know any good examples of libraries that use this one, so I can learn by example?)
Yes, right now, parsing with megaparsack requires being able to hold the whole file in memory. This is definitely suboptimal, but to be fair, the parsec strategy for parsing often leads to this kind of memory-hungry behavior even when the contents are streamed. Why is that? Well, the whole stream has to be retained whenever the parser encounters a choice point, since it has to be able to backtrack. I wouldn’t necessarily recommend using a parsec-style parser if you need to parse gigabytes of data.
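To make both points concrete, here is a minimal sketch. The parse-file helper is hypothetical, not part of megaparsack's API; it assumes parse-string and parse-result! from megaparsack and megaparsack/text, and reads the whole file into memory before parsing. The keyword/p example then shows where a choice point forces retention of input for backtracking:

```racket
#lang racket
(require megaparsack megaparsack/text
         racket/file)

;; Hypothetical helper: read the entire file into a string, then parse it.
;; parse-string returns an either; parse-result! extracts the value or
;; raises a parse error. The path is passed as the source name so error
;; messages point at the file.
(define (parse-file parser path)
  (parse-result! (parse-string parser (file->string path) path)))

;; The backtracking cost mentioned above: both branches share the prefix
;; "for", so the first branch is wrapped in try/p. Everything consumed
;; since the choice point must be retained so or/p can rewind and try
;; the second branch.
(define keyword/p
  (or/p (try/p (string/p "format"))
        (string/p "for")))
```

With streamed input, the rewind in the second example is exactly what forces the parser to keep the already-read portion of the stream alive.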
That said, the megaparsack implementation could and should be more optimized. I’ve mostly just held off on doing so because I just haven’t needed to, and I think any optimizations I add should be driven by concrete use cases. Feel free to report bugs with example programs that are unacceptably slow if you find them. However, also keep in mind that megaparsack prioritizes API niceness over performance, and if you really need something speedy, parsack might be a better choice.
As for examples, this very repo includes an example JSON parser, which parses most JSON (though it isn’t completely standards-compliant), and my own tulip-racket repo includes another example use. For examples that aren’t mine, I’m also aware of Konrad Hinsen’s leibniz language, which uses megaparsack internally.
I wouldn’t necessarily recommend using a parsec-style parser if you need to parse gigabytes of data.
indeed! this is not my use-case, so it's no problem having the files in-memory!
any optimizations I add should be driven by concrete use cases. Feel free to report bugs with example programs that are unacceptably slow if you find them.
that's very wise; I'll make sure to open an issue if that happens! thank you!
thanks for the examples, I'll make sure to check them out!