lark icon indicating copy to clipboard operation
lark copied to clipboard

A streamable parser

Open swamidass opened this issue 1 year ago • 8 comments

Suggestion

Sometimes we need to parse and transform a very large stream that can't be fit into memory all at once. Lark has a way of transforming a tree from a parser, without constructing the tree: https://lark-parser.readthedocs.io/en/latest/json_tutorial.html#part-5-optimizing , using this API:

json_parser = Lark(json_grammar, start='value', parser='lalr', transformer=TreeToJson()) print( json_parser.parse(f.read()) )

The problem here is that f.read() requires reading the whole string into memory. Ideally, the API would allow something like this:

json_parser = Lark(json_grammar, start='value', parser='lalr', transformer=TreeToJson()) for emitted_object in json_parser.stream(readable_stream): work(emitted_object)

Here, TreeToJson could be implemented to (for example) emit objects that match a jsonpath pattern. This would require extending the transformer API so that the transformer has a way of notifying the parser when an object is ready to be emitted.

Less ideally, but still usable, syntax like this could be used. The advantage of this would be that the parser-transformer internals might not need to be changed much at all. Though it would require users of the library to be clever enough in how they write the transformer object to make this possible:

json_parser = Lark(json_grammar, start='value', parser='lalr', transformer=transformer) json_parser.stream(readable_stream) for emitted_object in iter(transformer): work(emitted_object)

If this could be done, it would make Lark capable of parsing infinite inputs (e.g. an infinite stream of JSON objects) with limited memory and all the benefits to performance that come from this pattern.

Describe alternatives you've considered

It might be possible to implement this as a function or library that operates on top of lark. However, such a function would require some intimate knowledge of the non-public API of lark to work. The ideal situation would be for Lark itself to implement this function.

swamidass avatar Nov 04 '22 21:11 swamidass