
Streaming Large Files

Open pdubroy opened this issue 8 months ago • 3 comments

Discussed in https://github.com/ohmjs/ohm/discussions/468

Originally posted by metawrap-dev on February 21, 2024: Great project! I'm using it in something very interesting.

I'm wondering if it would be possible to modify InputStream and some of the semantics around checking eof, and stream data in chunks into the parser from an actual stream?

The goal is that the whole source file is never loaded into memory at once.

Use cases: Parsing an infinite network stream and parsing very large files.
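One way to picture the request: an input source that pulls chunks on demand and keeps only a bounded window buffered, while still exposing the position-based character access a parser needs. The sketch below is purely illustrative and is not Ohm's actual `InputStream` API; all names (`ChunkedInput`, `discardBefore`, etc.) are assumptions.

```javascript
// Hypothetical sketch, not Ohm's real InputStream: buffer chunks lazily,
// answer charAt/atEnd queries by absolute position, and allow dropping
// input the parser can no longer backtrack into.
class ChunkedInput {
  constructor(nextChunk) {
    this.nextChunk = nextChunk; // () => string | null (null = end of stream)
    this.buffer = '';
    this.offset = 0; // absolute stream position of buffer[0]
    this.ended = false;
  }

  // Pull chunks until absolute position `pos` is buffered (or the stream ends).
  _fill(pos) {
    while (!this.ended && pos >= this.offset + this.buffer.length) {
      const chunk = this.nextChunk();
      if (chunk === null) this.ended = true;
      else this.buffer += chunk;
    }
  }

  charAt(pos) {
    this._fill(pos);
    return this.buffer[pos - this.offset]; // undefined past end of stream
  }

  atEnd(pos) {
    this._fill(pos);
    return this.ended && pos >= this.offset + this.buffer.length;
  }

  // Release everything before `pos` -- only safe once the parser is
  // guaranteed never to backtrack past it.
  discardBefore(pos) {
    this.buffer = this.buffer.slice(pos - this.offset);
    this.offset = pos;
  }
}
```

Note that `atEnd` may have to block (or await) on the underlying stream, and `discardBefore` is only sound given the "cut point" guarantee discussed below.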

pdubroy avatar Jun 23 '25 12:06 pdubroy

+1 :)

metawrap-dev avatar Jun 23 '25 14:06 metawrap-dev

@metawrap-dev I've been thinking about this some more. A few thoughts:

  • In packrat parsing, which Ohm uses, the memo table almost always takes up much more memory than the input itself. So for the very large files use case, there would be a lot more benefit in figuring out ways to reduce the amount of memory used by the memo table.
  • In either case, dealing with the input stream is relatively easy, but we'd also have to handle unloading memo table data. It's tricky to figure out how to do that without adding some kind of notion of "cut points" (to limit backtracking) to the grammars.
  • A complication with the infinite network stream case is that we track the absolute input position, and this will eventually overflow.
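To make the "cut points" idea concrete, here is an illustrative sketch (not Ohm's actual internals) of a packrat memo table keyed by input position, where a `cut(pos)` operation evicts every column before a position the parser is guaranteed never to backtrack past. Memory then stays bounded by the distance between cut points rather than growing with the whole input.

```javascript
// Illustrative only: a position-keyed memo table with cut-point eviction.
class MemoTable {
  constructor() {
    this.byPos = new Map(); // pos -> Map(ruleName -> memoized result)
  }

  set(pos, rule, result) {
    if (!this.byPos.has(pos)) this.byPos.set(pos, new Map());
    this.byPos.get(pos).set(rule, result);
  }

  get(pos, rule) {
    const col = this.byPos.get(pos);
    return col ? col.get(rule) : undefined;
  }

  // A "cut" promises the parser will never backtrack before `pos`,
  // so all earlier memo columns can be released.
  cut(pos) {
    for (const p of this.byPos.keys()) {
      if (p < pos) this.byPos.delete(p);
    }
  }

  size() {
    let n = 0;
    for (const col of this.byPos.values()) n += col.size;
    return n;
  }
}
```

The hard part, as noted above, is not this bookkeeping but deciding where cuts are safe, which is why it seems to require a new notion in the grammar itself.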

My current thought is — it should be possible to parse a very large or infinite stream with Ohm, but Ohm itself shouldn't handle it transparently.

So, I'd prefer that we first have an example of something that doesn't work, and we see what needs to be changed in Ohm to make it work. E.g., one thing we'd probably need is the ability to tell Ohm "match as much as you can", rather than assuming that the entire input must be consumed (see #487).
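Absent a built-in prefix-match mode, one workaround is to wrap a whole-input matcher and retry progressively shorter prefixes. The interface below (`matchesWhole` as a predicate) is an assumption for illustration, and the quadratic cost is exactly why first-class support would be preferable.

```javascript
// Sketch: find the longest prefix accepted by a matcher that only
// succeeds when it consumes its entire input. O(n) retries, each of
// which may cost O(n) -- a stopgap, not a real solution.
function longestMatchingPrefix(matchesWhole, input) {
  for (let end = input.length; end > 0; end--) {
    const prefix = input.slice(0, end);
    if (matchesWhole(prefix)) return prefix;
  }
  return null; // no non-empty prefix matches
}
```

For example, with a matcher that accepts only digit strings, the input `"123abc"` yields the prefix `"123"`.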

pdubroy avatar Jul 29 '25 10:07 pdubroy

Good luck!

So, what likely won't work is something that is randomly and continuously deep and complex?

What would work is something mostly shallow? (Which I guess would be the general use case for a "feed"... with some rare excursions into complexity for some elements.)

As long as you can assume the feed consists of a number of independent top-level documents, you should be able to tear down everything for the next document in the feed?

If you want the feed to be infinite within the grammar of a single document, then I can see where that might be tricky if you are accumulating resources wrt. breadth and depth.
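The "independent top-level documents" case above could be driven by a loop like the following hypothetical sketch: buffer incoming chunks, carve off one complete document at a time (delimited by newlines here, purely as an assumption), parse it with completely fresh state, and discard everything before moving on. Positions are always relative to the current document, so nothing overflows, and no memo data survives between documents.

```javascript
// Hypothetical feed driver. `parseDocument` stands in for a per-document
// parse (e.g. a fresh grammar.match call); no state carries over between
// documents, so memory is bounded by the largest single document.
function makeFeedParser(parseDocument, onResult) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    let nl;
    while ((nl = buffer.indexOf('\n')) !== -1) {
      const doc = buffer.slice(0, nl);
      buffer = buffer.slice(nl + 1); // tear down everything for this document
      onResult(parseDocument(doc)); // fresh parse, positions start at 0
    }
  };
}
```

Chunk boundaries need not line up with document boundaries: a document split across two chunks is simply completed on the next `feed` call.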

metawrap-dev avatar Aug 18 '25 19:08 metawrap-dev