haskell-tree-sitter

Streaming interface

Open patrickt opened this issue 4 years ago • 7 comments

Right now, all parsing assumes you have the entire source file paged into memory as a strict ByteString. I would love to see an option that incorporates streaming-bytestring, which is IME the best way to express lazily-read bytestreams:

parseStreaming :: Unmarshal t => Ptr TS.Language -> Streaming.ByteString.ByteString IO () -> IO (Either String t)
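To make the shape of such an interface concrete, here is a minimal, self-contained sketch using lazy ByteStrings as a stand-in for streaming-bytestring (which is not a GHC boot package). The names `consumeChunks`, `totalBytes`, and `sampleSource` are hypothetical, not part of haskell-tree-sitter; the point is only that a lazily-read stream is a sequence of strict chunks that a parser driver can fold over one at a time:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- Hypothetical driver: fold over the stream's strict chunks one at a
-- time, so at most one chunk is resident beyond whatever the parser
-- itself retains. A real parseStreaming would hand each chunk to
-- tree-sitter instead of to a pure accumulator.
consumeChunks :: (acc -> B.ByteString -> acc) -> acc -> BL.ByteString -> acc
consumeChunks step z = foldl step z . BL.toChunks

-- Example consumer: total byte count, computed chunk by chunk.
totalBytes :: BL.ByteString -> Int
totalBytes = consumeChunks (\n c -> n + B.length c) 0

-- A source that arrives in two chunks, as it might from a socket.
sampleSource :: BL.ByteString
sampleSource = BL.fromChunks (map BC.pack ["module Main ", "where\n"])

main :: IO ()
main = print (totalBytes sampleSource)  -- prints 18
```

The real signature above would additionally thread the `IO` effects of reading through the stream type, which is exactly what streaming-bytestring's `ByteString IO ()` expresses.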

patrickt avatar Sep 24 '19 13:09 patrickt

I’m generally in favour of streaming/incrementality, but I think there are a couple of challenges here we’ll have to overcome for this to actually result in any performance improvements:

  1. The resulting AST will be entirely resident in memory.
  2. Almost anything reported to a user will require arbitrary slices from the source.

robrix avatar Sep 24 '19 16:09 robrix

  1. Yeah, there’s not much we can do about that. At least we can have predictable resource consumption when ingesting streams of bytes.
  2. I have done something like this. What you do is copy the input bytestring, feeding the copy into the parser and keeping the original around from which to slice.
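The copy-and-keep pattern in point 2 can be sketched with lazy ByteStrings (a hedged analogue: streaming-bytestring's own stream-duplication works on its effectful stream type, but the sharing argument is the same). Every name below is illustrative, not API. Because chunks are immutable, the "copy" fed to the parser and the original retained for slicing refer to the same underlying chunk buffers; no bytes are copied:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- The input, arriving as three strict chunks.
sampleSource :: BL.ByteString
sampleSource = BL.fromChunks (map BC.pack ["foo ", "bar ", "baz\n"])

-- "Feed the copy into the parser": here the stand-in parser just
-- counts bytes while walking the chunks.
parsedLength :: Int
parsedLength = sum (map B.length (BL.toChunks sampleSource))

-- "Keep the original around from which to slice": a later slice for
-- error reporting reuses the retained chunks rather than copying them.
sliceForDiagnostic :: BL.ByteString
sliceForDiagnostic = BL.take 3 (BL.drop 4 sampleSource)  -- "bar"

main :: IO ()
main = do
  print parsedLength                            -- prints 12
  BC.putStrLn (BL.toStrict sliceForDiagnostic)  -- prints bar
```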

patrickt avatar Sep 24 '19 20:09 patrickt

  • I have done something like this. What you do is copy the input bytestring, feeding the copy into the parser and keeping the original around from which to slice.

I think I must be misunderstanding you; wouldn’t that mean we’ve got the entire file in memory twice?

robrix avatar Sep 24 '19 21:09 robrix

Nope, copying a streaming bytestring is O(1). It duplicates only the constructors associated with the concrete chunks, not the chunks themselves.

patrickt avatar Sep 24 '19 21:09 patrickt

Ah, ok; but we still end up having the whole thing paged into memory, right? Which kind of obviates the value of streaming, no?

robrix avatar Sep 25 '19 01:09 robrix

  • Ah, ok; but we still end up having the whole thing paged into memory, right? Which kind of obviates the value of streaming, no?

Answering my own question: we don’t keep the whole thing paged in at once unless we slice the whole thing. We (presumably? I don’t actually know this for sure) lose O(1) slicing, but we limit space consumption for the common case of slicing only parts of the document out.

Losing O(1) slicing could still be a challenge, but it’s one we can probably work with since we don’t tend to traverse the tree random-access.
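What non-O(1) slicing over retained chunks might look like, as a rough sketch (the function and its name are made up for illustration): extracting a byte range means walking the chunk spine, which is O(number of chunks) rather than the O(1) of slicing one flat buffer, but the strict chunks are shared rather than copied, and chunks wholly outside the range can be dropped or never forced at all:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Hypothetical: slice the byte range [start, start+len) out of a
-- retained chunk list by skipping chunks before the range and taking
-- pieces of the chunks that overlap it.
sliceRange :: Int -> Int -> [B.ByteString] -> B.ByteString
sliceRange = go
  where
    go _ 0 _  = B.empty
    go _ _ [] = B.empty
    go s n (c:cs)
      | s >= B.length c = go (s - B.length c) n cs  -- skip whole chunk
      | otherwise =
          let piece = B.take n (B.drop s c)         -- shares c's buffer
          in piece `B.append` go 0 (n - B.length piece) cs

main :: IO ()
main =
  -- "line one\nline two\n" split across three chunks; bytes 4..10.
  BC.putStrLn (sliceRange 4 7 (map BC.pack ["line on", "e\nline ", "two\n"]))
  -- prints " one\nli"
```

Since tree traversal tends to visit source ranges roughly in order, a real implementation could also keep a cursor into the chunk list and amortize the walk, rather than restarting from the head for each slice.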

Plus we might also be able to take advantage of tree-sitter’s incrementality at some point, albeit at the cost of keeping the Tree around.

Ok, I think I have my head wrapped around this now; sounds like a plan!

robrix avatar Sep 25 '19 14:09 robrix

In production don't we already have a maximum file size that we're willing to process? I think that's true both when we receive blob content via a network RPC call and when we load the blob from disk. (/cc @tclem @joshvera to confirm) If so, the limit should be small enough that paging an entire copy of the file into memory isn't that big of an issue. We're not dealing with multi-GB data files here.

dcreager avatar Sep 25 '19 14:09 dcreager