Streaming Re
These are notes from a discussion with @vouillon on how to make re able to stream.
- We should move `pos` and `last` out of the `info` record and pass them around explicitly, in particular in `loop`. Important: check spilling in the `loop` function.
- `Partial` would give an abstract type `partial` containing
  - an `Re.state`
  - a buffer of some sort
  - the current position in the buffer
  - an
- We would expose two functions:
  - Some function adding some new content to the buffer.
  - Some function taking `partial` and starting the matching again. This would be implemented using the `loop` function to match more things and then the `Re_automaton.status` function.
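To make the shape of the proposal concrete, here is a minimal sketch of what such an interface could look like. Everything in it is an assumption for illustration: the names `Partial`, `create`, `feed` and `resume` and the `status` type are hypothetical, and the "automaton" is a toy that matches a single literal anchored at the start of the stream, not Re's real one. It also copies every chunk into a `Buffer.t`, so it sidesteps the copying questions rather than answering them.

```ocaml
(* Hypothetical sketch: none of these names exist in Re. The "automaton" is a
   toy that matches one literal anchored at the start of the stream. *)
module Partial : sig
  type t
  type status = Running | Matched | Failed
  val create : literal:string -> t
  val feed : t -> string -> unit   (* add new content to the buffer *)
  val resume : t -> status         (* continue matching on unread content *)
end = struct
  type status = Running | Matched | Failed

  type t = {
    literal : string;      (* stands in for the compiled regexp *)
    mutable state : int;   (* stands in for the automaton state; -1 = dead *)
    buf : Buffer.t;        (* content received so far *)
    mutable pos : int;     (* current position in [buf] *)
  }

  let create ~literal = { literal; state = 0; buf = Buffer.create 64; pos = 0 }

  let feed t chunk = Buffer.add_string t.buf chunk

  (* Plays the role of a status query on the automaton state. *)
  let status t =
    if t.state < 0 then Failed
    else if t.state = String.length t.literal then Matched
    else Running

  (* Consume unread characters while the automaton is still running. *)
  let rec loop t =
    if status t = Running && t.pos < Buffer.length t.buf then begin
      let c = Buffer.nth t.buf t.pos in
      t.pos <- t.pos + 1;
      t.state <- (if c = t.literal.[t.state] then t.state + 1 else -1);
      loop t
    end

  let resume t = loop t; status t
end

let () =
  let p = Partial.create ~literal:"hello" in
  Partial.feed p "hel";
  assert (Partial.resume p = Partial.Running);  (* needs more input *)
  Partial.feed p "lo";
  assert (Partial.resume p = Partial.Matched)
```

The point of the sketch is only the division of labour: `feed` touches the buffer, `resume` runs the loop from the saved position and then reports a status.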
It should also be possible to say "The streaming is finished, you can match eol/eos/stop".
There are delicate questions of content copying when initializing and refilling the buffer. In particular, copying the matched string to initialize the buffer is clearly not acceptable.
Preferably the interface would be something that works for the following scenario:
- Regular expression (l+) and input chunk "hello" -> one substring
- Resume matching with new data: "hell" -> zero substrings, but the partial match will be matched later on.
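The scenario can be sketched with a toy scanner hard-coded for the pattern `l+`. This is not Re's API: `scanner`, `create`, `feed` and `finish` are hypothetical names, and the sketch copies the characters of an unfinished match into a buffer, which a real design may well want to avoid.

```ocaml
(* Toy streaming scanner hard-coded for the pattern l+ ; not Re's API. *)
type scanner = { pending : Buffer.t }  (* characters of an unfinished match *)

let create () = { pending = Buffer.create 16 }

(* Emit the pending match, if any, onto [acc] and reset it. *)
let flush s acc =
  if Buffer.length s.pending = 0 then acc
  else begin
    let m = Buffer.contents s.pending in
    Buffer.clear s.pending;
    m :: acc
  end

(* Feed one chunk; return the matches that were completed inside it. *)
let feed s chunk =
  let acc = ref [] in
  String.iter
    (fun c ->
      if c = 'l' then Buffer.add_char s.pending c
      else acc := flush s !acc)
    chunk;
  List.rev !acc

(* End of stream: a match that runs up to the end is now complete. *)
let finish s = List.rev (flush s [])

let () =
  let s = create () in
  assert (feed s "hello" = ["ll"]);  (* one substring *)
  assert (feed s "hell" = []);       (* zero substrings, "ll" is pending *)
  assert (feed s "lo" = ["lll"])     (* the pending match completes later *)
```

Here the second chunk yields nothing because the run of `l` might continue; it is only reported once a later chunk ends it or `finish` is called.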
Bonus: If the system doesn't store the contents of the substring somewhere (perhaps just by having partial matches refer to each fragment they are composed of), then there should be a way for the user of the library to do so. For instance, for very long matches the client could choose to forget parts of them or put them in storage other than memory. Or is this too rare a requirement? For 99.9% of cases the matches are going to be short.
> Resume matching with new data: "hell" -> zero substrings, but the partial match will be matched later on.
That would require a specific API for partial matches, not just the current API slightly augmented.
I don't understand the bonus.
I meant "Bonus" as in a feature that is probably not often useful, but use cases could be found.
For example: I could write a pattern that optimistically matches a certain kind of network traffic from an unframed network capture. The matches could be of unbounded length if the input stream is infinite. I would still be able to find the substrings, which may span multiple units of processing, in a capture file, even if I cannot hold the whole capture in memory.
To summarize: you want manual control over the internal buffer.
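One possible reading of "manual control", sketched under heavy assumptions: the library never copies match contents and instead reports each match as a sequence of fragments that point into client-owned chunks, so the client decides whether to keep, spill or drop them. The types `fragment` and `event` and the functions `create`, `feed` and `finish` are hypothetical, and the scanner is again a toy hard-coded for `l+`.

```ocaml
(* Hypothetical sketch, not Re's API: matches of l+ are reported as fragments
   referring to slices of client-owned chunks; nothing is copied here. *)
type fragment = { chunk_id : int; off : int; len : int }
type event = Fragment of fragment | Match_end

type scanner = { mutable in_match : bool; mutable next_chunk : int }

let create () = { in_match = false; next_chunk = 0 }

let feed s ~on_event chunk =
  let id = s.next_chunk in
  s.next_chunk <- id + 1;
  let n = String.length chunk in
  let start = ref (-1) in  (* start of the current run of 'l' in this chunk *)
  for i = 0 to n - 1 do
    if chunk.[i] = 'l' then begin
      if !start < 0 then start := i;
      s.in_match <- true
    end else begin
      (* The run, if any, ends here: report its slice of this chunk. *)
      if !start >= 0 then
        on_event (Fragment { chunk_id = id; off = !start; len = i - !start });
      start := (-1);
      if s.in_match then begin
        s.in_match <- false;
        on_event Match_end
      end
    end
  done;
  (* The run reaches the end of the chunk: report it, keep the match open. *)
  if !start >= 0 then
    on_event (Fragment { chunk_id = id; off = !start; len = n - !start })

(* End of stream closes a match that is still open. *)
let finish s ~on_event =
  if s.in_match then begin
    s.in_match <- false;
    on_event Match_end
  end

let () =
  let events = ref [] in
  let on_event e = events := e :: !events in
  let s = create () in
  feed s ~on_event "hell";
  feed s ~on_event "lo";
  finish s ~on_event;
  assert (List.rev !events =
          [ Fragment { chunk_id = 0; off = 2; len = 2 };
            Fragment { chunk_id = 1; off = 0; len = 1 };
            Match_end ])
```

With this shape, a client with an unbounded match can write each fragment to disk as it arrives and release the chunk, at the cost of a more awkward API for the common short-match case.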
> It should also be possible to say "The streaming is finished, you can match eol/eos/stop".
Do you also need to mark the beginning of a stream, so that bol, bos and start match as well?
How will group captures that span chunks work? Will that be possible at all?