
Streaming Re

Open Drup opened this issue 10 years ago • 5 comments

These are notes from a discussion with @vouillon on how to make re support streaming.

  • We should move pos and last out of the info record and pass them around explicitly, in particular in loop. Important: check spilling in the loop function.
  • Partial matching would be exposed through an abstract type partial containing:
    • an Re.state
    • a buffer of some sort
    • the current position in the buffer
  • We would expose two functions:
    • A function that adds new content to the buffer.
    • A function that takes a partial and resumes matching. This would be implemented by using the loop function to match more input and then calling the Re_automaton.status function.

It should also be possible to say "The streaming is finished, you can match eol/eos/stop".
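
To make the shape of this proposal concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the names `partial`, `feed`, `resume`, `finish` and `status` do not exist in ocaml-re, and a toy hand-written automaton stands in for the real Re state. It only shows how a state, a buffer and a position could be bundled, and how an explicit end-of-stream signal changes the answer.

```ocaml
(* Hypothetical sketch: none of these names exist in ocaml-re. *)
type status = Running | Matched | Failed

(* Toy automaton standing in for Re's compiled state.
   [step s c] is the next state, or None when there is no transition. *)
type automaton = {
  start : int;
  step : int -> char -> int option;
  accepting : int -> bool;
}

type partial = {
  auto : automaton;
  mutable state : int option;  (* stands in for an Re state; None = dead *)
  buf : Buffer.t;              (* input received so far *)
  mutable pos : int;           (* current position in the buffer *)
  mutable finished : bool;     (* set once the end of the stream is declared *)
}

let create auto =
  { auto; state = Some auto.start; buf = Buffer.create 64;
    pos = 0; finished = false }

(* Add new content to the buffer. *)
let feed p chunk = Buffer.add_string p.buf chunk

(* Declare that the stream is finished. *)
let finish p = p.finished <- true

(* Consume everything buffered so far, then report where matching stands.
   This toy matches the whole stream, so success is only known at the end. *)
let resume p =
  let len = Buffer.length p.buf in
  let rec loop () =
    match p.state with
    | Some s when p.pos < len ->
      p.state <- p.auto.step s (Buffer.nth p.buf p.pos);
      p.pos <- p.pos + 1;
      loop ()
    | Some _ | None -> ()
  in
  loop ();
  match p.state with
  | None -> Failed
  | Some s when p.finished -> if p.auto.accepting s then Matched else Failed
  | Some _ -> Running

(* Toy automaton for "ab+" matched against the whole stream. *)
let ab_plus = {
  start = 0;
  step = (fun s c ->
    match s, c with
    | 0, 'a' -> Some 1
    | (1 | 2), 'b' -> Some 2
    | _ -> None);
  accepting = (fun s -> s = 2);
}

let () =
  let p = create ab_plus in
  feed p "ab";
  assert (resume p = Running);  (* more input may still arrive *)
  feed p "bb";
  finish p;
  assert (resume p = Matched)
```

Note that this sketch never discards consumed input and copies every chunk into the buffer, so it sidesteps the copying questions rather than answering them.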

There are delicate questions of content copying when initializing and refilling the buffer. In particular, copying the matched string to initialize the buffer is clearly not acceptable.

Drup, May 12 '15 14:05

Preferably the interface would be something that works for the following scenario:

  • Regular expression (l+) and input chunk "hello" -> one substring
  • Resume matching with new data: "hell" -> zero substrings, but the partial match will be matched later on.
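
The scenario above can be sketched with a toy scanner hard-wired to the pattern (l+). This is not ocaml-re code: `scanner`, `feed` and `finish` are hypothetical names, and the pending run is simply copied into a `Buffer.t`. The point is only the behavior: a run of `l` that reaches the end of a chunk is held back, because the next chunk may extend it.

```ocaml
(* Hypothetical streaming matcher hard-wired to the pattern (l+). *)
type scanner = {
  pending : Buffer.t;  (* run of 'l' not yet known to be complete *)
}

let create () = { pending = Buffer.create 16 }

(* Move the pending run, if any, onto the accumulator. *)
let take_pending sc acc =
  if Buffer.length sc.pending = 0 then acc
  else begin
    let m = Buffer.contents sc.pending in
    Buffer.clear sc.pending;
    m :: acc
  end

(* Feed one chunk; return only the matches known to be complete. *)
let feed sc chunk =
  let acc = ref [] in
  String.iter
    (fun c ->
      if c = 'l' then Buffer.add_char sc.pending c
      else acc := take_pending sc !acc)
    chunk;
  List.rev !acc

(* End of stream: a pending run can no longer grow, so it is a match. *)
let finish sc = List.rev (take_pending sc [])

let () =
  let sc = create () in
  assert (feed sc "hello" = ["ll"]);  (* one substring *)
  assert (feed sc "hell" = []);       (* "ll" is held back *)
  assert (feed sc "lo" = ["lll"])     (* completed across two chunks *)
```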

Bonus: If the system doesn't store the contents of the substring somewhere (perhaps just by partial matches referring to each fragment they are composed of), then there should be a way for the user of the library to do so. For instance, for very long matches the client could choose to forget parts of them or put them to a storage other than memory. Or is this too rare of a requirement? For 99.9% of cases the matches are going to be short.

eras, May 12 '15 15:05

> Resume matching with new data: "hell" -> zero substrings, but the partial match will be matched later on.

That would require a specific API for partial matches, not just the current API slightly augmented.

I don't understand the bonus.

Drup, May 12 '15 16:05

I meant "Bonus" as in a feature that is probably not often useful, but use cases could be found.

For example: I could write a pattern that optimistically matches a certain kind of network traffic in an unframed network capture. The matches could be of unbounded length if the input stream is infinite. I would still be able to find the substrings, which may span multiple units of processing, in a capture file, even if I cannot hold the whole capture in memory.
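
One way to picture this is a matcher that never stores match contents and only reports which slice of which chunk belongs to the current match, leaving storage to the caller. The sketch below is purely illustrative: the `event` type, `feed` and `finish` are hypothetical, and the pattern is again hard-wired to (l+) rather than driven by ocaml-re.

```ocaml
(* Hypothetical: matches of (l+) are reported as fragments, never copied. *)
type event =
  | Fragment of { chunk : int; off : int; len : int }
    (* a slice of chunk number [chunk] that belongs to the current match *)
  | End_of_match

type scanner = {
  mutable chunk_no : int;   (* index of the next chunk to be fed *)
  mutable in_match : bool;  (* true while a run of 'l' is still open *)
}

let create () = { chunk_no = 0; in_match = false }

(* Feed one chunk and call [emit] for each fragment or match end found. *)
let feed sc ~emit chunk =
  let n = String.length chunk in
  (* Start of the open run within this chunk; 0 if it began earlier. *)
  let start = ref 0 in
  for i = 0 to n - 1 do
    if chunk.[i] = 'l' then begin
      if not sc.in_match then begin
        sc.in_match <- true;
        start := i
      end
    end else if sc.in_match then begin
      if i > !start then
        emit (Fragment { chunk = sc.chunk_no; off = !start; len = i - !start });
      emit End_of_match;
      sc.in_match <- false
    end
  done;
  (* A run reaching the end of the chunk stays open for the next one. *)
  if sc.in_match && n > !start then
    emit (Fragment { chunk = sc.chunk_no; off = !start; len = n - !start });
  sc.chunk_no <- sc.chunk_no + 1

(* End of stream: close any run that is still open. *)
let finish sc ~emit =
  if sc.in_match then begin
    emit End_of_match;
    sc.in_match <- false
  end
```

With this shape, a client that cannot hold a long match in memory could write each fragment to disk as it arrives, or drop it, while a client with short matches could simply concatenate the slices.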

eras, May 12 '15 18:05

To summarize: you want manual control over the internal buffer.

Drup, May 12 '15 18:05

> It should also be possible to say "The streaming is finished, you can match eol/eos/stop".

Do you also need to mark the beginning of a stream, so that bol, bos, and start match as well?

How will group captures that span chunks work? Or will that be possible at all?

rgrinberg, Sep 23 '16 14:09