conduit
Add split function to Data.Text.Conduit (closes #205)
Don't merge; this is missing changelog updates and documentation. I just thought these tests could address @snoyberg's comment about not splitting correctly when the input is chunked. This test in particular:
```haskell
it "handles separators on a chunk boundary" $ do
    result <- CL.sourceList ["aX", "Xb"] C.$= CT.split "XX" C.$$ CL.consume
    result `shouldBe` ["a", "b"]
```
OK, I see it now; you're correct. The problem is that this will take up unbounded memory, which ideally would not be necessary. An algorithm like those available in http://www.stackage.org/package/stringsearch would be needed to avoid that.
@snoyberg I don't fully follow, could you clarify? It seems like unbounded memory usage is inherent to this type of function, since the separator may never arrive, and the text read so far has to be kept around in case it does. `lines` has the same issue in the case where there's no newline separator.

It seems like the algorithms in stringsearch could help with running time but not necessarily with memory usage?
On the other hand, I am getting really high memory usage when running this function on a 50 MB file that doesn't have the specified separator in it.
Imagine you're looking to break on the string "abc". If I get a chunk "qwertya", I know that "qwerty" will be before the chunk boundary, but "a" may be the beginning of it. I can therefore break it off, send it downstream, and continue. If the next chunk is "bc...", then I've found a boundary. If it starts with anything else, I know that it's not a boundary.
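The boundary trick described above can be sketched as a small helper (names and simplifications are mine, not from the PR; this only computes the held-back suffix, and leaves finding separators that fall entirely inside a chunk to the surrounding conduit):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- Given the separator and the current chunk, return the part that is
-- definitely not inside a separator (safe to send downstream now) and
-- the suffix that must be held back because it could be the start of a
-- separator spanning the chunk boundary.
breakChunk :: T.Text -> T.Text -> (T.Text, T.Text)
breakChunk sep chunk = T.splitAt (T.length chunk - held) chunk
  where
    -- length of the longest proper prefix of sep that is a suffix of chunk
    held = maximum [ n | n <- [0 .. T.length sep - 1]
                       , T.take n sep `T.isSuffixOf` chunk ]

main :: IO ()
main = print (breakChunk "abc" "qwertya")  -- ("qwerty","a")
```

With separator `"abc"` and chunk `"qwertya"`, only the trailing `"a"` could begin a boundary-spanning separator, so `"qwerty"` can be emitted immediately; the memory held between chunks is bounded by the separator length.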
You're completely correct that `lines` has unbounded memory usage, which is why it should only be used on data that you have control over (or with some extra mechanism to ensure memory protection). Instead, a function like `line` is probably a better interface for a function like this, since it allows for completely deterministic memory usage.
I will attempt to summarize:
- @snoyberg is pointing out that `splitOn` is not streaming: it builds up a list of results.
- @MaxGabriel is pointing out that, in terms of memory consumption, this should at worst double memory usage, because the original string is still in memory.
It seems to me that
- to avoid unbounded memory usage this would need to ask for a text of bounded length.
- this function should avoid the doubling of memory and send results back immediately.
@gregwebs I'm not positive that's what @snoyberg was saying, but I don't think that `Data.Text.splitOn` doubles memory usage. Its implementation takes in a `Text` value (backed by an `Array`) to be split up. Then, instead of creating new arrays for the split-up substrings, it reuses the original array with different indices for the beginning and end of each piece. I think that's what this line from the Data.Text haddocks is getting at:
Splitting functions in this library do not perform character-wise copies to create substrings; they just construct new Texts that are slices of the original.
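For illustration, here is what `splitOn` produces on a small input (the sharing of the underlying array isn't visible in the output, but per the haddocks quoted above, each result is a slice into the original buffer rather than a copy):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

main :: IO ()
main =
  -- Each piece below is a slice (offset + length) into the array
  -- backing "aXXbXXc", not a freshly copied buffer.
  print (T.splitOn "XX" "aXXbXXc")  -- ["a","b","c"]
```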
To clarify my thoughts: the currently proposed API necessarily collects multiple chunks into memory until it finds a boundary. This defeats the streaming nature of conduit. Instead, I'd prefer an API that provides an individual stream for each piece of text between separators. This is very similar to the difference between the `lines` and `line` functions in conduit-combinators.
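The shape of that interface difference can be sketched roughly as follows (signatures simplified and assumed for illustration, not copied from conduit-combinators):

```haskell
-- "lines"-style: materializes each piece as a single Text value, so a
-- piece that never hits a separator must be buffered in its entirety.
lines :: Monad m => ConduitT Text Text m ()

-- "line"-style: wraps an inner consumer that streams one piece's
-- chunks, so the caller decides how much of each piece to retain,
-- keeping memory usage deterministic.
line :: Monad m => ConduitT Text o m r -> ConduitT Text o m r
```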
@MaxGabriel thanks for pointing to the docs. I think the worst case could still be greater than 2x memory usage, though: if the separator matches at every other character, splitting creates a lot of new `Text` objects, each with its own constructor and offset/length fields on top of the shared array.