conduit
Add split function to Data.Text.Conduit (closes #205)
Don't merge; this is missing changelog updates and documentation. I just thought these tests could address @snoyberg's comment about not splitting correctly when the input is chunked. This test in particular:
```haskell
it "handles separators on a chunk boundary" $ do
    result <- CL.sourceList ["aX", "Xb"] C.$= CT.split "XX" C.$$ CL.consume
    result `shouldBe` ["a", "b"]
```
OK, I see it now; you're correct. The problem is that this will take up unbounded memory, which ideally would not be necessary. An algorithm like those available in http://www.stackage.org/package/stringsearch would be needed to avoid that.
@snoyberg I don't fully follow, could you clarify? It seems like unbounded memory usage is inherent to this type of function, since the separator may never arrive, and the text read so far has to be kept around in case it does. `lines` has the same issue in the case where there's no newline separator.

It seems like the algorithms in stringsearch could help with running time but not necessarily with memory usage?
On the other hand, I am getting really high memory usage when running this function on a 50 MB file that doesn't have the specified separator in it.
Imagine you're looking to break on the string "abc". If I get a chunk "qwertya", I know that "qwerty" will be before the chunk boundary, but "a" may be the beginning of it. I can therefore break it off, send it downstream, and continue. If the next chunk is "bc...", then I've found a boundary. If it starts with anything else, I know that it's not a boundary.
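The boundary trick described above can be sketched as a small helper (names and simplifications are mine, not from the PR; this only computes the held-back suffix, and leaves finding separators that fall entirely inside a chunk to the surrounding conduit):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- Given the separator and the current chunk, return the part that is
-- definitely not inside a separator (safe to send downstream now) and
-- the suffix that must be held back because it could be the start of a
-- separator spanning the chunk boundary.
breakChunk :: T.Text -> T.Text -> (T.Text, T.Text)
breakChunk sep chunk = T.splitAt (T.length chunk - held) chunk
  where
    -- length of the longest proper prefix of sep that is a suffix of chunk
    held = maximum [ n | n <- [0 .. T.length sep - 1]
                       , T.take n sep `T.isSuffixOf` chunk ]

main :: IO ()
main = print (breakChunk "abc" "qwertya")  -- ("qwerty","a")
```

With separator `"abc"` and chunk `"qwertya"`, only the trailing `"a"` could begin a boundary-spanning separator, so `"qwerty"` can be emitted immediately; the memory held between chunks is bounded by the separator length.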
You're completely correct that `lines` has unbounded memory usage, which is why it should only be used on data that you have control over (or with some extra mechanism to ensure memory protection). Instead, a function like `line` is probably a better interface for a function like this, since it allows for completely deterministic memory usage.
I will attempt to summarize:
- @snoyberg is pointing out that `splitOn` is not streaming: it builds up a list of results.
- @MaxGabriel is pointing out that, in terms of memory consumption, this should at worst double memory usage, because the original string is still in memory.
It seems to me that
- to avoid unbounded memory usage this would need to ask for a text of bounded length.
- this function should avoid the doubling of memory and send results back immediately.
@gregwebs I'm not positive that's what @snoyberg was saying, but I don't think that `Data.Text.splitOn` doubles memory usage. Its implementation takes in a `Text` value (backed by an `Array`) to be split up. Then, instead of creating new arrays for the split-up substrings, it reuses the original array with different indices for the beginning and end of each piece. I think that's what this line from the Data.Text haddocks is getting at:
Splitting functions in this library do not perform character-wise copies to create substrings; they just construct new Texts that are slices of the original.
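For illustration, here is what `splitOn` produces on a small input (the sharing of the underlying array isn't visible in the output, but per the haddocks quoted above, each result is a slice into the original buffer rather than a copy):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

main :: IO ()
main =
  -- Each piece below is a slice (offset + length) into the array
  -- backing "aXXbXXc", not a freshly copied buffer.
  print (T.splitOn "XX" "aXXbXXc")  -- ["a","b","c"]
```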
To clarify my thoughts: the currently proposed API necessarily collects multiple chunks into memory until it finds a boundary. This defeats the streaming nature of conduit. Instead, I'd prefer an API that provides an individual stream for each piece of text between separators. This is very similar to the difference between the `lines` and `line` functions in conduit-combinators.
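The shape of that interface difference can be sketched roughly as follows (signatures simplified and assumed for illustration, not copied from conduit-combinators):

```haskell
-- "lines"-style: materializes each piece as a single Text value, so a
-- piece that never hits a separator must be buffered in its entirety.
lines :: Monad m => ConduitT Text Text m ()

-- "line"-style: wraps an inner consumer that streams one piece's
-- chunks, so the caller decides how much of each piece to retain,
-- keeping memory usage deterministic.
line :: Monad m => ConduitT Text o m r -> ConduitT Text o m r
```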
@MaxGabriel thanks for pointing to the docs. I think the worst case could still be greater than 2x memory usage, though: if the separator matches at every other character, splitting creates a lot of new `Text` objects, each with its own constructor and offset/length fields on top of the shared array.