
Allow lazy materialization of iterated CSV.Chunks

Open quinnj opened this issue 3 years ago • 2 comments

While discussing out-of-core data processing with @bkamins, we realized it's currently a bit awkward to work with CSV.Chunks. For context, CSV.Chunks is currently structured to (a usage sketch follows this list):

  • parse the header and initial data positions for each chunk in the CSV.Chunks constructor
  • when CSV.Chunks is iterated, pass the CSV.Context plus byte position to CSV.File for parsing and return the resulting CSV.File
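
Roughly, the current flow looks like this (a minimal sketch; large.csv is a placeholder path and process stands in for user code, neither is part of CSV.jl):

```julia
using CSV

# Each iteration eagerly parses one chunk and yields a fully materialized CSV.File.
chunks = CSV.Chunks("large.csv"; ntasks = 8)
for file in chunks
    process(file)   # placeholder for whatever the user does with each CSV.File
end
```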

That makes it really difficult, if not impossible, to spawn the chunk parsing to a separate thread or remote process, since the parsing is baked into the call to iterate. It would be more helpful to structure it similarly to Tables.partitioner(f, iter), where f(x) generates a LazyTable and the table isn't materialized until Tables.columns or Tables.rows is called. So the API for CSV.Chunks would become (sketched after this list):

  • chunks = CSV.Chunks(...)
  • x, st = iterate(chunks)
  • file = CSV.materialize(x) where materialize takes x, which would be some kind of CSV.Chunk object that noted the CSV.Context and starting byte position, and materialize would call the actual CSV.File parsing
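
A rough sketch of what that could enable, assuming the hypothetical CSV.Chunk type and CSV.materialize function described above (neither exists in CSV.jl today) and a placeholder file path:

```julia
using CSV

chunks = CSV.Chunks("large.csv"; ntasks = 8)

# Because iterate would only hand back a lightweight CSV.Chunk (context + byte
# position), the expensive parse could be spawned onto worker threads instead
# of happening inside iterate itself.
tasks = [Threads.@spawn CSV.materialize(chunk) for chunk in chunks]
files = fetch.(tasks)   # each fetch returns a parsed CSV.File
```

The point is that iteration becomes cheap and the caller decides where and when the actual parsing runs.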

quinnj · Jan 24 '22 17:01

So one thing we should consider is whether we should make a breaking change to CSV.Chunks (where iteration returns a CSV.Chunk instead of the current CSV.File and requires an extra materialization call), introduce a new chunk iterator type, or add a keyword argument that controls whether a CSV.Chunk or a CSV.File is returned when iterating.
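
For illustration only, the keyword-argument option might look roughly like this; the lazy keyword and CSV.Chunk are hypothetical, not existing CSV.jl API:

```julia
# Hypothetical: lazy = true makes iteration yield a CSV.Chunk,
# while lazy = false (the default) keeps yielding a parsed CSV.File as today.
chunks = CSV.Chunks("large.csv"; ntasks = 8, lazy = true)
x, st  = iterate(chunks)
file   = CSV.materialize(x)   # explicit, user-controlled parse
```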

quinnj · Jan 24 '22 17:01

The question is whether it is possible to make this lazy. What I mean is that it may be the case (to be confirmed) that you actually have to parse the data to determine where the "byte position" for a given chunk should be (though maybe this parsing can be done fast?).
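
To illustrate why such a scan might still be fast: a chunk boundary can often be found by looking for the next record delimiter rather than fully parsing fields. This is only a generic sketch, not CSV.jl internals, and it is wrong whenever a quoted field contains an embedded newline, which is exactly the case that forces real parsing:

```julia
# Scan forward from pos to the byte just after the next newline.
# Cheap, but incorrect if a quoted field spans a newline around pos.
function next_record_boundary(buf::AbstractVector{UInt8}, pos::Int)
    i = pos
    while i <= length(buf) && buf[i] != UInt8('\n')
        i += 1
    end
    return min(i + 1, length(buf) + 1)   # first byte of the following record
end
```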

bkamins · Jan 24 '22 18:01