CSV.jl
Allow lazy materialization of iterated CSV.Chunks
While discussing out-of-core data processing with @bkamins, we realized it's currently a bit awkward to work with CSV.Chunks. For context, CSV.Chunks is currently structured to:
- parse the header and initial data positions for each chunk in the `CSV.Chunks` constructor
- when `CSV.Chunks` is iterated, the `CSV.Context` + byte position is passed to `CSV.File` for parsing, and a `CSV.File` is returned
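To make the current behavior concrete, here is a sketch of what iteration looks like today; the filename and `ntasks` value are illustrative:

```julia
using CSV

# Current (eager) behavior: each `iterate` call on CSV.Chunks fully
# parses the next chunk and hands back a CSV.File.
for file in CSV.Chunks("data.csv"; ntasks = 4)
    # `file` is already a materialized CSV.File here; the parsing
    # work happened inside `iterate` itself.
    # ... use `file` ...
end
```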
That means it's really difficult, if not impossible, to spawn the chunk parsing to a separate thread/remote process, since the parsing is baked into the call to `iterate`. It would be more helpful to structure it more like `Tables.partitioner(f, iter)`, where `f(x)` generates a `LazyTable` and the table isn't materialized until `Tables.columns` or `Tables.rows` is called. So the API for `CSV.Chunks` would become:
```julia
chunks = CSV.Chunks(...)
x, st = iterate(chunks)
file = CSV.materialize(x)
```

where `materialize` takes `x`, which would be some kind of `CSV.Chunk` object that notes the `CSV.Context` and starting byte position, and `materialize` would do the actual `CSV.File` parsing.
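For comparison, here is a sketch of the `Tables.partitioner` pattern this mirrors, where nothing is parsed until `Tables.columns` is called (`filenames` and `LazyTable` are illustrative stand-ins):

```julia
using Tables

# Lazily map a constructor over an iterator of inputs; each element
# handed out by iteration is a cheap handle, not a parsed table.
parts = Tables.partitioner(fn -> LazyTable(fn), filenames)

for lazy in parts
    cols = Tables.columns(lazy)  # materialization happens only here
    # ... work with `cols` ...
end
```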
So one thing we should consider is whether to make a breaking change to CSV.Chunks (where iteration returns a `CSV.Chunk` instead of the current `CSV.File`, and requires an extra materialization call), make a new chunk iterator type, or add a keyword argument that controls whether a `CSV.Chunk` or a `CSV.File` is returned when iterating.
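Whichever API shape we pick, the payoff of a lazy chunk object would be the ability to move parsing off the iterating task; a sketch, assuming the proposed `CSV.materialize` (not part of CSV.jl today):

```julia
using CSV

chunks = CSV.Chunks("data.csv"; ntasks = Threads.nthreads())

# Iteration just hands out lightweight CSV.Chunk handles, so each
# chunk can be parsed on its own thread (or shipped to a remote
# process) instead of inside `iterate`.
tasks = [Threads.@spawn CSV.materialize(chunk) for chunk in chunks]
files = fetch.(tasks)
```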
The question is whether this can be done lazily at all. What I mean is that it may be the case (to be confirmed) that you actually have to parse the data to determine where the "byte position" for a given chunk should be (though maybe that parsing can be done quickly?).