
Naive implementation of chunking reader

Open jpsamaroo opened this issue 6 years ago • 7 comments

Replaces #129

TODO:

  • [x] Wire up chunked reading to loadtable
  • [x] Split blocks across multiple workers
  • [x] Don't scale block size by file size
  • [x] Write blocks to disk as they're read
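The last two items above can be sketched roughly as follows (illustrative Julia only; the function name and the fixed block size are assumptions, not JuliaDB internals):

```julia
# Sketch of fixed-size chunking: compute byte ranges for each block,
# extending each block to the next newline so no row is split across
# blocks. The block size is fixed rather than scaled by file size.
function block_ranges(io::IO, blocksize::Integer)
    ranges = UnitRange{Int}[]
    seekend(io); len = position(io); seekstart(io)
    start = 0
    while start < len
        stop = min(start + blocksize, len)
        if stop < len
            seek(io, stop)
            readline(io)          # advance to the end of the current row
            stop = position(io)
        end
        push!(ranges, (start + 1):stop)
        start = stop
    end
    return ranges
end
```

Each range can then be parsed independently (and, for the last TODO item, written straight to disk before the next range is read).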

jpsamaroo avatar Jul 02 '19 13:07 jpsamaroo

Possible steps to solve this problem:

  1. Add a method csvread(io::IO,...) to TextParse
  2. Add a method csvread(ios::Vector{<:IO},...) to TextParse
  3. Use these methods in loadtable_serial and get tests to pass (this brings it back to the current master state, but using IO objects in place of files).
  4. Use BlockIO chunking with some heuristics to implement chunking in loadtable.
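Once the IO-accepting csvread methods exist, steps 3–4 might be driven by a loop along these lines (a hypothetical sketch: `load_chunks` and the byte-range handling are assumptions, with the kwargs following TextParse's csvread interface):

```julia
using TextParse   # provides csvread

# Hypothetical driver: parse each pre-computed byte range of a file
# as its own chunk by handing csvread an IOBuffer over that range.
function load_chunks(path::AbstractString, ranges; kwargs...)
    chunks = Any[]
    open(path) do io
        for (i, r) in enumerate(ranges)
            seek(io, first(r) - 1)
            buf = IOBuffer(read(io, length(r)))
            # only the first block carries the header row
            res = csvread(buf; header_exists = (i == 1), kwargs...)
            push!(chunks, res)
        end
    end
    return chunks
end
```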

shashi avatar Jul 26 '19 20:07 shashi

Per the hackathon discussion, step 1 is already handled; step 2 would probably be a good idea too. 3 and 4 to follow. Additionally, for step 4, we should probably provide a utility function (or just a slightly different kwarg to loadtable) to ensure that the ~nblocks argument~ size of each block doesn't increase with the size of the file, as it does right now. A second kwarg like blockmax might be in order to limit how large any individual block can be.
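For concreteness, the blockmax idea could amount to a heuristic like this (purely illustrative; `blockmax` is the kwarg proposed above, not existing API):

```julia
# Proposed heuristic: aim for roughly one block per worker, but cap
# each block at `blockmax` bytes so block size stops growing with
# file size. `blockmax` is the suggested kwarg, not an existing one.
function choose_blocksize(filesize::Integer, nworkers::Integer;
                          blockmax::Integer = 64 * 2^20)  # 64 MiB cap
    naive = cld(filesize, nworkers)   # one block per worker
    return min(naive, blockmax)
end

choose_blocksize(10 * 2^30, 4)   # a 10 GiB file: capped, not 2.5 GiB blocks
```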

jpsamaroo avatar Jul 27 '19 19:07 jpsamaroo

I almost forgot: I still need to actually implement incremental saving of read blocks to the output file when one is specified; otherwise we'll still read the whole CSV's data into memory before serializing it back out.

jpsamaroo avatar Dec 04 '19 04:12 jpsamaroo

Quick update for onlookers: the latest commit attempts to split individual files into blocks before calling _loadtable_serial so that each block can be saved to disk (and thus removed from memory) when output !== nothing, before moving to the next block. This was the main reason I picked up this work: to allow loading enormous single CSVs without having to "buy more RAM". Once this part is working, then this PR will be ready for review.
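The incremental save described here amounts to something like the following (a sketch under assumptions: `parse_block` is a stand-in for the real per-block parsing call, and the chunk file layout is invented):

```julia
using Serialization

# Sketch of incremental spilling: parse one block, serialize it to the
# output directory, keep only a file reference, and move on, so at most
# one block's worth of data is in memory at a time.
function load_and_spill(parse_block, ranges, output::AbstractString)
    mkpath(output)
    refs = String[]
    for (i, r) in enumerate(ranges)
        tbl = parse_block(r, i == 1)          # second arg: has header?
        chunkfile = joinpath(output, "chunk_$i.jls")
        serialize(chunkfile, tbl)             # write the block to disk
        push!(refs, chunkfile)                # drop the in-memory table
    end
    return refs
end
```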

jpsamaroo avatar Dec 04 '19 16:12 jpsamaroo

@tanmaykm @shashi done and ready for review!

jpsamaroo avatar Dec 06 '19 13:12 jpsamaroo

Bump, anyone up for reviewing this?

jpsamaroo avatar Jan 29 '20 20:01 jpsamaroo

Looks like some change in TextParse 1.0 broke the ability to pass nrows=1 during header parsing (the tests pass locally with a pre-1.0 TextParse).

EDIT: nrows was renamed to free that kwarg for what we actually need from TextParse (the previous nrows didn't do what I expected; it's just an optimization mechanism).

jpsamaroo avatar Feb 04 '20 14:02 jpsamaroo