vaex
feat: Out of core CSV support using Apache Arrow CSV reader (fast 🔥!)
@JovanVeljanoski we need to discuss this, how we expose this. Questions
- Do we always want lazy CSV reading? Or, if the file is below say 20% of available RAM, load it into memory directly? Or special methods, e.g. vaex.io.open_csv_lazy and vaex.io.open_csv_memory (better to be explicit?).
- How do we expose the pandas route?
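To make the first question concrete, the size-based heuristic could look like the sketch below. This is purely illustrative, not the vaex API: the function name `choose_csv_strategy` and the `available_ram_bytes` parameter are assumptions for the example.

```python
import os


def choose_csv_strategy(path, available_ram_bytes, threshold=0.20):
    """Illustrative sketch (not vaex API): pick lazy vs in-memory CSV reading.

    If the file is smaller than `threshold` (e.g. 20%) of the available RAM,
    load it into memory directly; otherwise read it lazily / out of core.
    """
    size = os.path.getsize(path)
    return "memory" if size < threshold * available_ram_bytes else "lazy"
```

An explicit pair of methods could then simply call this with `threshold=1.0` or `threshold=0.0` to force one behavior or the other.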
I want to move some input/output functions from __init__.py into io.py, let me know if you like it.
Stats on a 70GB CSV file (on nyx, 64 cores AMD ryzen):
- Opening: 4-6 seconds (fast row count estimate)
$ time py.test tests/csv_test.py -v -k test_large_csv_count_array_lengths
- Reading a single column: 9-10 seconds
$ time py.test tests/csv_test.py -v -k test_large_csv_count_array_lengths
TODO
- [x] Expose all Arrow options https://arrow.apache.org/docs/python/generated/pyarrow.csv.open_csv.html#pyarrow.csv.open_csv
- [x] clean up tests
Hi @maartenbreddels, sorry, I am only now seeing this PR. A question out of curiosity: is the Apache Arrow out-of-core CSV reader able to work with zipped CSV files? Having CSV files zipped is quite common (at least, pandas reads them, transparently unzipping them I guess). Does Apache Arrow do the same? Not understanding memory mapping in depth, I would guess that zipping makes things more complex, does it?
In this PR we will also try to support reading gzipped CSV files. Here are some relevant threads and comments:
- https://github.com/vaexio/vaex/issues/2070
- https://github.com/vaexio/vaex/issues/1879 (relevant discussion)
- Python package / main (macOS-latest, 3.6) (pull_request) — this one hangs quite regularly