feat: Out of core CSV support using Apache Arrow CSV reader (fast 🔥!)

Open maartenbreddels opened this issue 4 years ago • 2 comments

@JovanVeljanoski we need to discuss how we expose this. Questions:

  • Do we always want lazy CSV reading? Or, if the file is below, say, 20% of available RAM, do we load it into memory directly? Or do we add special methods, e.g. vaex.io.open_csv_lazy and vaex.io.open_csv_memory (better to be explicit?).
  • How do we expose the pandas route?
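To make the first question concrete, here is a minimal sketch of the size-based rule. The function names, the "memory"/"lazy" labels, and the way available RAM is passed in are all hypothetical, chosen for illustration; nothing here is existing vaex API, and the 20% default only mirrors the figure floated above.

```python
import os


def choose_strategy_for_size(file_size, available_ram_bytes, threshold=0.2):
    """Return "memory" for files small relative to RAM, otherwise "lazy".

    Hypothetical helper: `threshold` is the fraction of available RAM
    below which loading the whole file is considered acceptable.
    """
    if file_size < threshold * available_ram_bytes:
        return "memory"
    return "lazy"


def choose_csv_strategy(path, available_ram_bytes, threshold=0.2):
    """Same rule, using the on-disk size of the file at `path`."""
    return choose_strategy_for_size(
        os.path.getsize(path), available_ram_bytes, threshold
    )
```

An implicit rule like this keeps a single entry point, but the same file would behave differently on machines with different amounts of RAM, which is an argument for the explicit pair of methods.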

I want to move some input/output functions from __init__.py into io.py; let me know if you like it.

Stats on a 70 GB CSV file (on nyx, 64 cores AMD Ryzen):

  • Opening: 4-6 seconds (fast row count estimate) $ time py.test tests/csv_test.py -v -k test_large_csv_count_array_lengths
  • Reading a single column: 9-10 seconds $ time py.test tests/csv_test.py -v -k test_large_csv_count_array_lengths
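For readers wondering how a row count can be estimated without scanning all 70 GB, one plausible approach is to sample the start of the file and extrapolate from the average line length. This is only an illustrative sketch, not necessarily what this PR does; the function name and the `sample_bytes` parameter are assumptions.

```python
import os


def estimate_row_count(path, sample_bytes=1 << 20):
    """Estimate the number of lines (header included) in a text file.

    Reads at most `sample_bytes` from the start of the file, measures the
    average line length there, and scales it up to the full file size.
    """
    file_size = os.path.getsize(path)
    if file_size == 0:
        return 0
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    newlines = sample.count(b"\n")
    if len(sample) == file_size:
        # The sample covers the whole file, so the count is exact.
        return newlines + (0 if sample.endswith(b"\n") else 1)
    if newlines == 0:
        # No line break inside the sample: no basis for extrapolation.
        return 1
    average_line_length = len(sample) / newlines
    return round(file_size / average_line_length)
```

The estimate is only as good as the assumption that early rows are representative; files whose line lengths drift (for example, growing IDs) will be off by a corresponding factor.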

TODO

  • [x] Expose all Arrow options https://arrow.apache.org/docs/python/generated/pyarrow.csv.open_csv.html#pyarrow.csv.open_csv
  • [x] clean up tests

maartenbreddels, Oct 28 '20 20:10

Hi @maartenbreddels. Sorry, I am only seeing this PR now. A question out of curiosity: is the Apache Arrow out-of-core CSV reader able to work with zipped CSV files? Keeping CSV files zipped is common (pandas, at least, reads them, transparently unzipping them, I guess). Does Apache Arrow do the same? I do not understand memory mapping in depth, but I would guess that zipping makes things more complex. Does it?

yohplala, Sep 16 '22 11:09

In this PR we will also try to support reading gzipped CSV files. Here are some relevant threads and comments:

  • https://github.com/vaexio/vaex/issues/2070
  • https://github.com/vaexio/vaex/issues/1879 (relevant discussion)
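On the question of why compression complicates things: a gzip stream has to be decompressed from front to back, so the file cannot simply be memory mapped and sliced at arbitrary offsets. Chunked reading is still possible, though. The sketch below shows the idea using only the Python standard library; it is not the Arrow reader and not what this PR implements, and the function name and `chunk_rows` parameter are made up for illustration.

```python
import csv
import gzip


def iter_gzip_csv_chunks(path, chunk_rows=100_000):
    """Yield (header, rows) pairs from a gzipped CSV, one chunk at a time.

    The file is decompressed sequentially, so memory use is bounded by
    `chunk_rows`, but there is no random access into the data.
    """
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)  # None for an empty file
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_rows:
                yield header, chunk
                chunk = []
        if chunk:
            # Emit the final, possibly shorter, chunk.
            yield header, chunk
```

Reading a single column late in the file still means decompressing everything before it, which is the main cost compared with an uncompressed, memory-mapped file.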

JovanVeljanoski, Sep 20 '22 08:09

  • Python package / main (macOS-latest, 3.6) (pull_request)

This one hangs quite regularly

maartenbreddels, Sep 23 '22 12:09