red-datasets

Improve loading/parsing speed in 'arrowable' environment

Open heronshoes opened this issue 1 year ago • 2 comments

Reading a large dataset from its source for the first time takes a long time.

I created a fresh Docker environment for my dataframe example and found that pulling the large nycflights13 dataset was very time-consuming.

With red-datasets-arrow, the cache is stored as an Arrow file, but the first load is still slow because the data is loaded and parsed with Ruby's CSV.

Would it be possible for an environment extended with red-datasets-arrow to use Arrow for loading and parsing?

heronshoes, Mar 31 '23 22:03

How about adding Datasets::CSVParser, similar to Datasets::ZipExtractor, and then extending Datasets::CSVParser in red-datasets-arrow?

kou, Apr 01 '23 13:04

Thanks, @kou.

I will try adding Datasets::CSVParser first!

heronshoes, Apr 02 '23 01:04
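
For readers following the thread, the proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual red-datasets code: Datasets::CSVParser did not exist when the issue was opened, so the `Sketch::CSVParser` class, its `each_row` method, and the `Sketch::FastCSVParser` module are all hypothetical names. The default parser uses Ruby's CSV, and an add-on (standing in for red-datasets-arrow) overrides it with `Module#prepend`. The override only delegates to the default here so that the sketch runs without extra gems; an Arrow-based reader would take its place.

```ruby
require "csv"
require "tempfile"

module Sketch
  # Hypothetical stand-in for the proposed Datasets::CSVParser.
  class CSVParser
    def initialize(path)
      @path = path
    end

    # Default implementation: parse with Ruby's CSV, one Hash per row.
    def each_row
      return to_enum(__method__) unless block_given?

      CSV.foreach(@path, headers: true) do |row|
        yield row.to_h
      end
    end
  end

  # Hypothetical extension, as an add-on gem might provide it.
  module FastCSVParser
    def each_row(&block)
      return to_enum(__method__) unless block_given?

      # Assumption: an Arrow-enabled environment would read @path with
      # Arrow's CSV reader here. This sketch delegates to the default
      # implementation so that it stays runnable with the stdlib only.
      super(&block)
    end
  end
end

# Write a tiny CSV file to parse (made-up sample rows).
file = Tempfile.new(["flights", ".csv"])
file.write("year,carrier\n2013,UA\n2013,AA\n")
file.close

rows_before = Sketch::CSVParser.new(file.path).each_row.to_a

# Loading the add-on swaps in the faster parser for every caller.
Sketch::CSVParser.prepend(Sketch::FastCSVParser)
rows_after = Sketch::CSVParser.new(file.path).each_row.to_a
```

Because callers only depend on `each_row`, the add-on can change how the file is parsed without touching any dataset class, which is the same separation the comment above attributes to Datasets::ZipExtractor.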