
Handle giant CSV files

Open · simonw opened this issue 8 years ago · 4 comments

It would be great if csvs-to-sqlite could quite happily churn through a multi-GB CSV file without using up all available RAM on the machine.

https://plot.ly/python/big-data-analytics-with-pandas-and-sqlite/ has some good patterns for dealing with this sort of thing, and links to a good 3.9GB CSV file for trying this out.
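
Roughly, the pattern in that article is to read the CSV in chunks and append each chunk to SQLite, so only one chunk is in memory at a time. A minimal sketch of that pattern (the file, database, and table names are placeholders, and this is not what csvs-to-sqlite currently does):

```python
import sqlite3

import pandas as pd

# Stream the CSV in fixed-size chunks so only one chunk is held in RAM at a time.
conn = sqlite3.connect("big.db")
for chunk in pd.read_csv("big.csv", chunksize=100_000):
    chunk.to_sql("requests", conn, if_exists="append", index=False)
conn.close()
```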

simonw avatar Nov 28 '17 23:11 simonw

Bonus: would be great if this could read from compressed CSVs. I have a 1.6GB 311_Service_Requests_from_2010_to_Present.csv.gz file in my Downloads directory right now which is 10GB uncompressed.

Maybe the neatest solution for this is to pipe the output of gunzip directly into csvs-to-sqlite's stdin.

simonw avatar Nov 29 '17 19:11 simonw

pd.read_csv has a compression keyword argument, and by default it will try to infer the compression from common extensions (.gz included). So it looks like reading from .csv.gz already works, thanks to pandas' magic.
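
For example, the gzipped file mentioned above should stream straight from the .gz with no explicit decompression step (the chunk size is arbitrary and is only there to keep memory bounded):

```python
import pandas as pd

# compression="infer" is the default, so the .gz extension is handled transparently.
for chunk in pd.read_csv(
    "311_Service_Requests_from_2010_to_Present.csv.gz", chunksize=100_000
):
    ...  # process each chunk
```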

stefanw avatar Jan 16 '18 13:01 stefanw

Is there any workaround, e.g. splitting a .csv file into smaller ones but still getting a single sqlite db somehow?
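
One rough workaround along those lines, outside of csvs-to-sqlite itself: append each piece to the same table with pandas. A sketch (the paths and table name are made up):

```python
import glob
import sqlite3

import pandas as pd

# Append every split CSV to the same table in a single database.
conn = sqlite3.connect("combined.db")
for path in sorted(glob.glob("parts/*.csv")):
    pd.read_csv(path).to_sql("requests", conn, if_exists="append", index=False)
conn.close()
```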

madnight avatar Aug 01 '18 00:08 madnight

I've written a script for my own use that doesn't use pandas. It loads rows into sqlite one by one instead.

It is specific to my use case but could be made generic pretty easily. https://github.com/tannewt/campaign-funds.org/blob/main/update.py

One of the biggest speedups was moving from dateparser to dateutil.
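
A generic row-by-row sketch of that approach, not the linked script itself (the file, table, and column names are invented), using dateutil for the date parsing:

```python
import csv
import sqlite3

from dateutil import parser  # noticeably faster than dateparser here

# Insert rows in modest batches so memory use stays flat regardless of file size.
conn = sqlite3.connect("funds.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS donations (donor TEXT, amount REAL, received TEXT)"
)

with open("donations.csv", newline="") as f:
    batch = []
    for row in csv.DictReader(f):
        received = parser.parse(row["date"]).isoformat()
        batch.append((row["donor"], float(row["amount"]), received))
        if len(batch) >= 10_000:
            conn.executemany("INSERT INTO donations VALUES (?, ?, ?)", batch)
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO donations VALUES (?, ?, ?)", batch)

conn.commit()
conn.close()
```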

I also recommend using btrfs if you deal with many large CSV and JSON files because btrfs will transparently compress and decompress the files.

tannewt avatar Aug 18 '21 06:08 tannewt