csvs-to-sqlite
Handle giant CSV files
It would be great if csvs-to-sqlite could happily churn through a multi-GB CSV file without using up all available RAM on the machine.
https://plot.ly/python/big-data-analytics-with-pandas-and-sqlite/ has some good patterns for dealing with this sort of thing, and links to a good 3.9GB CSV file for trying this out.
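For reference, the core pattern that article describes looks something like the sketch below: read the CSV in fixed-size chunks and append each chunk to SQLite, so only one chunk is ever held in memory. The file, database and table names here are placeholders, not anything csvs-to-sqlite does today.

```python
import sqlite3

import pandas as pd

# Read the CSV in fixed-size chunks so only one chunk is held in memory,
# appending each chunk to the same SQLite table as it is read.
# "giant.csv", "giant.db" and "giant" are placeholder names.
conn = sqlite3.connect("giant.db")
for chunk in pd.read_csv("giant.csv", chunksize=100_000):
    chunk.to_sql("giant", conn, if_exists="append", index=False)
conn.close()
```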
Bonus: it would be great if this could read from compressed CSVs. I have a 1.6GB 311_Service_Requests_from_2010_to_Present.csv.gz file in my Downloads directory right now, which is 10GB uncompressed.
Maybe the neatest solution for this is to pipe the output of gunzip directly into csvs-to-sqlite's stdin.
pd.read_csv has a compression keyword argument, and by default it will try to infer the compression from popular extensions (.gz included). So it looks like reading from .csv.gz already works, thanks to the pandas magic.
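A quick sketch of that behaviour, assuming the file from the earlier comment is sitting in the working directory; combined with chunksize, the 10GB uncompressed file never has to fit in RAM:

```python
import pandas as pd

# compression defaults to "infer", so the .gz extension alone is enough
# for pandas to decompress on the fly; chunksize keeps memory use bounded.
chunks = pd.read_csv(
    "311_Service_Requests_from_2010_to_Present.csv.gz", chunksize=100_000
)
print(next(chunks).shape)  # shape of the first 100,000-row chunk
```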
Is there any workaround, e.g. splitting a .csv file into smaller ones but still ending up with a single SQLite db somehow?
I've written a script for my own use that doesn't use pandas; it loads rows into SQLite one at a time instead.
It's specific to my use case but could be made generic pretty easily: https://github.com/tannewt/campaign-funds.org/blob/main/update.py
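A generic sketch of the same row-by-row approach (not the linked script itself; the batch size, quoting and lack of column types are all simplifications):

```python
import csv
import sqlite3


def load_csv(csv_path: str, db_path: str, table: str, batch_size: int = 10_000) -> None:
    """Stream rows from a CSV into SQLite without loading the whole file."""
    conn = sqlite3.connect(db_path)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        # Create an untyped table from the header row (SQLite allows this).
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) >= batch_size:
                conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', batch)
                batch.clear()
        if batch:
            conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', batch)
    conn.commit()
    conn.close()
```

Batching the INSERTs with executemany on a single connection keeps the per-row overhead low while memory use stays flat regardless of file size.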
One of the biggest speedups was moving from dateparser to dateutil.
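For anyone making the same swap, it's roughly a drop-in change along these lines (a sketch; the sample date string is just an illustration):

```python
from dateutil import parser

# dateutil's parser handles the fixed formats typically found in CSV
# exports without dateparser's slower multi-language heuristics.
print(parser.parse("01/15/2023 10:30:00 AM"))  # -> 2023-01-15 10:30:00
```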
I also recommend using btrfs if you deal with many large CSV and JSON files, because btrfs will transparently compress and decompress the files.