csvs-to-sqlite

Optionally maintain checksums of CSV files for faster updates

Open dkaoster opened this issue 2 years ago • 0 comments

I wanted to see if there is interest in a patch that speeds up our workflow significantly, or if there are ideas for improving on such a feature. If this is out of scope for this project, I'm happy to continue maintaining my fork.

Use Case

We currently maintain a folder of >200 CSV files totaling a few hundred megabytes, and have a CI step that builds these CSVs into a SQLite database. The CSV files get updated 2-3 times a day, but each update changes only a small part of them. Currently, running csvs-to-sqlite with the --replace-tables flag takes roughly 6-7 minutes, which is too long for our use case.

Solution

Add a --update-tables flag that maintains a checksum of each CSV file in a table called .csvs-meta (happy to change this name or make it configurable), and only reads a CSV and loads its dataframe if that file's checksum has changed.
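
To make the idea concrete, here is a minimal sketch of the checksum check using only the Python standard library. It is not the code from the fork: the function names, the `path`/`checksum` column layout, and the choice of SHA-256 are all assumptions made for illustration. Only the `.csvs-meta` table name comes from the proposal above.

```python
import hashlib
import sqlite3
from pathlib import Path

# Table name taken from the proposal; the column layout is an assumption.
META_TABLE = ".csvs-meta"


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_meta_table(conn):
    """Create the checksum table if it does not exist yet."""
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{META_TABLE}" '
        "(path TEXT PRIMARY KEY, checksum TEXT NOT NULL)"
    )


def changed_csvs(conn, csv_paths):
    """Map each new or modified CSV path (as str) to its current checksum."""
    ensure_meta_table(conn)
    stored = dict(conn.execute(f'SELECT path, checksum FROM "{META_TABLE}"'))
    changed = {}
    for path in csv_paths:
        checksum = file_checksum(path)
        if stored.get(str(path)) != checksum:
            changed[str(path)] = checksum
    return changed


def record_checksum(conn, path, checksum):
    """Store the checksum once the table for this CSV has been rebuilt."""
    ensure_meta_table(conn)
    conn.execute(
        f'INSERT OR REPLACE INTO "{META_TABLE}" (path, checksum) VALUES (?, ?)',
        (str(path), checksum),
    )
    conn.commit()
```

A caller would loop over `changed_csvs(conn, paths)`, rebuild only those tables, and call `record_checksum` after each successful load, so that a failed load is retried on the next run. Hashing still reads every file, but it skips the CSV parsing and dataframe loading for unchanged files.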

Forked Version Here

dkaoster, Jan 21 '22 08:01