pgeocode Faster dataset loading

Open rth opened this issue 5 years ago • 1 comments

Currently we load datasets with pd.read_csv from gzipped CSV format. Loading could be considerably faster if the data were converted to parquet format and read with pd.read_parquet (this might also reduce the size of downloads when using e.g. snappy compression).

The limitations of this approach are that the datasets would need to be hosted somewhere and a new dependency (pyarrow) would need to be added. I'm not sure that it would be worth it.
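
For illustration, a minimal sketch of what the conversion could look like. This is not code from pgeocode: the helper names `convert_csv_to_parquet` and `time_load`, the file names, and the `postal_code` column are assumptions made for the example, and a parquet engine such as pyarrow has to be installed for `to_parquet` and `read_parquet` to work.

```python
import time

import pandas as pd


def convert_csv_to_parquet(csv_path, parquet_path, dtype=None, compression="snappy"):
    """Read a (possibly gzipped) CSV file and write it back out as parquet.

    Hypothetical helper, not part of pgeocode.
    """
    # pandas infers gzip compression from a ".gz" file extension.
    # Passing e.g. dtype={"postal_code": str} keeps leading zeros in codes.
    df = pd.read_csv(csv_path, dtype=dtype)
    # Raises ImportError if no parquet engine (pyarrow or fastparquet) is installed.
    df.to_parquet(parquet_path, compression=compression, index=False)
    return df


def time_load(loader, path, **kwargs):
    """Return the loaded DataFrame and the wall-clock seconds the load took."""
    start = time.perf_counter()
    df = loader(path, **kwargs)
    return df, time.perf_counter() - start
```

With a dataset on disk, `time_load(pd.read_csv, "data.csv.gz")` and `time_load(pd.read_parquet, "data.parquet")` could then be compared to check whether the gain justifies the extra dependency.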

Apr 07 '19 08:04 rth

Also for caching, it might make sense to use pickle instead of CSV. Though then it's less portable across Python versions.
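
A rough sketch of how such a pickle cache could work, assuming a local CSV file and a cache directory. The function name `load_with_pickle_cache` and the `.pkl` naming scheme are made up for the example and are not pgeocode's actual caching code. To soften the portability concern, the sketch falls back to the CSV whenever the pickle cannot be read.

```python
from pathlib import Path

import pandas as pd


def load_with_pickle_cache(csv_path, cache_dir, **read_csv_kwargs):
    """Load a CSV, keeping a pickled copy next to it for faster reloads.

    Hypothetical helper, not part of pgeocode.
    """
    csv_path = Path(csv_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path = cache_dir / (csv_path.name + ".pkl")

    # Use the cache only if it is at least as recent as the source file.
    if cache_path.exists() and cache_path.stat().st_mtime >= csv_path.stat().st_mtime:
        try:
            return pd.read_pickle(cache_path)
        except Exception:
            # The pickle may come from an incompatible Python or pandas
            # version, or be corrupted: ignore it and rebuild from the CSV.
            pass

    df = pd.read_csv(csv_path, **read_csv_kwargs)
    df.to_pickle(cache_path)
    return df
```

Since unpickling can execute arbitrary code, a cache like this should only ever read files the library wrote itself.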

Mar 20 '20 10:03 rth

Closing as not critical, unless someone feels it's too slow currently.

Dec 13 '22 22:12 rth
