pgeocode
Faster dataset loading
Currently we load datasets with pd.read_csv from gzipped CSV files. Loading should be considerably faster if the data were converted to Parquet format and read with pd.read_parquet (this might also reduce the size of downloads when using e.g. snappy compression).
The limitations of this approach are that the converted datasets would need to be hosted somewhere and a new dependency (pyarrow) would need to be added. I'm not sure that it would be worth it.
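For reference, a minimal sketch of what the conversion could look like, assuming pyarrow is installed. The helper names (csv_gz_to_parquet, load_parquet), the file paths and the postal_code column name are hypothetical and only for illustration; this is not how pgeocode currently loads data.

```python
import pandas as pd


def csv_gz_to_parquet(csv_path, parquet_path):
    """One-off conversion of a gzipped CSV dataset to Parquet (hypothetical helper)."""
    # pandas infers gzip compression from the ".gz" extension.
    # Postal codes are read as strings so leading zeros are preserved.
    df = pd.read_csv(csv_path, dtype={"postal_code": str})
    # Requires the pyarrow dependency mentioned above.
    df.to_parquet(parquet_path, engine="pyarrow", compression="snappy", index=False)
    return df


def load_parquet(parquet_path):
    """Load the converted dataset; column dtypes are stored in the file."""
    return pd.read_parquet(parquet_path, engine="pyarrow")
```

Whether this is actually faster for the dataset sizes involved would need to be measured before adding the dependency.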
Also, for caching, it might make sense to use pickle instead of CSV, though the cache would then be less portable across Python versions.
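A rough sketch of the pickle cache idea, with a fallback to the CSV when the cached file is missing or cannot be read (for example after a Python or pandas upgrade). The function name load_with_pickle_cache, the paths and the postal_code column name are assumptions for illustration, not existing pgeocode code.

```python
from pathlib import Path

import pandas as pd


def load_with_pickle_cache(csv_path, cache_path):
    """Load a dataset from a local pickle cache, rebuilding it from CSV if needed."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        try:
            return pd.read_pickle(cache_path)
        except Exception:
            # The cache may be corrupt or written by an incompatible
            # Python/pandas version; fall through and rebuild it.
            pass
    # Read postal codes as strings so leading zeros are preserved.
    df = pd.read_csv(csv_path, dtype={"postal_code": str})
    df.to_pickle(cache_path)
    return df
```

Because the pickle is only ever written and read locally, the portability concern is mostly handled by rebuilding the cache on failure.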
Closing as not critical, unless someone feels it's too slow currently.