gpq
About compression: is it normal for it to be so low?
Hi, I'm testing gpq on the official administrative boundaries of Italy. The source file is this zip file: https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip
It contains a folder structure with shapefiles in it. I am running the tests on the file Limiti01012023/Com01012023/Com01012023_WGS84.shp:
- I convert it to GeoJSON using ogr2ogr;
- from this GeoJSON I create a gzip-compressed GeoParquet file, which is 70 MB;
- from the same GeoJSON I create an uncompressed GeoParquet file, which is 76 MB.
They are almost equal in size. Some notes:
- if I gzip the uncompressed parquet file I get a 57 MB file
- if I create a sozip shp version of the source file, I get a 59 MB file
I know these outputs are not directly comparable, but the compression in the gpq output seems very limited to me. Is this normal? Am I doing something wrong?
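To put numbers on this, here is a small sketch that turns the sizes reported above into percentage savings relative to the 76 MB uncompressed GeoParquet file. The helper name `saving` is hypothetical, and the inputs are the rounded MB values from this post, so the results are approximate.

```shell
# Hypothetical helper: percentage saved by a compressed file ($1, in MB)
# relative to a reference file ($2, in MB).
saving() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (1 - a / b) * 100 }'
}

saving 70 76   # gzip-compressed GeoParquet vs uncompressed: prints 7.9
saving 57 76   # uncompressed GeoParquet gzipped as a whole: prints 25.0
```

So the gzip option in gpq saves roughly 8%, while gzipping the whole uncompressed file saves about 25%.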
Below are the commands I used for all the tests.
Thank you
wget -O file.zip "https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip"
unzip -o file.zip -d .
ogr2ogr -f GeoJSON -t_srs EPSG:4326 comuni.geojson Limiti01012023/Com01012023/Com01012023_WGS84.shp -lco "RFC7946=YES"
gpq convert --compression="gzip" --max 1000 --from="geojson" comuni.geojson comuni_compressed.parquet
gpq convert --compression="uncompressed" --max 1000 --from="geojson" comuni.geojson comuni_uncompressed.parquet
ogr2ogr -t_srs EPSG:4326 Com01012023_WGS84.shp.zip Limiti01012023/Com01012023/Com01012023_WGS84.shp
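The whole-file gzip comparison from the notes is not among the commands listed. A minimal sketch of how it could be reproduced, assuming the standard gzip command-line tool with its default compression level; the helper name `gzip_copy` is hypothetical, and the exact options used for the 57 MB figure are not stated in this post.

```shell
# Hypothetical helper: gzip a file into "<name>.gz" without touching the
# original, then print the size of the compressed copy in bytes.
gzip_copy() {
  # -c writes the compressed stream to stdout, so the input file is kept
  gzip -c "$1" > "$1.gz"
  # tr strips the padding that some wc implementations add
  wc -c < "$1.gz" | tr -d ' '
}
```

Calling `gzip_copy comuni_uncompressed.parquet` would write `comuni_uncompressed.parquet.gz` next to the original and print its size, which can then be compared with the size of `comuni_compressed.parquet`.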