RoadDetections

Consider partitioning countries at the file level rather than marking countries in a TSV row

Open marklit opened this issue 2 years ago • 0 comments

Oceania-Full.zip is currently 282 MB. If its contents were partitioned into one GeoJSON file per country, with each file sorted, the ZIP would be 244 MB instead. People could download the ZIP faster, and they would use less disk space by extracting only the countries they're interested in. The GeoJSON would also open right away in QGIS and other GIS software without first needing to ETL the TSV.
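For reference, here is a minimal sketch of how the per-country files could be produced. It assumes each TSV row is a 3-letter country code, a tab, and then one GeoJSON feature; the function name `partition_by_country` and the input file name in the usage line are hypothetical.

```shell
# Hypothetical helper: split a TSV of "<country code><TAB><GeoJSON feature>"
# rows into one file per country, named <code>.geojson, inside out_dir.
# Assumption: the country code is the first tab-separated column.
partition_by_country() {
    tsv="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    awk -F'\t' -v dir="$out_dir" '{
        code = $1
        sub(/^[^\t]*\t/, "")                  # drop the country column, keep the rest of the row
        print > (dir "/" code ".geojson")     # append the feature to that country'"'"'s file
    }' "$tsv"
}
```

Usage would look something like `partition_by_country Oceania-Full.tsv by_country`, where the TSV name is a guess at what the ZIP contains. Each output file holds one feature per line, which is what makes the line-based `sort` below possible.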

$ vi a.sh
sort AUS.geojson > AUS.sorted.geojson
sort NZL.geojson > NZL.sorted.geojson
sort PNG.geojson > PNG.sorted.geojson
sort VUT.geojson > VUT.sorted.geojson
sort FJI.geojson > FJI.sorted.geojson
sort SLB.geojson > SLB.sorted.geojson
sort TON.geojson > TON.sorted.geojson
sort WSM.geojson > WSM.sorted.geojson
sort FSM.geojson > FSM.sorted.geojson
sort KIR.geojson > KIR.sorted.geojson
sort PLW.geojson > PLW.sorted.geojson
sort MHL.geojson > MHL.sorted.geojson
sort TUV.geojson > TUV.sorted.geojson
sort NRU.geojson > NRU.sorted.geojson
$ cat a.sh | xargs -P4 -I% bash -xc '%'
$ zip -9 Oceania.sorted.zip \
    AUS.sorted.geojson \
    NZL.sorted.geojson \
    PNG.sorted.geojson \
    VUT.sorted.geojson \
    FJI.sorted.geojson \
    SLB.sorted.geojson \
    TON.sorted.geojson \
    WSM.sorted.geojson \
    FSM.sorted.geojson \
    KIR.sorted.geojson \
    PLW.sorted.geojson \
    MHL.sorted.geojson \
    TUV.sorted.geojson \
    NRU.sorted.geojson

$ unzip -l Oceania.sorted.zip
Archive:  Oceania.sorted.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
1071521607  2023-04-10 18:58   AUS.sorted.geojson
185466598  2023-04-10 18:57   NZL.sorted.geojson
 28007237  2023-04-10 18:57   PNG.sorted.geojson
  6470562  2023-04-10 18:57   VUT.sorted.geojson
  5832797  2023-04-10 18:57   FJI.sorted.geojson
  4423195  2023-04-10 18:57   SLB.sorted.geojson
  1047604  2023-04-10 18:57   TON.sorted.geojson
  1066450  2023-04-10 18:57   WSM.sorted.geojson
   307308  2023-04-10 18:57   FSM.sorted.geojson
   190892  2023-04-10 18:57   KIR.sorted.geojson
   242639  2023-04-10 18:57   PLW.sorted.geojson
   119872  2023-04-10 18:57   MHL.sorted.geojson
    44300  2023-04-10 18:57   TUV.sorted.geojson
    38006  2023-04-10 18:57   NRU.sorted.geojson
---------                     -------
1304779067                     14 files
$ unzip Oceania.sorted.zip NZL.sorted.geojson
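
The fourteen `sort` lines in a.sh could also be generated rather than typed out. This is only a sketch: `sort_geojson_files` is a hypothetical name, and unlike the `xargs -P4` run above it sorts the files one at a time.

```shell
# Hypothetical helper: write a <name>.sorted.geojson next to every
# <name>.geojson in the given directory, using plain line-based sort.
sort_geojson_files() {
    dir="$1"
    for f in "$dir"/*.geojson; do
        case "$f" in
            *.sorted.geojson) continue ;;   # skip output from an earlier run
        esac
        [ -e "$f" ] || continue             # the glob matched nothing
        sort "$f" > "${f%.geojson}.sorted.geojson"
    done
}
```

Running `sort_geojson_files .` in the directory holding the per-country files would leave the same `*.sorted.geojson` names that the `zip -9` command expects.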

For some of the largest datasets, like Canada and Japan, the 3-letter country identifier is redundant, since every record in each of those ZIPs is for that one country.
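
A quick way to confirm that for any given file is to list the distinct values in the country column. This sketch assumes the country code is the first tab-separated column; `distinct_country_codes` is a hypothetical name.

```shell
# Hypothetical helper: print each distinct value of the first
# tab-separated column (assumed to be the country code), one per line.
distinct_country_codes() {
    cut -f1 "$1" | sort -u
}
```

If this prints a single line for a dataset's TSV, the country column carries no information there.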

marklit • Apr 10 '23 16:04