duckdb_spatial icon indicating copy to clipboard operation
duckdb_spatial copied to clipboard

OOM issue importing an 84 GB OSM file into DuckDB

Open marklit opened this issue 1 year ago • 12 comments

4 of the 5 layers will import without issue and they produced ~84 GB of DuckDB-formatted data. This 5th layer for lines causes an OOM. This was run on a machine with 64 GB of RAM.

The Overture dataset is about 1.5x bigger than the OSM extracts Geofabrik produce so I'm not sure if GDAL or anything else that reads OSM files has ever been tested thoroughly on a dataset of this size.

$ wget -c https://overturemaps-us-west-2.s3.amazonaws.com/release/2023-04-02-alpha/planet-2023-04-02-alpha.osm.pbf

$ for LAYER in lines; do
    echo "CREATE OR REPLACE TABLE osm_$LAYER AS
              SELECT *
              FROM st_read('planet-2023-04-02-alpha.osm.pbf',
                           open_options=['INTERLEAVED_READING=YES'],
                           layer='$LAYER',
                           sequential_layer_scan=true);" \
        | ~/duckdb_spatial/build/release/duckdb \
                -unsigned \
                overture.duckdb
done
Killed

These are the 4 layers that imported without issue for reference.

$ ogrinfo -ro planet-2023-04-02-alpha.osm.pbf 
INFO: Open of `planet-2023-04-02-alpha.osm.pbf'
      using driver `OSM' successful.
1: points (Point)
2: lines (Line String)
3: multilinestrings (Multi Line String)
4: multipolygons (Multi Polygon)
5: other_relations (Geometry Collection)
┌────────────┐
│ num_points │
│   int64    │
├────────────┤
│  181527241 │
└────────────┘
    osm_id = 1
      name = Monte Piselli - San Giacomo
   barrier = 
   highway = 
       ref = 
   address = 
     is_in = 
     place = 
  man_made = mast
other_tags = "tower:type"=>"communication","tower:construction"=>"lattice","frequency"=>"105.5 MHz","description"=>"Radio Subasio","communication:radio"=>"fm","communication:microwave"=>"yes"
┌──────────────────────┐
│ num_multilinestrings │
│        int64         │
├──────────────────────┤
│                14594 │
└──────────────────────┘
    osm_id = 18
      name = Section Catalonia GNR02
      type = route
other_tags = "ref"=>"TET:EU:ES:GNR:02:Catalonia","operator"=>"www.transeurotrail.org","name:ca"=>"Secció Catalunya GNR02"
┌───────────────────┐
│ num_multipolygons │
│       int64       │
├───────────────────┤
│             26159 │
└───────────────────┘
     osm_id = 11
 osm_way_id = 
       name = Tween Pond
       type = multipolygon
    aeroway = 
    amenity = 
admin_level = 
    barrier = 
   boundary = 
   building = 
      craft = 
 geological = 
   historic = 
  land_area = 
    landuse = 
    leisure = 
   man_made = 
   military = 
    natural = water
     office = 
      place = 
       shop = 
      sport = 
    tourism = 
 other_tags = 
┌─────────────────────┐
│ num_other_relations │
│        int64        │
├─────────────────────┤
│                9847 │
└─────────────────────┘
    osm_id = 2202
      name = Osterbek
      type = waterway
other_tags = "wikipedia"=>"de:Osterbek","wikidata"=>"Q321993","waterway"=>"canal","ref:fgkz"=>"595674","destination"=>"Alster"

marklit avatar Apr 18 '23 05:04 marklit