duckdb_spatial
duckdb_spatial copied to clipboard
OOM issue importing an 84 GB OSM file into DuckDB
4 of the 5 layers will import without issue and they produced ~84 GB of DuckDB-formatted data. This 5th layer for lines
causes an OOM. This was run on a machine with 64 GB of RAM.
The Overture dataset is about 1.5x bigger than the OSM extracts Geofabrik produce so I'm not sure if GDAL or anything else that reads OSM files has ever been tested thoroughly on a dataset of this size.
$ wget -c https://overturemaps-us-west-2.s3.amazonaws.com/release/2023-04-02-alpha/planet-2023-04-02-alpha.osm.pbf
$ for LAYER in lines; do
echo "CREATE OR REPLACE TABLE osm_$LAYER AS
SELECT *
FROM st_read('planet-2023-04-02-alpha.osm.pbf',
open_options=['INTERLEAVED_READING=YES'],
layer='$LAYER',
sequential_layer_scan=true);" \
| ~/duckdb_spatial/build/release/duckdb \
-unsigned \
overture.duckdb
done
Killed
These are the 4 layers that imported without issue for reference.
$ ogrinfo -ro planet-2023-04-02-alpha.osm.pbf
INFO: Open of `planet-2023-04-02-alpha.osm.pbf'
using driver `OSM' successful.
1: points (Point)
2: lines (Line String)
3: multilinestrings (Multi Line String)
4: multipolygons (Multi Polygon)
5: other_relations (Geometry Collection)
┌────────────┐
│ num_points │
│ int64 │
├────────────┤
│ 181527241 │
└────────────┘
osm_id = 1
name = Monte Piselli - San Giacomo
barrier =
highway =
ref =
address =
is_in =
place =
man_made = mast
other_tags = "tower:type"=>"communication","tower:construction"=>"lattice","frequency"=>"105.5 MHz","description"=>"Radio Subasio","communication:radio"=>"fm","communication:microwave"=>"yes"
┌──────────────────────┐
│ num_multilinestrings │
│ int64 │
├──────────────────────┤
│ 14594 │
└──────────────────────┘
osm_id = 18
name = Section Catalonia GNR02
type = route
other_tags = "ref"=>"TET:EU:ES:GNR:02:Catalonia","operator"=>"www.transeurotrail.org","name:ca"=>"Secció Catalunya GNR02"
┌───────────────────┐
│ num_multipolygons │
│ int64 │
├───────────────────┤
│ 26159 │
└───────────────────┘
osm_id = 11
osm_way_id =
name = Tween Pond
type = multipolygon
aeroway =
amenity =
admin_level =
barrier =
boundary =
building =
craft =
geological =
historic =
land_area =
landuse =
leisure =
man_made =
military =
natural = water
office =
place =
shop =
sport =
tourism =
other_tags =
┌─────────────────────┐
│ num_other_relations │
│ int64 │
├─────────────────────┤
│ 9847 │
└─────────────────────┘
osm_id = 2202
name = Osterbek
type = waterway
other_tags = "wikipedia"=>"de:Osterbek","wikidata"=>"Q321993","waterway"=>"canal","ref:fgkz"=>"595674","destination"=>"Alster"