duckdb_spatial
OOM issue importing an 84 GB OSM file into DuckDB
4 of the 5 layers import without issue and produce ~84 GB of DuckDB-formatted data. The 5th layer, lines, causes an OOM. This was run on a machine with 64 GB of RAM.
The Overture dataset is about 1.5x bigger than the OSM extracts Geofabrik produces, so I'm not sure whether GDAL, or anything else that reads OSM files, has ever been tested thoroughly on a dataset of this size.
$ wget -c https://overturemaps-us-west-2.s3.amazonaws.com/release/2023-04-02-alpha/planet-2023-04-02-alpha.osm.pbf
$ for LAYER in lines; do
echo "CREATE OR REPLACE TABLE osm_$LAYER AS
SELECT *
FROM st_read('planet-2023-04-02-alpha.osm.pbf',
open_options=['INTERLEAVED_READING=YES'],
layer='$LAYER',
sequential_layer_scan=true);" \
| ~/duckdb_spatial/build/release/duckdb \
-unsigned \
overture.duckdb
done
Killed
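One thing that might be worth trying before anything else: capping DuckDB's memory and relaxing insertion order before retrying the lines layer. Both `memory_limit` and `preserve_insertion_order` are real DuckDB settings, and disabling insertion-order preservation reduces buffering during large CTAS statements, but whether they avoid this particular OOM is an assumption; this is an untested sketch, not a confirmed fix.

```sql
-- Sketch: cap memory below physical RAM and allow out-of-order inserts,
-- then retry the failing layer. Untested at this scale.
SET memory_limit = '48GB';
SET preserve_insertion_order = false;

CREATE OR REPLACE TABLE osm_lines AS
SELECT *
FROM st_read('planet-2023-04-02-alpha.osm.pbf',
             open_options=['INTERLEAVED_READING=YES'],
             layer='lines',
             sequential_layer_scan=true);
```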
For reference, here is the layer list, followed by row counts and sample records from the 4 layers that imported without issue.
$ ogrinfo -ro planet-2023-04-02-alpha.osm.pbf
INFO: Open of `planet-2023-04-02-alpha.osm.pbf'
using driver `OSM' successful.
1: points (Point)
2: lines (Line String)
3: multilinestrings (Multi Line String)
4: multipolygons (Multi Polygon)
5: other_relations (Geometry Collection)
┌────────────┐
│ num_points │
│ int64 │
├────────────┤
│ 181527241 │
└────────────┘
osm_id = 1
name = Monte Piselli - San Giacomo
barrier =
highway =
ref =
address =
is_in =
place =
man_made = mast
other_tags = "tower:type"=>"communication","tower:construction"=>"lattice","frequency"=>"105.5 MHz","description"=>"Radio Subasio","communication:radio"=>"fm","communication:microwave"=>"yes"
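The `other_tags` field in the records above uses PostgreSQL hstore syntax (`"key"=>"value",...`), which GDAL's OSM driver emits for tags not broken out into their own columns. A small, hypothetical Python helper for turning that string into a dict (the function name and regex are mine, not part of GDAL or DuckDB):

```python
import re

# Matches one "key"=>"value" pair, allowing backslash-escaped quotes
# inside keys and values, as hstore's text representation does.
_PAIR_RE = re.compile(r'"((?:[^"\\]|\\.)*)"=>"((?:[^"\\]|\\.)*)"')

def parse_other_tags(other_tags: str) -> dict:
    """Parse GDAL's hstore-style other_tags string into a plain dict."""
    return {
        k.replace('\\"', '"'): v.replace('\\"', '"')
        for k, v in _PAIR_RE.findall(other_tags)
    }

tags = parse_other_tags('"tower:type"=>"communication","frequency"=>"105.5 MHz"')
print(tags["frequency"])  # 105.5 MHz
```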
┌──────────────────────┐
│ num_multilinestrings │
│ int64 │
├──────────────────────┤
│ 14594 │
└──────────────────────┘
osm_id = 18
name = Section Catalonia GNR02
type = route
other_tags = "ref"=>"TET:EU:ES:GNR:02:Catalonia","operator"=>"www.transeurotrail.org","name:ca"=>"Secció Catalunya GNR02"
┌───────────────────┐
│ num_multipolygons │
│ int64 │
├───────────────────┤
│ 26159 │
└───────────────────┘
osm_id = 11
osm_way_id =
name = Tween Pond
type = multipolygon
aeroway =
amenity =
admin_level =
barrier =
boundary =
building =
craft =
geological =
historic =
land_area =
landuse =
leisure =
man_made =
military =
natural = water
office =
place =
shop =
sport =
tourism =
other_tags =
┌─────────────────────┐
│ num_other_relations │
│ int64 │
├─────────────────────┤
│ 9847 │
└─────────────────────┘
osm_id = 2202
name = Osterbek
type = waterway
other_tags = "wikipedia"=>"de:Osterbek","wikidata"=>"Q321993","waterway"=>"canal","ref:fgkz"=>"595674","destination"=>"Alster"
For what it's worth, osm2pgsql, the OSM project's tool for importing PBFs into PostgreSQL, will also run out of RAM on a 64 GB system with default settings: https://github.com/openstreetmap/osm2pgsql/issues/1954. I'm trying its --slim and --drop flags to see if they help.
@marklit Not sure if you're interested - but as part of the Daylight Map project (daylightmap.org) we publish parquet files including geometries of the underlying OSM data. See https://daylightmap.org/daylight-osm-distribution-parquet.html for more. See also https://aws.amazon.com/blogs/publicsector/querying-daylight-openstreetmap-distribution-amazon-athena/
Thanks @jwass. Funnily enough, I came across that dataset a few weeks ago when I was looking at the canopy one Meta published for California.
The following worked but took 5.5 days to complete.
$ osm2pgsql -d overture --slim --drop planet-2023-04-02-alpha.osm.pbf
@marklit When working in a low-memory environment, I've started using the open option USE_CUSTOM_INDEXING=NO. It might not be as fast, but it got the job done without crashing the database.
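Applied to the failing query from the top of the thread, that option would be passed alongside INTERLEAVED_READING in the open_options list. USE_CUSTOM_INDEXING is a documented GDAL OSM driver option that falls back to on-disk SQLite indexing of nodes; whether it rescues this specific 84 GB import is untested here.

```sql
-- Sketch: disable GDAL's in-memory custom node indexing so the OSM
-- driver spills its node index to disk instead of RAM.
CREATE OR REPLACE TABLE osm_lines AS
SELECT *
FROM st_read('planet-2023-04-02-alpha.osm.pbf',
             open_options=['INTERLEAVED_READING=YES',
                           'USE_CUSTOM_INDEXING=NO'],
             layer='lines',
             sequential_layer_scan=true);
```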
FWIW I'm going to look at better supporting this use case tomorrow.
Depending on how you plan to use OSM data, you may or may not find the progress in duckdb/duckdb_spatial#90 useful. It's super fast, but it doesn't get you "all the way", in the sense that it doesn't produce complete GEOMETRY objects. We're thinking about how to perform the geometry reconstruction in an efficient and easy-to-use way, but it's most likely going to require significantly more work.
The Overture canopy dataset has been taking weeks to process with PG. I'm going to try to find a DuckDB-only or DuckDB + Shapely way of getting this data into BQ. I'll report back with any findings I come across.
I tried loading a remote PBF file, but the following query failed:
CREATE OR REPLACE TABLE osm_points_monaco AS SELECT * FROM st_read('/vsicurl/https://download.geofabrik.de/europe/monaco-latest.osm.pbf', open_options=['INTERLEAVED_READING=YES'], layer='points', sequential_layer_scan=true);
with
IO Error: Could not open file: /vsicurl/https://download.geofabrik.de/europe/monaco-latest.osm.pbf (Could not parse configuration file for OSM import)
Do you have an idea on how to solve this @marklit / @Maxxen?
@tobilg Check out osmconf.ini, described here: https://gdal.org/drivers/vector/osm.html. Also, there is a native OSM PBF reader if you can live with raw objects (you have to build the geometries yourself).
@ttomasz Thanks for the fast reply! I was assuming it was already included (somehow) in the build, as the example above didn't use it. Looks like I'd have to ship this file with my custom spatial extension build for AWS Lambda then :-/
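If shipping osmconf.ini with the deployment, GDAL's OSM driver also accepts a CONFIG_FILE open option pointing at the file explicitly, which could be forwarded through st_read's open_options. The Lambda path below is hypothetical, and this doesn't address the /vsicurl part of the error, so treat it as a sketch of the config-file half of the problem only.

```sql
-- Sketch: point the OSM driver at a bundled osmconf.ini via the
-- documented CONFIG_FILE open option. /var/task/ is a guess at where
-- a Lambda deployment package would land.
CREATE OR REPLACE TABLE osm_points_monaco AS
SELECT *
FROM st_read('monaco-latest.osm.pbf',
             open_options=['INTERLEAVED_READING=YES',
                           'CONFIG_FILE=/var/task/osmconf.ini'],
             layer='points',
             sequential_layer_scan=true);
```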
@tobilg I'm not sure how you'd get that call working with vsicurl. I've always downloaded the OSM data as its own step first and then called DuckDB after that. duckdb/duckdb#10609 discusses how some reading functionality is handled by other parts of DuckDB; I'm not sure if calling vsicurl would trigger code paths outside of the Spatial Extension.