duckdb_spatial icon indicating copy to clipboard operation
duckdb_spatial copied to clipboard

OOM issue importing an 84 GB OSM file into DuckDB

Open marklit opened this issue 2 years ago • 12 comments

4 of the 5 layers will import without issue and they produced ~84 GB of DuckDB-formatted data. This 5th layer for lines causes an OOM. This was run on a machine with 64 GB of RAM.

The Overture dataset is about 1.5x bigger than the OSM extracts Geofabrik produce so I'm not sure if GDAL or anything else that reads OSM files has ever been tested thoroughly on a dataset of this size.

$ wget -c https://overturemaps-us-west-2.s3.amazonaws.com/release/2023-04-02-alpha/planet-2023-04-02-alpha.osm.pbf

$ for LAYER in lines; do
    echo "CREATE OR REPLACE TABLE osm_$LAYER AS
              SELECT *
              FROM st_read('planet-2023-04-02-alpha.osm.pbf',
                           open_options=['INTERLEAVED_READING=YES'],
                           layer='$LAYER',
                           sequential_layer_scan=true);" \
        | ~/duckdb_spatial/build/release/duckdb \
                -unsigned \
                overture.duckdb
done
Killed

These are the 4 layers that imported without issue for reference.

$ ogrinfo -ro planet-2023-04-02-alpha.osm.pbf 
INFO: Open of `planet-2023-04-02-alpha.osm.pbf'
      using driver `OSM' successful.
1: points (Point)
2: lines (Line String)
3: multilinestrings (Multi Line String)
4: multipolygons (Multi Polygon)
5: other_relations (Geometry Collection)
┌────────────┐
│ num_points │
│   int64    │
├────────────┤
│  181527241 │
└────────────┘
    osm_id = 1
      name = Monte Piselli - San Giacomo
   barrier = 
   highway = 
       ref = 
   address = 
     is_in = 
     place = 
  man_made = mast
other_tags = "tower:type"=>"communication","tower:construction"=>"lattice","frequency"=>"105.5 MHz","description"=>"Radio Subasio","communication:radio"=>"fm","communication:microwave"=>"yes"
┌──────────────────────┐
│ num_multilinestrings │
│        int64         │
├──────────────────────┤
│                14594 │
└──────────────────────┘
    osm_id = 18
      name = Section Catalonia GNR02
      type = route
other_tags = "ref"=>"TET:EU:ES:GNR:02:Catalonia","operator"=>"www.transeurotrail.org","name:ca"=>"Secció Catalunya GNR02"
┌───────────────────┐
│ num_multipolygons │
│       int64       │
├───────────────────┤
│             26159 │
└───────────────────┘
     osm_id = 11
 osm_way_id = 
       name = Tween Pond
       type = multipolygon
    aeroway = 
    amenity = 
admin_level = 
    barrier = 
   boundary = 
   building = 
      craft = 
 geological = 
   historic = 
  land_area = 
    landuse = 
    leisure = 
   man_made = 
   military = 
    natural = water
     office = 
      place = 
       shop = 
      sport = 
    tourism = 
 other_tags = 
┌─────────────────────┐
│ num_other_relations │
│        int64        │
├─────────────────────┤
│                9847 │
└─────────────────────┘
    osm_id = 2202
      name = Osterbek
      type = waterway
other_tags = "wikipedia"=>"de:Osterbek","wikidata"=>"Q321993","waterway"=>"canal","ref:fgkz"=>"595674","destination"=>"Alster"

marklit avatar Apr 18 '23 05:04 marklit

If it's worth anything, the OSM project for importing PBFs into PG will run out of RAM on a 64 GB system with default settings https://github.com/openstreetmap/osm2pgsql/issues/1954 I'm trying their --slim and --drop flags to see if they help.

marklit avatar Apr 19 '23 05:04 marklit

@marklit Not sure if you're interested - but as part of the Daylight Map project (daylightmap.org) we publish parquet files including geometries of the underlying OSM data. See https://daylightmap.org/daylight-osm-distribution-parquet.html for more. See also https://aws.amazon.com/blogs/publicsector/querying-daylight-openstreetmap-distribution-amazon-athena/

jwass avatar May 01 '23 15:05 jwass

Thanks @jwass funny enough, I came across that dataset a few weeks ago when I was looking at the canopy one Meta published for California.

marklit avatar May 04 '23 09:05 marklit

The following worked but took 5.5 days to complete.

$ osm2pgsql -d overture --slim --drop planet-2023-04-02-alpha.osm.pbf

marklit avatar May 04 '23 14:05 marklit

@marklit when working in a low memory environment, I've started using open_option USE_CUSTOM_INDEXING=NO. It might not be as fast but got the job done without crashing the database.

RaczeQ avatar May 31 '23 19:05 RaczeQ

FWIW I'm going to look at better supporting this use case tomorrow.

Maxxen avatar May 31 '23 20:05 Maxxen

Depending on how you plan to use OSM data you may or may not find the progress in duckdb/duckdb_spatial#90 useful. It's super fast, but it doesn't get you "all the way" in the sense that it doesn't produce complete GEOMETRY objects. We're thinking of how to actually perform the geometry reconstruction in an efficient and easy to use way but its most likely going to requires significantly more work.

Maxxen avatar Jun 06 '23 17:06 Maxxen

The Overture canopy dataset has been taking weeks to process with PG. I'm going to try and find a DuckDB-only or DuckDB + Shapely way of getting this data into BQ. I'll report back with any findings I come across.

marklit avatar Jun 07 '23 10:06 marklit

I tried loading a remote PBF file, but failed for the following query:

CREATE OR REPLACE TABLE osm_points_monaco AS SELECT * FROM st_read('/vsicurl/https://download.geofabrik.de/europe/monaco-latest.osm.pbf', open_options=['INTERLEAVED_READING=YES'], layer='points', sequential_layer_scan=true);

with

IO Error: Could not open file: /vsicurl/https://download.geofabrik.de/europe/monaco-latest.osm.pbf (Could not parse configuration file for OSM import)

Do you have an idea on how to solve this @marklit / @Maxxen?

tobilg avatar Aug 31 '23 10:08 tobilg

@tobilg check out osmconf.ini described here https://gdal.org/drivers/vector/osm.html also there is native osm pbf reader if you can live with raw objects (have to build geometries yourself)

ttomasz avatar Aug 31 '23 10:08 ttomasz

@ttomasz thanks for the fast reply! I was assuming that it's already included (somehow) in the build, as the above example didn't use it. Look like I'd have to ship this file with my custom spatial extension build for AWS Lambda then :-/

tobilg avatar Aug 31 '23 10:08 tobilg

@tobilg I'm not sure how you'd get that call working with viscurl. I've always downloaded the OSM data as its own step first and then called DuckDB after that. duckdb/duckdb#10609 discusses how some reading functionality is handled by other parts of DuckDB, I'm not sure if calling viscurl would trigger code paths outside of the Spatial Extension.

marklit avatar Feb 10 '24 08:02 marklit