Inconsistent row count between North America extract and planet file using `ST_ReadOsm()`
Hello!
I was playing around with ST_ReadOsm() and I noticed inconsistent results for the same query when run on the North America extract compared to the Planet file. I'm wondering if this might be a bug because I'm not sure how to explain the behavior.
The north america extract returns 245,000 rows but the planet only returns 90,000. I'd expect the planet to return far more than North America. Both files were downloaded today.
D SELECT COUNT(*) FROM st_readOSM('/home/wcedmisten/Downloads/planet-240610.osm.pbf') WHERE tags['amenity'] = ['restaurant'];
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 89751 │
└──────────────┘
D SELECT COUNT(*) FROM st_readOSM('/home/wcedmisten/Downloads/north-america-latest.osm.pbf') WHERE tags['amenity'] = ['restaurant'];
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 244678 │
└──────────────┘
DuckDB version:
~/Downloads/duckdb --version
v1.0.0 1f98600c2c
OS version:
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
Here's a specific way that is missing from the DuckDB query:
D SELECT kind, id FROM st_readOSM('/home/wcedmisten/Downloads/north-america-latest.osm.pbf') WHERE id = 18991954;
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────────────────────────────────────┬──────────┐
│ kind │ id │
│ enum('node', 'way', 'relation', 'changeset') │ int64 │
├──────────────────────────────────────────────┼──────────┤
│ way │ 18991954 │
└──────────────────────────────────────────────┴──────────┘
D SELECT kind, id FROM st_readOSM('/home/wcedmisten/Downloads/planet-240610.osm.pbf') WHERE id = 18991954;
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────────────────────────────────────┬───────┐
│ kind │ id │
│ enum('node', 'way', 'relation', 'changeset') │ int64 │
├──────────────────────────────────────────────┴───────┤
│ 0 rows │
└──────────────────────────────────────────────────────┘
I verified that this way does exist via osmium, so I think this points to being a duckDB bug.
osmium getid --verbose-ids --overwrite --default-type=way ~/Downloads/planet-240610.osm.pbf 18991954 -o test.osm.bz2
[ 0:00] Started osmium getid
[ 0:00] osmium version 1.14.0
[ 0:00] libosmium version 2.18.0
[ 0:00] Command line options and default settings:
[ 0:00] input options:
[ 0:00] file name: /home/wcedmisten/Downloads/planet-240610.osm.pbf
[ 0:00] file format:
[ 0:00] output options:
[ 0:00] file name: test.osm.bz2
[ 0:00] file format:
[ 0:00] generator: osmium/1.14.0
[ 0:00] overwrite: yes
[ 0:00] fsync: no
[ 0:00] other options:
[ 0:00] add referenced objects: no
[ 0:00] work with history files: no
[ 0:00] default object type: way
[ 0:00] looking for these ids:
[ 0:00] nodes:
[ 0:00] ways: 18991954
[ 0:00] relations:
[ 0:00] Opening input file...
[ 0:00] Opening output file...
[ 0:00] Copying matching objects to output file...
[======================================================================] 100%
[ 2:32] Closing output file...
[ 2:32] Closing input file...
[ 2:32] Found all objects.
[ 2:32] Peak memory used: 1185 MBytes
[ 2:32] Done.
Hello! Thanks for raising this issue! This might very well be a bug... Where did you get the planet/NA extract from? GeoFabrik?
Yeah the extract was through Geofabrik and the planet file was through the official torrent (from this page)
After noticing that planet queries were missing ~95% of the nodes, I found this issue, and decided to dig in a bit.
I believe what's happening is that only the first 'group' is being read per 'block'.
As soon as the granularity and offsets are read, the block_reader will advance beyond any remaining 'group' (2) nodes.
I was able to confirm that by removing the 'block_reader.next' calls in the ParseState::Block case, DuckDB returns the correct count of nodes/ways/relations:
Before:
│ node │ 470272000 │
│ way │ 124680000 │
│ relation │ 78600 │
After:
│ node │ 9325981030 │
│ way │ 1042862530 │
│ relation │ 12420430 │
This seems a little messy to fix, as the granularity/offsets are used during handling of 'group' messages.
Tested with the latest planet and the nightly DuckDB build. Seems to be fixed, thanks for the quick turnaround!!! This project is awesome :tada:
D SELECT COUNT(*) FROM st_readOSM('/home/wcedmisten/Downloads/planet-240826.osm.pbf') WHERE tags['amenity'] = ['restaurant'];
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 1449974 │
└──────────────┘
D SELECT kind, id FROM st_readOSM('/home/wcedmisten/Downloads/planet-240826.osm.pbf') WHERE id = 18991954;
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────────────────────────────────────┬──────────┐
│ kind │ id │
│ enum('node', 'way', 'relation', 'changeset') │ int64 │
├──────────────────────────────────────────────┼──────────┤
│ way │ 18991954 │
└──────────────────────────────────────────────┴──────────┘