quackosm
cli: Add option to export to DuckDB
I love your project enabling next-gen OSM analyses with the power of DuckDB. Wouldn't it be cool to enable specifying a DuckDB file (and optionally a table) to store the data in? I know DuckDB has a native OSM reader, but you have abstracted it into the convenience of Points, Lines and Polygons.
A cool possible outcome would be, e.g.:
quackosm monaco-latest.osm.pbf --duckdb monaco_osm.duckdb 'osm'
instead of what's currently possible:
quackosm monaco-latest.osm.pbf
⚫◗ create table osm as
select feature_id, tags, st_geomfromwkb(geometry) as geometry
from read_parquet('files/monaco-latest_nofilter_noclip_compact.geoparquet');
If you think this is an interesting idea, I'd love to contribute. Maybe we could discuss possible entry points for where to start.
Hello @mwip 👋🏻
I think that this might be an interesting idea. As you've mentioned, the current and probably the easiest option would be to just run a CREATE TABLE query after the whole process and automate it.
If I were to add such functionality, I'd add new functions internally next to the existing ones:
- convert_pbf_to_gpq - takes a downloaded PBF file and returns a geoparquet file (or parquet in WKT form)
- convert_geometry_to_gpq - takes a geometry, downloads PBF file(s) and returns a geoparquet file (or parquet in WKT form)
- get_features_gdf - takes a downloaded PBF file and returns a GeoDataFrame
- get_features_gdf_from_geometry - takes a geometry, downloads PBF file(s) and returns a GeoDataFrame
And new ones:
- convert_pbf_to_duckdb - takes a downloaded PBF file and saves it in a given DuckDB file and table name
- convert_geometry_to_duckdb - takes a geometry, downloads PBF file(s) and saves it in a given DuckDB file and table name
For this use case, the DuckDB file could get a generated name if not provided, just like the geoparquet file, and a default table name like osm_features.
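A minimal sketch of how that default naming could work (the helper name and the exact naming scheme are assumptions, mirroring the generated geoparquet file name shown above):

```python
from pathlib import Path


def default_duckdb_destination(pbf_path: str, result_dir: str = "files") -> tuple[str, str]:
    """Derive a default DuckDB file path and table name from the PBF file name.

    Hypothetical helper: mirrors the generated geoparquet naming, e.g.
    files/monaco-latest_nofilter_noclip_compact.geoparquet.
    """
    # "monaco-latest.osm.pbf" -> stem "monaco-latest.osm" -> "monaco-latest"
    stem = Path(pbf_path).stem.removesuffix(".osm")
    duckdb_file = Path(result_dir) / f"{stem}_nofilter_noclip_compact.duckdb"
    return str(duckdb_file), "osm_features"
```

Calling it with `"monaco-latest.osm.pbf"` would yield the file path `files/monaco-latest_nofilter_noclip_compact.duckdb` and the table name `osm_features`.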
The easiest way to do this would be to copy get_features_gdf and get_features_gdf_from_geometry functions:
https://github.com/kraina-ai/quackosm/blob/5d19b3c6024a79578d2543db07eadea139ccf81c/quackosm/pbf_file_reader.py#L431
https://github.com/kraina-ai/quackosm/blob/5d19b3c6024a79578d2543db07eadea139ccf81c/quackosm/pbf_file_reader.py#L533
And just replace the geoparquet-to-GeoDataFrame transform with a CREATE TABLE query to save the data in the DuckDB table (and maybe automatically remove the generated geoparquet file after successful table creation).
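The copied functions could then end with something like the following instead of the GeoDataFrame conversion (a sketch; the helper and the table name parameter are assumptions, the query mirrors the manual one from the issue description):

```python
def build_create_table_query(parquet_path: str, table_name: str = "osm_features") -> str:
    """Compose the CREATE TABLE query that materializes the intermediate
    geoparquet file as a DuckDB table with a parsed geometry column."""
    return (
        f"CREATE TABLE {table_name} AS "
        "SELECT feature_id, tags, ST_GeomFromWKB(geometry) AS geometry "
        f"FROM read_parquet('{parquet_path}')"
    )


# The reader would then execute it on a connection opened on the target file:
# connection.execute(build_create_table_query(result_path, table_name))
```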
The direct streaming of data to DuckDB could be achieved but would require more sophisticated changes in the process (mainly because of the duplicated rows removal when multiple PBF files are parsed at once).
Things to keep in mind:
- The DuckDB file might not exist yet, so it should be created (the Python API will create it if it doesn't exist, but only the final file, not the directories)
- If the file and table already exist, the query shouldn't run (overridden with the ignore_cache flag parameter)
- Changes to the API would be required in 3 places: PbfFileReader, the functions.py file and the cli.py file
For the CLI changes, I have a few ideas:
- A plain --duckdb flag - data saved in a file with a generated name, in the default table osm_features
quackosm monaco-latest.osm.pbf --duckdb
Finished operation in 0:00:00
files/monaco-latest_nofilter_noclip_compact.duckdb::osm_features
- A passed result file with a .duckdb or .db extension (the --duckdb flag would be optional, inferred from the output file extension)
quackosm monaco-latest.osm.pbf --output monaco.duckdb
Finished operation in 0:00:00
monaco.duckdb::osm_features
- A passed table name - data saved in the given table
quackosm monaco-latest.osm.pbf --output monaco.duckdb --output-table-name osm
Finished operation in 0:00:00
monaco.duckdb::osm
Some additional CLI errors should probably be thrown for illogical combinations of parameters (such as an output file name with the geoparquet extension combined with the --duckdb or --output-table-name flags).
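Such a guard could be a small validation helper in cli.py (a sketch; the parameter names and error message are assumptions):

```python
from typing import Optional


def validate_output_options(
    output: Optional[str], duckdb_flag: bool, table_name: Optional[str]
) -> None:
    """Raise on illogical CLI parameter combinations, e.g. a (geo)parquet
    output path combined with the DuckDB-related flags."""
    wants_duckdb = duckdb_flag or table_name is not None
    is_parquet_output = output is not None and output.endswith((".parquet", ".geoparquet"))
    if is_parquet_output and wants_duckdb:
        raise ValueError(
            "Cannot combine a (geo)parquet output file with --duckdb / --output-table-name."
        )
```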
You are welcome to try and implement it, I'll try to give you more hints and help with docs (examples) and tests if needed.
Thanks for considering this and providing the relevant starting points. I am a little busy, but I plan on working on this eventually and providing a PR. If anyone beats me to it, that's fine with me. :smile:
I was finally able to start working on this and hope I'm able to provide a PR soon.