pyrosm
ENH: Add possibility to optimize memory usage
Pyrosm parses quite a lot of attributes by default from the OSM data. Add possibility to optimize memory usage by keeping only the most necessary attributes in the results and dropping everything else out. This helps when working with large files.
The memory optimization should influence the behavior when parsing the PBF (i.e. not parsing less important attributes at all) and when parsing the tags (keeping only the most relevant ones).
Relates to #53
I was going to create an enhancement request to add a "minimal" network_type keyword to get_network, with only geometry and id as keys. Does it fit in this issue?
@chourmo Yes, that is more or less the idea with this issue 🙂 There should be a parameter to reduce the number of attributes to the absolute minimum, and in addition the user should be able to keep specific tags using a tags_to_keep parameter or something like that.
Implementation ideas for this:
- OSM() should have a parameter keep_meta that could be used to skip reading metadata from the PBF (version, timestamp, etc.)
- all "get" methods should have a parameter ~~optimize_memory~~ minimal_tags that can be set to True to optimize memory usage by parsing only crucial attributes from the data (e.g. id and geometry)
- all "get" methods should make it possible to specify which attributes are crucial by passing a list of names (tags_to_keep?)
- to_graph() should have parameters keep_edge_geometry and keep_node_geometry that could be used to drop the geometry attributes from the graph, since they are not always needed
"optimize memory" seems quite ambiguous to me, as the impact on tags is not obvious. "minimal_tags" seems to express results more clearly.
@chourmo Thanks! That is indeed a more intuitive option 👍
It would also be great to have geometry=True/False, default to True, to only retrieve tags.
@chourmo The idea is to give the user full control with the tags_to_keep parameter. Using that you can specify e.g. tags_to_keep=["id", "name"], which would keep only those, meaning that geometry would be dropped from the final result. The only required tag, which will always be kept, is the id (due to how the internals of the library work).
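To make the proposed semantics concrete, here is a minimal sketch in plain Python. It is not pyrosm code: select_attributes is a hypothetical helper, records are plain dicts rather than GeoDataFrames, and the assumption that minimal_tags=True means "id and geometry only" is taken from the examples in this thread.

```python
from typing import Any, Dict, Iterable, List, Optional

REQUIRED = ("id",)            # always kept, as stated in the discussion
MINIMAL = ("id", "geometry")  # assumed meaning of minimal_tags=True


def select_attributes(
    records: Iterable[Dict[str, Any]],
    tags_to_keep: Optional[List[str]] = None,
    minimal_tags: bool = False,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep only the requested attributes of each record."""
    if tags_to_keep is not None:
        # An explicit list wins; the id is added back even if it was omitted.
        keep = set(tags_to_keep) | set(REQUIRED)
    elif minimal_tags:
        keep = set(MINIMAL)
    else:
        # No filtering requested: return shallow copies of everything.
        return [dict(r) for r in records]
    return [{k: v for k, v in r.items() if k in keep} for r in records]


ways = [
    {"id": 1, "geometry": "LINESTRING (0 0, 1 1)", "name": "Main St", "highway": "residential"},
    {"id": 2, "geometry": "LINESTRING (1 1, 2 2)", "highway": "service"},
]

names_only = select_attributes(ways, tags_to_keep=["name"])  # geometry is dropped
minimal = select_attributes(ways, minimal_tags=True)         # id and geometry only
```

Filtering after parsing, as done here, only illustrates the resulting columns; the memory saving discussed in this issue would come from not parsing the dropped attributes in the first place.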
Other ideas:
- Take advantage of sparse arrays? Would introduce a new dependency. https://sparse.pydata.org/en/stable/
- decode coordinates to floats only when geometries are parsed (keep as integers before)
- parse ways and relations before nodes, and keep only nodes which belong to those (requires reorganizing the PBF parsing)
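A rough sketch of the "keep as integers" idea, using only the standard library. It relies on the OSM PBF convention that coordinates are stored as delta-coded integers and converted with degrees = 1e-9 * (offset + granularity * value), with a default granularity of 100. The function names and sample values are made up for illustration and do not reflect pyrosm's internals.

```python
from array import array
from itertools import accumulate
from typing import Iterable, List

NANO = 1e-9  # PBF coordinates are expressed in nanodegree units


def delta_decode(deltas: Iterable[int]) -> array:
    """Undo the delta coding and keep the result as compact 64-bit integers."""
    return array("q", accumulate(deltas))


def to_degrees(raw: Iterable[int], granularity: int = 100, offset: int = 0) -> List[float]:
    """Convert raw integer coordinates to degrees (done late, only when needed)."""
    return [NANO * (offset + granularity * v) for v in raw]


# Made-up latitude deltas for three nodes.
lat_deltas = [601699000, 1000, -500]
raw_lat = delta_decode(lat_deltas)  # stays integer until geometries are built

# Convert only the nodes that are actually needed, e.g. those referenced by a way.
wanted = [0, 2]
lat_deg = to_degrees(raw_lat[i] for i in wanted)
```

Converting only the wanted indices also hints at the third idea above: if the referenced node ids are known first, the float conversion can skip every other node.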
Not sure whether it's the right place to post this, but there is a major possibility for memory improvement when using bounding_box in pyrosm.OSM.
A naive implementation I tested was to move nodes parsing directly into the file reading loop, basically deleting get_primitive_blocks_and_string_tables and moving its code directly into _parse_osm_data.
On a specific example with the osm.pbf of a French region where I'm only interested in one city, I get the memory footprint of get_buildings down from 8 GB to 200 MB, a roughly 40-fold reduction (about 97.5% less).
The downside is that the run time goes up from a bit less than 40 s to more than 60 s (i.e. 150 to 200% of the original run time), but there might be ways to improve that.
EDIT: if speed cannot be improved, this could be made optional, with a note that there is a tradeoff between memory and speed.
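The general pattern behind that change can be sketched as follows. This is not the actual patch and does not use pyrosm's functions: iter_blocks is a hypothetical stand-in for reading primitive blocks from the file one at a time, and nodes are simplified to (id, lon, lat) tuples.

```python
from typing import Iterable, Iterator, List, Tuple

Node = Tuple[int, float, float]           # (id, lon, lat)
BBox = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def iter_blocks() -> Iterator[List[Node]]:
    """Hypothetical stand-in for reading primitive blocks one at a time."""
    yield [(1, 2.35, 48.85), (2, 5.37, 43.30)]
    yield [(3, 2.36, 48.86), (4, -1.55, 47.22)]


def nodes_in_bbox(blocks: Iterable[List[Node]], bbox: BBox) -> List[Node]:
    """Filter each block as soon as it is read instead of collecting all blocks first."""
    minx, miny, maxx, maxy = bbox
    kept: List[Node] = []
    for block in blocks:
        # Only nodes inside the bounding box survive; the rest of the block
        # can be garbage collected before the next block is read.
        kept.extend(
            n for n in block
            if minx <= n[1] <= maxx and miny <= n[2] <= maxy
        )
    return kept


paris = (2.2, 48.8, 2.5, 48.9)
kept = nodes_in_bbox(iter_blocks(), paris)
```

Peak memory then scales with one block plus the kept nodes rather than with the whole file, which is consistent with the reduction reported above; the extra time plausibly comes from doing the filtering work inside the read loop.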
I had the same issue. I tried using the osmium tool to crop the data to a small region with osmium-extract: https://docs.osmcode.org/osmium/latest/osmium-extract.html I could crop and read the PBF file, but still could not process it with pyrosm.
@chungkang I ended up combining osmium and pyogrio, so switching away from pyrosm, as that newer library (pyogrio) is actively developed by the people from geopandas.
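For anyone landing here, one way such a combination might look is sketched below. The exact workflow used above is not described in this thread, so this is an assumption: it crops with the osmium command line tool (-b for the bounding box, -o for the output file, -O to overwrite) and reads the result with pyogrio.read_dataframe through GDAL's OSM driver, whose layers include "multipolygons". File names and the helper functions are hypothetical.

```python
import subprocess
from typing import List, Sequence


def osmium_extract_cmd(src: str, dst: str, bbox: Sequence[float]) -> List[str]:
    """Build an `osmium extract` command; bbox is (left, bottom, right, top)."""
    left, bottom, right, top = bbox
    return [
        "osmium", "extract",
        "-b", f"{left},{bottom},{right},{top}",
        "-o", dst,
        "-O",  # overwrite the output file if it already exists
        src,
    ]


def extract_and_read(src: str, dst: str, bbox: Sequence[float], layer: str = "multipolygons"):
    """Crop the PBF with osmium, then read one layer of the result with pyogrio."""
    import pyogrio  # imported here so the command builder works without pyogrio

    subprocess.run(osmium_extract_cmd(src, dst, bbox), check=True)
    return pyogrio.read_dataframe(dst, layer=layer)


# Hypothetical file names and bounding box.
cmd = osmium_extract_cmd("region.osm.pbf", "city.osm.pbf", (2.2, 48.8, 2.5, 48.9))
```

Calling extract_and_read requires osmium-tool on the PATH and a GDAL build with the OSM driver; which attributes end up as columns depends on GDAL's OSM configuration.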