ENH: Add possibility to optimize memory usage

Open HTenkanen opened this issue 4 years ago • 11 comments

Pyrosm parses quite a lot of attributes from the OSM data by default. Add the possibility to optimize memory usage by keeping only the most necessary attributes in the results and dropping everything else. This helps when working with large files.

The memory optimization should influence the behavior when parsing the PBF (i.e. not parsing less important attributes at all) and when parsing the tags (keeping only the most relevant ones).

Relates to #53

HTenkanen avatar Nov 15 '20 19:11 HTenkanen

I was going to create an enhancement request to add a "minimal" network_type keyword to get_network, with only geometry and id as keys. Does it fit in this issue?

chourmo avatar Nov 16 '20 07:11 chourmo

@chourmo Yes, that is more or less the idea with this issue 🙂 There should be a parameter to reduce the number of attributes to an absolute minimum, and in addition the user should be able to keep specific tags using a tags_to_keep parameter or something like that.

HTenkanen avatar Nov 16 '20 08:11 HTenkanen

Implementation ideas for this:

  • OSM() should have a parameter keep_meta that could be used to skip reading metadata from the PBF (version, timestamp, etc.)
  • all "get" methods should have a parameter ~~optimize_memory~~ minimal_tags that can be set to True to optimize memory usage by parsing only crucial attributes from the data (e.g. id and geometry)
  • all "get" methods should make it possible to specify which attributes are crucial by passing a list of names (tags_to_keep?)
  • to_graph() should have parameters keep_edge_geometry and keep_node_geometry that could be used to drop the geometry attributes from the graph, since they are not always needed

HTenkanen avatar Nov 19 '20 05:11 HTenkanen

"optimize memory" seems quite ambiguous to me, as the impact on tags is not obvious. "minimal_tags" seems to express results more clearly.

chourmo avatar Nov 19 '20 12:11 chourmo

@chourmo Thanks! That is indeed a more intuitive option 👍

HTenkanen avatar Nov 19 '20 12:11 HTenkanen

It would also be great to have geometry=True/False, default to True, to only retrieve tags.

chourmo avatar Nov 21 '20 08:11 chourmo

> It would also be great to have geometry=True/False, default to True, to only retrieve tags.

@chourmo The idea is to give the user full control with the tags_to_keep parameter. Using that you can specify e.g. tags_to_keep=["id", "name"], which would keep only those, meaning that geometry would be dropped from the final result. The only required tag that will always be kept is the id (due to how the internals of the library work).
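A minimal sketch of the selection rule described above, not pyrosm's actual implementation. The function name `select_attributes` and the dict-based element are assumptions made for illustration; `tags_to_keep` and `minimal_tags` are the parameter names proposed in this thread.

```python
def select_attributes(element, tags_to_keep=None, minimal_tags=False):
    """Return a copy of `element` restricted to the requested attributes.

    Hypothetical helper: `element` is a plain dict standing in for one
    parsed OSM feature. The "id" key is always kept.
    """
    if tags_to_keep is not None:
        # Explicit list wins: keep exactly these names (plus "id").
        wanted = set(tags_to_keep)
    elif minimal_tags:
        # Minimal mode: only id and geometry.
        wanted = {"geometry"}
    else:
        # Default: keep everything.
        return dict(element)
    wanted.add("id")
    return {key: value for key, value in element.items() if key in wanted}


road = {
    "id": 1,
    "geometry": "LINESTRING (0 0, 1 1)",
    "name": "Main Street",
    "highway": "residential",
}

only_name = select_attributes(road, tags_to_keep=["name"])  # geometry is dropped
minimal = select_attributes(road, minimal_tags=True)        # id and geometry only
```

In this sketch an explicit `tags_to_keep` list takes precedence over `minimal_tags`; that ordering is a design assumption, not something decided in the discussion.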

HTenkanen avatar Nov 21 '20 09:11 HTenkanen

Other ideas:

  • Take advantage of sparse arrays? Would introduce a new dependency: https://sparse.pydata.org/en/stable/
  • decode coordinates to floats only when geometries are parsed (keep them as integers before that)
  • parse ways and relations before nodes, and keep only the nodes which belong to those (requires reorganizing the PBF parsing)
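The last two ideas can be sketched with NumPy as follows. This is an illustration under stated assumptions, not pyrosm code: the function names are hypothetical, ways are represented as plain lists of node ids, and the raw coordinates are assumed to fit in int32 at the default PBF granularity of 100 nanodegrees. The decoding formula (degrees = 1e-9 * (offset + granularity * raw)) is the one defined by the OSM PBF format.

```python
import numpy as np

NANODEGREE = 1e-9  # raw PBF coordinates are integers in units of `granularity` nanodegrees


def referenced_node_ids(ways):
    """Collect the unique ids of all nodes used by the given ways."""
    if not ways:
        return np.empty(0, dtype=np.int64)
    return np.unique(np.concatenate([np.asarray(refs, dtype=np.int64) for refs in ways]))


def keep_referenced_nodes(node_ids, raw_lon, raw_lat, ways):
    """Drop nodes that no way refers to; coordinates stay as raw integers."""
    mask = np.isin(node_ids, referenced_node_ids(ways))
    return node_ids[mask], raw_lon[mask], raw_lat[mask]


def decode(raw, offset=0, granularity=100):
    """Convert raw integer coordinates to degrees, only when geometries are built."""
    # Cast to int64 first so that `granularity * raw` cannot overflow int32.
    return NANODEGREE * (offset + granularity * raw.astype(np.int64))


node_ids = np.array([1, 2, 3, 4], dtype=np.int64)
raw_lon = np.array([249000000, 249100000, 249200000, 249300000], dtype=np.int32)
raw_lat = np.array([601000000, 601100000, 601200000, 601300000], dtype=np.int32)
ways = [[1, 3], [3, 4]]  # node 2 is not used by any way

ids, lon, lat = keep_referenced_nodes(node_ids, raw_lon, raw_lat, ways)
lon_degrees = decode(lon)  # floats are created only for the nodes that survived
```

Keeping the filtered coordinates as int32 until decoding halves their footprint compared to float64, and the float arrays are only ever allocated for nodes that are actually needed.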

HTenkanen avatar Nov 22 '20 18:11 HTenkanen

Not sure whether it's the right place to post this, but there is a major possibility for memory improvement when using bounding_box in pyrosm.OSM.

A naive implementation I tested was to move nodes parsing directly into the file reading loop, basically deleting get_primitive_blocks_and_string_tables and moving its code directly into _parse_osm_data.

On a specific example with the osm.pbf of a French region where I'm only interested in one city, I get the memory footprint of get_buildings down from 8 GB to 200 MB, so roughly a 40-fold reduction. The downside is that the runtime goes up from a bit less than 40 s to more than 60 s (roughly 150 to 200% of the original), but there might be ways to improve that.

EDIT: if speed cannot be improved, this could also be offered as an opt-in setting, with the documentation stating that there is a tradeoff between memory and speed.
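The general idea (filter inside the reading loop instead of collecting every block first) can be sketched like this. It is a simplified illustration, not the actual change to `_parse_osm_data`: `nodes_in_bbox` and `read_blocks` are hypothetical names, and each block is modelled as a list of `(id, lon, lat)` tuples.

```python
def nodes_in_bbox(blocks, bbox):
    """Yield (id, lon, lat) for nodes inside `bbox`, one block at a time.

    `blocks` can be a lazy generator, so only the current block and the
    matching nodes are held in memory.
    """
    minx, miny, maxx, maxy = bbox
    for block in blocks:
        for node_id, lon, lat in block:
            if minx <= lon <= maxx and miny <= lat <= maxy:
                yield node_id, lon, lat


def read_blocks():
    """Hypothetical stand-in for a reader that decodes one PBF block at a time."""
    yield [(1, 2.35, 48.85), (2, 5.37, 43.30)]
    yield [(3, 2.29, 48.86), (4, -1.55, 47.22)]


city_bbox = (2.22, 48.81, 2.47, 48.91)  # (minx, miny, maxx, maxy)
inside = list(nodes_in_bbox(read_blocks(), city_bbox))
```

Because nodes outside the bounding box are discarded as soon as their block has been read, peak memory scales with the size of the area of interest rather than with the size of the whole file.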

tfardet avatar Aug 11 '23 10:08 tfardet

I had the same issue. I tried to use the osmium tool to crop the data to a small region with osmium-extract (https://docs.osmcode.org/osmium/latest/osmium-extract.html). I could crop and read the pbf file, but I still cannot process it with pyrosm.

chungkang avatar Sep 13 '23 14:09 chungkang

@chungkang I ended up combining osmium and pyogrio, so switching away from pyrosm as this new library is actively developed by the people from geopandas.

tfardet avatar Sep 13 '23 16:09 tfardet