pyrosm
QUESTION: when getting elements within polygon, is it "intersects" or "within"?
As I've mentioned in other threads, in order to handle huge amounts of data, I'm breaking the network for my area of interest into 1.5 x 1.5 km windows.
First I created a geopandas dataframe with geometries and local x,y indices for all the windows. Then I iterate over the rows to (1) isolate the data within that geometry, (2) build the walking network as a NetworkX graph, and (3) save the graph as a pickle with the indexed coordinate name.
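For concreteness, the window grid can be sketched in plain Python (the extent, window size, and a projected-metres CRS are all assumptions for illustration; in practice each bounds tuple becomes a shapely box in the GeoDataFrame):

```python
def make_windows(xmin, ymin, xmax, ymax, size):
    """Tile the extent [xmin, xmax] x [ymin, ymax] into size-by-size
    windows, returning (i, j, bounds) records like the grid dataframe
    described above.  Coordinates are assumed to be in a projected CRS
    (metres), so size=1500 means 1.5 x 1.5 km windows."""
    windows = []
    i, x = 0, xmin
    while x < xmax:
        j, y = 0, ymin
        while y < ymax:
            # (minx, miny, maxx, maxy) for window (i, j)
            windows.append((i, j, (x, y, min(x + size, xmax), min(y + size, ymax))))
            y += size
            j += 1
        x += size
        i += 1
    return windows

# Example: a 3 km x 3 km extent tiled into 1.5 km windows -> 4 tiles
grid = make_windows(0, 0, 3000, 3000, 1500)
print(len(grid))  # 4
```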
To isolate areas I'm using:
thisAreaData = pyrosm.OSM('../Data/OSMData/kanto-latest.osm.pbf', bounding_box=thisBoundingPolygon)
When I earlier isolated the data using complicated admin area polygons, it seemed as though it was pulling all edges that intersected the bounding polygon, even when one of the nodes was outside the bounding polygon.
However, as you can see from the image (below), when creating my square windows there are gaps between the tiles.
Obviously, this cannot be used for my intended purposes, because collecting 9 tiles and merging them would just create 9 disconnected networks. I really need the edges to overlap, and the overlapping elements to have the same IDs in both subgraphs, so the network tiles can be properly recombined later. (Actually, as you know, the IDs of ways from OSM are not needed, because the NetworkX graph uses the source and target node IDs as the edge ID.)
Since the OSMIDs are already unique, I expect those are used as the node IDs, but I haven't checked. Obviously, if the node IDs are built from the intermediary geopandas indices, then that would also make my plan impossible...but I can set the index to those OSMIDs myself if this is the case.
Can you clarify these two points? (1) How can we get all elements that intersect the bounding polygon, rather than within the bounding polygon? (2) When creating a graph, how do I ensure the node (and therefore edge) IDs are kept as the OSMIDs?
Hi @bramson and thanks for good questions!
(1) How can we get all elements that intersect the bounding polygon, rather than within the bounding polygon?
In this scenario, the best way forward is to: 1) specify a small buffer around your bounding box polygon (e.g. 0.0015 decimal degrees, ~150 m; finding the best value requires a bit of testing), so that the areas used to filter the data overlap with each other. Then, when you have parsed the data with `get_network(nodes=True)` using those buffered polygons, you will most likely end up having duplicate edges in your data. 2) Filter out those duplicates with `gdf.drop_duplicates(subset=["u", "v"])`. After this, use the `to_graph()` function as you normally would, and you should end up with a fully connected network.
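The dedup step can be sketched with plain pandas (made-up `u`/`v` values standing in for the edge GeoDataFrames of two neighboring buffered tiles):

```python
import pandas as pd

# Edges parsed from two overlapping (buffered) tiles; the edge (10, 11)
# falls in the overlap zone and therefore appears in both.
tile_a = pd.DataFrame({"u": [10, 11], "v": [11, 12]})
tile_b = pd.DataFrame({"u": [10, 13], "v": [11, 14]})

merged = pd.concat([tile_a, tile_b], ignore_index=True)
# Drop edges that appear in more than one tile, keyed on the
# from-node / to-node id columns (u and v).
edges = merged.drop_duplicates(subset=["u", "v"]).reset_index(drop=True)
print(len(edges))  # 3 unique edges out of 4 rows
```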
This perhaps could be something that should be part of the tool itself, so that the user does not need to worry about it, but I need to think if there are some unobvious caveats here. 🤔
(2) When creating a graph, how do I ensure the node (and therefore edge) IDs are kept as the OSMIDs?
When a NetworkX graph is created (igraph works differently, but that's not relevant here), the node IDs are the OSMIDs (not the index of the frame or anything like that). And indeed the osmid of the edge is not relevant: the graph is created based on the edge columns `u` and `v` (from-id and to-id).
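A quick toy check of this idea (using `nx.from_pandas_edgelist` as a stand-in for what `to_graph` does internally, with invented ids playing the role of OSMIDs):

```python
import pandas as pd
import networkx as nx

# Toy edge table in the u/v layout; the ids stand in for OSM node ids.
edges = pd.DataFrame({
    "u": [111111, 222222],
    "v": [222222, 333333],
    "length": [12.5, 30.1],
})

# Build a graph keyed on the u/v columns; the node ids in the resulting
# graph are exactly the values from those columns, not any frame index.
G = nx.from_pandas_edgelist(edges, source="u", target="v", edge_attr="length")
print(sorted(G.nodes))  # [111111, 222222, 333333] -- the ids themselves
```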
I hope these clarify things a bit?
Hmmmm. Using a buffer to accomplish this is a poor solution for two main reasons:
(1) Missed edges: Any feature longer than the chosen buffer size that spans the boundary will not be included. Here I'm thinking mostly of bridges and motorway segments as likely to be dropped in some cases.
(2) Memory inefficient: If a 150 m buffer is used, then the tiles are effectively 1800 m on a side instead of 1500 m, with every pair of neighboring tiles duplicating ALL the nodes and edges within that buffer, roughly 10% of the data per side. Whereas, if the boundary-intersecting edges are kept, then neighboring tiles are minimally redundant; i.e., all and only the data needed to merge neighboring tiles is maintained in the tiles. There will and should be duplicate nodes/edges in neighboring tiles, but only the edges that cross the border.
Now, I think what I want is much harder to do, and may not fit well with your approach to reading and converting the OSM data.
My original plan for getting OSM from overpass api was to keep all "ways" if ANY of the way's nodes were in the bounding polygon... Then it could be pruned later so that edges completely outside the bounding area are eliminated. Or maybe, first convert all the ways into edges for the larger system, where the edges hold their x1, y1, x2, y2 coords for the endpoint. Then examine both endpoints and keep the edge if it is within the region. I can build that system, and I was going to until I found your package (thinking that I can't be the first and only developer/researcher who needs this capability), but maybe doing what I propose is MUCH slower than how your system works.
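My keep-if-either-endpoint-is-inside rule can be sketched in plain Python for a rectangular window (the real version would test against the polygon; coordinates are made up):

```python
def point_in_window(x, y, window):
    """True if (x, y) lies inside the (minx, miny, maxx, maxy) window."""
    xmin, ymin, xmax, ymax = window
    return xmin <= x <= xmax and ymin <= y <= ymax

def keep_edge(edge, window):
    """Keep an edge if EITHER endpoint lies inside the window, so
    border-crossing edges survive while fully-outside edges are pruned.
    `edge` holds endpoint coordinates as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = edge
    return point_in_window(x1, y1, window) or point_in_window(x2, y2, window)

window = (0, 0, 1500, 1500)
inside = (100, 100, 200, 200)       # both endpoints inside -> keep
crossing = (1400, 700, 1600, 700)   # one endpoint outside  -> keep
outside = (2000, 2000, 2100, 2100)  # fully outside         -> prune
print([keep_edge(e, window) for e in (inside, crossing, outside)])
# [True, True, False]
```

(One caveat: an edge whose endpoints are both outside but whose geometry clips a corner of the window would still be missed by this rule.)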
So my question becomes, is it even possible to adapt your method to the border-crossing-edges-inclusive approach that I need for network tiling (and keep similar performance).
Speaking of performance, getting the data within the square bounding polygon takes less than 0.1 seconds. Converting that included data into a NetworkX graph (`get_network`) takes about 4 minutes... even when it turns out that there are zero nodes and edges in the block. These are fairly small areas, so I expected it to be faster per block. Doing my whole region this way would take 16 days. Does that sound right?
@bramson
Whereas, if the boundary-intersecting edges are kept, then neighboring tiles are minimally redundant; i.e., all and only the data needed to merge neighboring tiles is maintained in the tiles.
This would indeed be the optimal case, but the problem here is not actually related to whether "intersects" or "within" is used in the spatial query.
The internal structure of OSM.PBF is structured in a way that the most efficient way to parse the data happens as follows:
- Parse all nodes from the PBF (and here I mean the OSM element, not nodes related to graph)
- Parse all ways (closed and open) --> the nodes from step 1 are used to create geometries
- Parse all relations --> the nodes from step 1 are used to create geometries
And as it is currently, the spatial filtering happens at the level of nodes
(i.e. step 1). So changing the logic so that ways are parsed before nodes (i.e. to know which nodes to keep) requires iterating over the same PBF file twice, which is not the most efficient approach. I am actually already working on this kind of approach (first filter ways, then nodes), because that would also allow handling memory much more efficiently: only the raw data needed to create, let's say, the street network is parsed from the PBF. I'll keep your use case in mind when developing that, so perhaps there will be a way to filter the ways in a "smarter" way, but the performance will be worse because of this "double iteration" (how much worse, I don't know yet).
Speaking of performance, getting the data within the square bounding polygon takes less than 0.1 seconds. Converting that included data into a networkX graph (get_network) takes about 4 minutes...
There is indeed a big drop in speed when the nodes are filtered based on bounding box (as the locations of all nodes in the PBF are always checked against the bounding box). There might be ways to make this more efficient, but haven't thought about this much yet. But yes, if you have a large PBF file, filtering based on bounding box can be quite slow as mentioned here.
But in case you need to do these kinds of small-area data extractions, pyrosm probably isn't the best tool for that (the tool is designed to parse whole PBF files efficiently). You could use osmnx instead, which uses the Overpass API as a backend (it works efficiently for small areas, such as your case).
Thanks again for your hard work and detailed feedback.
Your workflow is exactly as I imagined it. Because the OSM format is strongly node-first, with everything built from them, it totally makes sense to start from them and filter there.
For the performance, I think you got it backwards. Getting the data for a small grid is amazingly fast: 0.1 SECONDS!!! Doing this for a city district took only 0.7 seconds. That part is amazingly fast. So doing two passes of that step in order to get all the edges with an endpoint within the boundary polygon would be fine, probably not even noticeable (in my case).
Converting it to a network takes 4 minutes...even when the network is empty! This may point to a problem in my system, and one thing that occurred to me is the warning I get for version incompatibility:
C:\WinPython\python-3.6.5.amd64\lib\site-packages\geopandas\_compat.py:88: UserWarning: The Shapely GEOS version (3.8.0-CAPI-1.13.1 ) is incompatible with the GEOS version PyGEOS was compiled with (3.8.1-CAPI-1.13.3). Conversions between both will be slow.
If the original data is using PyGEOS, and getting the nodes/edges is using Shapely's GEOS, and they are converting between the two, then that "will be slow". But when the area has zero nodes/edges, that shouldn't take any time, no matter how slow, because nothing is being converted; yet it takes basically the same amount of time as when there are hundreds of nodes/edges.
Because I need the boundary-crossing edges, what I'm going to try next is to first get the nodes/edges GeoDataFrames for a large region, add x1, y1, x2, y2 -> geometry columns to the edges, then isolate the blocks using an STRtree of my block grid, and finally convert the filtered DFs to NetworkX graphs.
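A plain-Python sketch of that filtering step, using bounding-box overlap in place of the STRtree/geometry test (block bounds and edge endpoints are invented; with shapely one would build an STRtree over the block geometries and query each edge):

```python
def edge_bbox(edge):
    """Bounding box of an edge given as endpoint coords (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = edge
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

def bbox_intersects(a, b):
    """True if two (minx, miny, maxx, maxy) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def edges_per_block(blocks, edges):
    """Assign each edge (by its endpoint bbox) to every block it
    intersects; border-crossing edges land in more than one block,
    which is exactly the redundancy needed to recombine the tiles."""
    return {name: [i for i, e in enumerate(edges)
                   if bbox_intersects(block, edge_bbox(e))]
            for name, block in blocks.items()}

blocks = {"A": (0, 0, 1500, 1500), "B": (1500, 0, 3000, 1500)}
edges = [(100, 100, 200, 200),     # only in A
         (1400, 700, 1600, 700),   # crosses the A/B border -> in both
         (2000, 200, 2200, 300)]   # only in B
print(edges_per_block(blocks, edges))  # {'A': [0, 1], 'B': [1, 2]}
```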
Hi @bramson thanks for the additional details! 👍
Converting that included data into a networkX graph (get_network) takes about 4 minutes..
So just to be sure: do you mean that it takes 4 minutes when you use `get_network(nodes=True, ...)`? Or when you use the `to_graph(nodes, edges)` function? To me it would make more sense for the `get_network(nodes=True)` part to consume most of the time, because that is when the filtering of nodes happens, which is slow. This would also explain why producing an empty network takes time. So if this step takes 4 minutes, then things sound about right.
But if it is `to_graph` consuming most of the time, then something does not seem right and I need to investigate more. It should be relatively fast to convert the network to a graph, unless you have a huge network, or very low available RAM, which would cause the system to write memory swaps to disk for additional memory, which tends to be slow. But as said, if it is the `get_network()` part, then things work as expected.
This may point to a problem in my system, and one thing that occurred to me is the warning I get for version incompatibility: C:\WinPython\python-3.6.5.amd64\lib\site-packages\geopandas\_compat.py:88: UserWarning: The Shapely GEOS version (3.8.0-CAPI-1.13.1) is incompatible with the GEOS version PyGEOS was compiled with (3.8.1-CAPI-1.13.3). Conversions between both will be slow.
I hit this warning as well on my system, but it shouldn't influence the performance here at all. It only has an effect if you do certain operations with geopandas, such as `sjoin` or similar. As you can see from the warning itself, it comes from the geopandas side, so it's not related to pyrosm directly.
Yes, it is the `get_network` step that takes the most time. I didn't realize THAT was the step where the filtering happened. I thought it happened in the reading step: `thisAreaData = pyrosm.OSM('../Data/OSMData/kanto-latest.osm.pbf', bounding_box=thisBoundingPolygon)`
But what you say explains both why the reading step is SO fast and why the `get_network` step is so slow... even when the DF is empty.
Although I realize that warning comes from geopandas, I never saw it until I started using your package, and your package uses geopandas, so (without knowing exactly what the warning even means) I considered it possible that converting the OSM data to geopandas involved the slow geometry conversions the warning describes.
I was able to get the walking network for the core of Tokyo (central 23 districts) in 32 minutes, plus 6.5 minutes to convert it to NetworkX. Based on my previous method, the grids for that same region would have taken 20 hours, and they would be disjoint. Since the edge geometries are already attributes of the edges, I can use the edgesDF and my grid geometries to isolate the rows of the edgesDF that intersect each grid...getting exactly what I need to build network tiles.
I'll have to think about how to get a larger region...seems to require two levels of grids, where the larger grids overlap enough that the small grids are never missing edges...basically your first idea, but moved up one level, so the 1.5x1.5 km tiles themselves are optimally connected.
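Roughly what I have in mind for the two-level grid, with an assumed 15 km coarse tile (all numbers illustrative):

```python
FINE = 1500          # fine tile side in metres (1.5 km, as above)
COARSE = 10 * FINE   # coarse tile side, e.g. 15 km (assumed)
MARGIN = FINE        # pad coarse tiles by one fine tile so no fine
                     # tile near a coarse border misses crossing edges

def parent_coarse_tile(fi, fj):
    """Which coarse tile the fine tile (fi, fj) belongs to."""
    per_side = COARSE // FINE
    return (fi // per_side, fj // per_side)

def coarse_tile_bounds(ci, cj):
    """Bounds of coarse tile (ci, cj), padded by one fine-tile margin on
    every side, so that every fine tile inside the unpadded coarse tile
    can still pick up its border-crossing edges from the padded data."""
    x0, y0 = ci * COARSE, cj * COARSE
    return (x0 - MARGIN, y0 - MARGIN, x0 + COARSE + MARGIN, y0 + COARSE + MARGIN)

print(parent_coarse_tile(12, 3))  # (1, 0)
print(coarse_tile_bounds(1, 0))   # (13500, -1500, 31500, 16500)
```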
Okay great to hear that you were able to get things working! 👍 I'll let you know here once there hopefully is a better/easier way to do the tiling.