
`precompute` memory consumption


In the past, I've been able to create a pandana network and precompute moderately sized queries on a laptop (e.g. the linked example precomputes 8000 m on an OSM network covering the MD-DC-VA MSA).

Using the current version, network.precompute() consumes enormous amounts of memory, often eating up everything on the system. For example, with a network slightly larger than Denver County, the following eats up all the memory on a Linux box with 64 GB of RAM and crashes the process. The same also happens on my MacBook.

import osmnet
import pandana as pdna

# bounding box slightly larger than Denver County (west, south, east, north)
bbox = (-105.20368772, 39.54191854, -104.50619504, 39.98674731)
net = osmnet.network_from_bbox(bbox=bbox)   # returns (nodes, edges) DataFrames

network = pdna.Network(net[0]["x"],
                       net[0]["y"],
                       net[1]["from"],
                       net[1]["to"],
                       net[1][["distance"]])
network.precompute(8000)   # this is the step that exhausts memory

On the same pdna.Network, calling precompute(5000) consumes 40 GB of RAM.

If I don't precompute, I'm able to perform the accessibility queries with hardly any resource consumption (albeit much more slowly, of course).
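For reference, the non-precomputed queries look roughly like this (a sketch with placeholder data, not my actual variables):

import numpy as np
import pandas as pd

# placeholder per-node variable, purely for illustration
jobs = pd.Series(np.ones(len(network.node_ids)), index=network.node_ids)
network.set(network.node_ids, variable=jobs, name="jobs")

# without precompute(), each aggregate call computes the ranges on the fly
result = network.aggregate(8000, type="sum", decay="flat", name="jobs")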

Any idea what could be happening?

Environment

json: 2.0.9
numpy: 1.16.2
pandana: 0.4.1
osmnet: 0.1.5
pandas: 0.24.2
compiler: GCC 7.3.0
system: Linux
release: 4.18.0-16-generic
machine: x86_64
processor: x86_64
CPU cores: 12
interpreter: 64bit

knaaptime avatar Mar 27 '19 23:03 knaaptime

I can't imagine why precompute would be any different. I don't think that code has been touched in ages. The only thing I can think of is to set twoway=False and see how much of a difference that makes.
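Concretely, that would mean building the network with something like this (untested sketch, same inputs as the repro above):

network = pdna.Network(net[0]["x"],
                       net[0]["y"],
                       net[1]["from"],
                       net[1]["to"],
                       net[1][["distance"]],
                       twoway=False)   # treat edges as one-way rather than bidirectional
network.precompute(8000)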

fscottfoti avatar Mar 28 '19 02:03 fscottfoti

🤷‍♂️ That's what I figured, and I couldn't see any reason things would be different now. But I can confirm this behavior using the code above in a new conda environment with pandana from the udst channel.

Unfortunately, I'm not seeing any change with twoway=False.

knaaptime avatar Apr 08 '19 03:04 knaaptime

I'm seeing the same issue. I can run an aggregate over the same distance, and while it does take a long time (30 mins), it does complete without killing the kernel.

svx3 avatar Jun 07 '19 18:06 svx3

Sorry for the circular references, but #104 isn't to blame for this, because I can reproduce it using the pre-compiled versions from pip/anaconda.

knaaptime avatar Jun 08 '19 02:06 knaaptime

To add on to this, I am also running into this issue with precompute.

d-wasserman avatar Jun 26 '19 18:06 d-wasserman

I did a careful analysis of pandana's memory consumption in the precompute step. The conclusion so far is that the memory usage is in line with the data structures we are storing in memory.

I did my tests with a network with 685K nodes (and around 1M edges). Memory consumption (interpreting this graph as directed) is around 7 to 8 GB in the precompute phase.

On the other hand, doing some math on the data structures, we can see that the precompute method basically creates a collection of std::vectors, where each value is another std::vector of pairs holding the target node (as an unsigned int) and a float representing the distance. That's the dms member of the Accessibility class.

In this example, the data structure ends up with about 808 million elements, which works out to around 1,179 reachable nodes per origin on average. Each element is a (uint, float) pair, which on the tested architecture means 4 + 4 bytes.

In conclusion, the size of the created data structure is in theory 808M * 8 bytes = 6.4 GB.
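Spelled out as a quick back-of-the-envelope check (plain arithmetic on the figures above, not pandana code):

nodes = 685_000
elements = 808_000_000              # (uint, float) pairs stored in dms
avg_reachable = elements / nodes    # ~1,179 reachable nodes per origin
bytes_per_element = 4 + 4           # 32-bit node id + 32-bit float distance
print(elements * bytes_per_element / 1e9)   # ~6.46 GB, before std::vector overhead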

That is very close to the observed 7 GB, and given the alignment issues and space occupied by the std::vectors themselves, it’s a very reasonable memory consumption.

So the conclusion is that I'm not seeing any memory explosion in the example I'm following (which is a very big one). It is just using a reasonable amount of memory given the input size.

federicofernandez avatar Sep 24 '19 16:09 federicofernandez

I guess from a user perspective this seems like a new introduction to the library, but thinking about it, it might just be non-linear growth in memory consumption at large bandwidths (just by the nature of how many nodes are contacted over larger distances or time horizons).
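For instance, if reachable nodes scale roughly with the area of the search radius (an assumption on my part, not something I've measured), the jump from 5000 m to 8000 m is already substantial:

print((8000 / 5000) ** 2)   # ~2.56x as many (node, distance) pairs to precompute

That would be consistent with the ~40 GB reported at 5000 m turning into well over the 64 GB available at 8000 m.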

Thanks for taking the time to look at this. If I get a chance to experiment with this more I will report back.

d-wasserman avatar Sep 24 '19 16:09 d-wasserman

Agreed, thank you for digging into this.

knaaptime avatar Sep 24 '19 17:09 knaaptime