
Possible Performance Up in Matrix

Open Guo-astro opened this issue 6 years ago • 10 comments

When using the time-distance matrix, wouldn't it be good to use OpenMP to boost performance? For example, for a 1000×1000 matrix calculation on a 10-core CPU, we could spawn 10 threads and compute a 100×1000 slice in parallel on each. Any idea?

Guo-astro avatar May 25 '19 00:05 Guo-astro

parallelism in practice with valhalla

@Guo-astro valhalla is designed to load balance amongst many simultaneous requests. we use zmq as our message passing protocol of choice. in practice you can run valhalla with 10 parallel processes or threads and the system will load balance the requests amongst the threads/processes.

we have described this pattern, known as the parallel pipeline, in detail here: https://github.com/kevinkreiser/prime_server/#the-point

practically speaking you could run valhalla with 10 threads like so: valhalla_service valhalla.json 10. or you could run it in multi-process mode by starting the server, the load balancers (proxies) and the individual workers separately. in either case, you need to break your request up into smaller requests to take advantage of this parallelism.
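as a client-side sketch of exploiting that (assuming the default /sources_to_targets endpoint on localhost:8002, which you should check against your own config), you can fan independent requests out with a thread pool and let the server load balance them:

```python
# Sketch: fan independent requests out to a local valhalla_service started
# with N threads (e.g. `valhalla_service valhalla.json 10`). The URL below
# is an assumption based on Valhalla's default port; adjust as needed.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

VALHALLA_URL = "http://localhost:8002/sources_to_targets"

def post_matrix(payload, url=VALHALLA_URL):
    """POST one matrix request and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_parallel(payloads, send=post_matrix, workers=10):
    """Dispatch the requests concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, payloads))
```

the `send` callable is injected so the dispatch logic works with any client function, not just this particular HTTP helper.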

design decisions

we made the deliberate decision not to introduce parallelism into the graph algorithms themselves because that would lead to a pretty untenable situation in terms of ensuring a high quality of service when used in online server environments with multiple users making simultaneous requests.

essentially if you allow 10 threads to work on 1 request, then when the second request comes in it has to wait behind the first one. by setting limits on the "size" of any individual request and dedicating the same amount of "resources" to every request, we reduce the possibility that any one user can deny service to another. this is especially important in a routing service because the requests are extremely heterogeneous, meaning the complexity of any one request varies greatly with the distance, number of points and road network density, among other things.

distributed computing via message passing

at any rate the valhalla config and indeed zeromq allow you to run a massive valhalla cluster if you have a bunch of fat machines accessible to one another via tcp. what you could do is run the server and proxy processes (prime_serverd and a prime_proxyd for each of loki, thor and odin) on one machine that is reachable from a vast pool of worker machines via tcp. right now the default is to run these all on one machine using unix domain sockets as the message passing mechanism but you can just as well use tcp sockets. then, by changing valhalla.json to tcp addresses, you could hook up 10 machines each with 10 worker processes and have a huge farm to crunch your request load.

kevinkreiser avatar May 25 '19 01:05 kevinkreiser

Thanks for the detailed explanation! Another related question is about memory usage in Valhalla. I observed that when calculating a 2000×2000 distance matrix against a ~2GB tile tar, about 25GB of memory is used. So following the design decisions above, using one fat machine, if I run valhalla_service valhalla.json 2 and send 2 requests simultaneously, will it use ~50GB of memory?

Guo-astro avatar May 25 '19 12:05 Guo-astro

If you want to achieve matrix speedup using concurrency, it's relatively straightforward to split a large matrix into many smaller ones, then re-build the large matrix on the client side. Valhalla can then process the many smaller matrix requests concurrently. You'd need to benchmark to find the sweet spot for the size of the smaller pieces, but I've done this in the past and the speedups are pretty much what you'd expect, at the cost of a little extra code and memory on the client side.

Here's a script I wrote eons ago to do this with the Mapbox Matrix API: https://gist.github.com/danpat/83eaed80b6241310ebdf1c0bef04a524, but the approach would be very similar for Valhalla.
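A minimal sketch of that split-and-reassemble logic, independent of transport: `query` is a stand-in for whatever client call you make (e.g. a POST to /sources_to_targets), and the 250-element default chunk size is an arbitrary assumption you'd tune by benchmarking.

```python
# Sketch of split-and-reassemble: `query(src_chunk, tgt_chunk)` must return
# a list-of-lists block with one row per source and one column per target.

def chunks(seq, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def chunked_matrix(sources, targets, query, size=250):
    """Build the full |sources| x |targets| matrix from smaller requests."""
    full = [[None] * len(targets) for _ in sources]
    for si, src_chunk in enumerate(chunks(sources, size)):
        for ti, tgt_chunk in enumerate(chunks(targets, size)):
            block = query(src_chunk, tgt_chunk)  # one small matrix request
            # copy the block into its place in the full matrix
            for r, row in enumerate(block):
                for c, val in enumerate(row):
                    full[si * size + r][ti * size + c] = val
    return full
```

The block requests are all independent, so they can also be issued concurrently before stitching.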

danpat avatar May 27 '19 05:05 danpat

@danpat @kevinkreiser Thanks very much for the replies! I have implemented the chunked method. Now the question becomes: what algorithm does prime_server use to load balance multiple requests? From the top -H command, with valhalla_service valhalla.json 1, I observed 16 threads running. From my naive understanding it should be 1 for the httpd server, 3 for proxies, 3 for workers -> 7 in all. Why did I get 16?

Guo-astro avatar May 28 '19 04:05 Guo-astro

in our code we spawn a thread per item and then on top of that zmq uses threads under the hood (see the section marked I/O threads here: http://zeromq.org/whitepapers:architecture), so you end up with:

- main thread
- server thread + 1 zmq thread
- loki proxy thread + 1 zmq thread
- loki worker thread + 1 zmq thread
- thor proxy thread + 1 zmq thread
- thor worker thread + 1 zmq thread
- odin proxy thread + 1 zmq thread
- odin worker thread + 1 zmq thread

Which makes... 15 threads. I'm not sure where the 16th is coming from at the moment...

kevinkreiser avatar May 28 '19 14:05 kevinkreiser

@kevinkreiser and others - Sorry for my naive question. I just started looking into Valhalla. Do I need to use prime_server or could I have my own load balancer (with a bunch of servers behind it) with a number of true parallel threads running on each server? Maybe I have a few days of reading ahead to wrap my head around it. Please advise. Thank you!

abvhinash avatar Jul 04 '20 01:07 abvhinash

prime-server stitches together the various services - you will need to use it to serve traffic over HTTP.

The other modules (loki, thor, odin, etc) communicate with each other over ZeroMQ - by default, it's configured to allow the modules to talk over IPC sockets (Unix domain sockets) on the same machine. prime-server is a required module if you want to access everything via HTTP - it's the only part that implements the HTTP protocol.

Here's an example in the config where the ZeroMQ endpoints are configured:

https://github.com/valhalla/valhalla/blob/master/scripts/valhalla_build_config#L87

Lines like this are sprinkled through the config.

In theory, you can modify that to use network sockets instead of ipc://, and thus distribute the modules across multiple servers - but I haven't seen it done very often (or at all, to be honest). @kevinkreiser would know whether it's in a working state and exactly how to configure it.
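As an illustrative sketch only (the real valhalla.json key layout varies by version, and the host and ports below are made up), the mechanical change is rewriting each ipc:// endpoint string to a unique tcp:// address:

```python
# Hypothetical helper: walk a parsed valhalla.json-style dict and replace
# every ipc:// endpoint with a distinct tcp:// address so the modules can
# talk across machines. Consult your generated config for the real keys.

def ipc_to_tcp(config, host, base_port=8100):
    """Return a copy of `config` with ipc:// endpoints rewritten to tcp://."""
    port = [base_port]  # mutable counter shared by the recursion

    def walk(node):
        if isinstance(node, dict):
            return {k: walk(v) for k, v in node.items()}
        if isinstance(node, list):
            return [walk(v) for v in node]
        if isinstance(node, str) and node.startswith("ipc://"):
            addr = f"tcp://{host}:{port[0]}"
            port[0] += 1
            return addr
        return node

    return walk(config)
```

Each worker machine would then point its loki/thor/odin processes at the central proxies' tcp addresses instead of local sockets.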

danpat avatar Jul 06 '20 15:07 danpat

Hi, I've written code to break up the matrix request, make the matrix API calls, and reassemble the results.

I've also checked that the threads are running in parallel on the server, consuming all cores. However, I am getting slower speed than with a single call (where it runs on a single HW thread).

I tested only on smaller matrix so far on 500x500 size.

It almost seems that the threads are locking for access internally, but again, I could be wrong.
Has anyone had success with a speedup?

P.S. in my setup, the actor instances share the baldr::GraphReader pointer so the graph tiles are reused among instances.

mkandulavm avatar Feb 14 '24 18:02 mkandulavm

Part of the problem with this approach is that there will be much repeated work - dividing 1 large matrix into N smaller ones does not mean that each small request does 1/N of the work.

The reason is that edge exploration will occur multiple times. If you have coordinate A that appears in several of your split requests, then Valhalla will have to repeat the routing from A in each of the requests it appears in. When you perform one request, that only happens once.
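A toy cost model makes this concrete. Assume each matrix is computed as one graph search per source, and that a search's cost depends mostly on how far it must explore rather than on how many targets it collects - a simplifying assumption for illustration, not Valhalla's actual cost model. Then splitting the targets into blocks re-runs every source's search once per block:

```python
# Toy model: total per-source searches when the target list is split into
# blocks of tgt_chunk. Every source is re-searched once per target block.
import math

def searches_needed(n_sources, n_targets, tgt_chunk):
    target_blocks = math.ceil(n_targets / tgt_chunk)
    return n_sources * target_blocks
```

Under this model a single 1000x1000 request costs 1000 searches, while splitting it into 250x250 blocks costs 4000, so the concurrency has to beat roughly a 4x increase in total work before it pays off.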

As for locking - you did not say how you're running Valhalla. Are you using prime_httpd or valhalla_service? Are you running it with multiple threads? How did you start it?

danpat avatar Feb 14 '24 18:02 danpat

are you using prime_httpd or using valhalla_service

I am not using these. I created actor instances on the server side c++ code.

mkandulavm avatar Feb 14 '24 19:02 mkandulavm