
Deadlock when creating many processes


When many processes are created, the workers clog up: they keep checking the status of a few waiting processes without actually running the processes those depend on.

This is due to the RabbitMQ prefetch limit.

Possible ways to solve this:

  • Set the prefetch limit to 0 (no limit). Maybe this would create some problems with the heartbeat? (A minimal sketch of what this means at the AMQP level follows the list.)
  • Add a scheduler which tells the workers to "drop" waiting processes and pick up new ones.
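
For reference, here is a sketch of what removing the prefetch limit would look like at the AMQP level, written against aio-pika directly. This is illustrative only: AiiDA's actual messaging goes through kiwipy, and the broker URL is a placeholder. In AMQP, a prefetch count of 0 means "unlimited".

```python
import asyncio

import aio_pika


async def main():
    # Illustrative connection URL; adjust for the actual broker.
    connection = await aio_pika.connect_robust('amqp://guest:guest@localhost/')
    channel = await connection.channel()

    # prefetch_count=0 means "no limit": the broker pushes every queued task
    # to this consumer instead of capping the number of unacknowledged ones.
    await channel.set_qos(prefetch_count=0)

    await connection.close()


asyncio.run(main())
```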

greschd avatar Apr 10 '18 15:04 greschd

The minimum amount of work for 1.0.0 should be to at least see if we can add a warning when a user launches a process and the available daemon worker slots are below a certain percentage or number. I am not sure where this should be triggered. The most logical point would be in the launchers, but that means that every call would require some sort of query to all daemon workers to get the number of open slots, if that is even possible, which might be quite slow.

A less accurate but more feasible alternative is to simply get the number of workers, multiply it by the global prefetch-count constant, and compare the result with the number of active processes. Actually, we could run this only when calling verdi process list and issue the warning at the bottom, just as we do when the daemon is not running. We should add instructions on what to do (verdi daemon incr), and perhaps also link to documentation explaining the problem. Suggested threshold: 90%.
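
A minimal sketch of that heuristic, assuming the caller supplies the worker count and using an illustrative stand-in for the prefetch-count constant (the helper below is not actual aiida-core code):

```python
from aiida.orm import ProcessNode, QueryBuilder

SLOTS_PER_WORKER = 200   # stand-in for the global prefetch-count constant
WARNING_THRESHOLD = 0.9  # warn when 90% of the slots are occupied


def warn_if_slots_exhausted(num_workers: int) -> None:
    """Print a warning when active processes approach the available slots."""
    total_slots = num_workers * SLOTS_PER_WORKER

    # Count the processes that are not yet terminated; the process state is
    # stored as a node attribute, which is also why this filter can be slow.
    builder = QueryBuilder().append(
        ProcessNode,
        filters={'attributes.process_state': {'in': ['created', 'waiting', 'running']}},
    )
    num_active = builder.count()

    if total_slots and num_active / total_slots >= WARNING_THRESHOLD:
        print(
            f'Warning: {num_active} active processes for {total_slots} worker '
            'slots; consider running `verdi daemon incr`.'
        )
```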

sphuber avatar Apr 08 '19 11:04 sphuber

I pushed an implementation of this warning fix to a branch in my fork.

What we learned during this process is that the query to generate the table in verdi process list can be quite costly (~18 seconds on @sphuber's 10.5-million-node DB). The problem with piggy-backing on this query to get the number of active processes is that one really needs a second query for only the active processes: the user can apply arbitrary filters to the output of verdi process list, but we want to guarantee that the warning is printed in all cases (except for raw output). Taking 30+ seconds to get a response is excessive. The likely cause of the poor performance is the filtering on process state, which is stored as an attribute.

A second problem is that getting a CircusClient in order to obtain the number of workers is also quite slow (about 1 second).

A final problem is what to tell the user to do when the warning fires. Currently it suggests that they should increase the number of workers. However, we are not sure what a sensible upper limit on the number of processes is, nor what is feasible on a typical OS and system setup. In this sense, giving this advice blindly is perhaps a bad idea.

We need to think about whether printing a warning is the right course of action for now, and make plans to fix the underlying cause of the problem.

ConradJohnston avatar Apr 16 '19 14:04 ConradJohnston

Thanks @ConradJohnston for the improvement!

@sphuber Should we really close this issue though, or was that just done automatically/accidentally because the PR mentions this issue?

To add to my initial suggestions, one could also implement logic whereby a process gets dropped (and added to the end of the queue) once it has been waiting for a specific number of "worker cycles" or a specific amount of time.
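
A rough sketch of that rotation idea, with entirely hypothetical names (the worker internals in aiida-core look nothing like this; it only illustrates the policy):

```python
import time
from collections import deque

MAX_WAIT_SECONDS = 300  # hypothetical cut-off before a waiting process is dropped


def rotate_waiting(slots: deque) -> None:
    """Requeue processes that have waited longer than the cut-off.

    `slots` holds (process, waiting_since) pairs; `release_to_queue` is a
    hypothetical hook that hands the task back to the broker so another
    worker (or this one, later) can pick it up again.
    """
    now = time.monotonic()
    for _ in range(len(slots)):
        process, waiting_since = slots.popleft()
        if now - waiting_since > MAX_WAIT_SECONDS:
            process.release_to_queue()
        else:
            slots.append((process, waiting_since))
```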

Maybe my tasks were a bit excessive in the number of processes created, but I've found launching a large number of workers to be problematic in terms of CPU load -- for workers which end up not doing much.

greschd avatar May 11 '19 08:05 greschd

Yeah, it was closed accidentally. The heavy CPU load, was that on the alphas of aiida-core? There were some inefficiencies, since solved, that could cause the workers to spin too often; e.g. the polling fallback was firing all the time because the default was wrongly set to 0. The actual solution to this problem will probably be quite complicated: any typical solution might add other inefficiencies, and some of what we might want to do would require functionality that RabbitMQ was not designed for and so cannot provide.

sphuber avatar May 11 '19 14:05 sphuber

Yeah that was on alpha, glad to hear that should now be fixed.

True, the scheduling definitely isn't an easy fix. Another option (maybe as a temporary fix) would be to have a watcher thread that dynamically scales the workers (as an optional setting).
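
A sketch of such a watcher; `verdi daemon incr` is a real command, but the polling loop, the thresholds, and the two callables are illustrative assumptions, not part of aiida-core:

```python
import subprocess
import time

SLOTS_PER_WORKER = 200  # stand-in for the prefetch-count constant
CHECK_INTERVAL = 60     # seconds between load checks


def watch_and_scale(get_num_workers, get_num_active):
    """Add a daemon worker whenever all slots are occupied.

    `get_num_workers` and `get_num_active` are assumed to be supplied by
    the caller; they are hypothetical helpers.
    """
    while True:
        if get_num_active() >= get_num_workers() * SLOTS_PER_WORKER:
            subprocess.run(['verdi', 'daemon', 'incr', '1'], check=True)
        time.sleep(CHECK_INTERVAL)
```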

greschd avatar May 12 '19 01:05 greschd

The proper solution to this will require significant changes to the way RabbitMQ is used, so I am putting this in v2.0.0. For now, the "fix" implemented in #2774 should give some breathing room, but the problem is not fixed.

sphuber avatar Jun 17 '19 08:06 sphuber

I'm thinking about how to solve this in a more general way; for now I'm collecting information here:

https://github.com/aiidateam/aiida-core/wiki/Scaling-of-AiiDA-workers-and-slots

because it is easier than re-reading the full conversation each time.

muhrin avatar Dec 12 '19 19:12 muhrin

Should we continue discussing here, and you'll collect it there? Or edit there directly?

> The problem with this is that the parent would then not be able to receive messages and be unable to be killed, paused, etc.

I guess when the total number of processes is higher than the number of "slots" (= workers × processes per worker), that is kind of unavoidable, right? Is my understanding correct that the messages would still get queued, and the processes would receive them once some other worker picks them up?

I think some sort of rotation / scheduling is needed here. Some comments:

  • As long as the number of slots is larger than the number of processes, I think a process should not be "pulled off".
  • There could either be a dedicated "load balancer", or some heuristic for when a process is "dropped" while waiting. A middle ground would be a "monitor" that observes process performance / stats, and adjusts "heuristic parameters".

Since this seems like a classical scheduling problem, we should definitely consult existing solutions / algorithms.

greschd avatar Dec 12 '19 23:12 greschd

~~Discussing here is good, I was just thinking of putting summary stuff in the wiki.~~

@greschd, what you say is good, except we still need to deal with the problem of how to make sure processes that are "pulled off" can still handle incoming messages. Additionally, we have to be careful that any scheduler-like implementation doesn't end up in a situation where, say, a process that is waiting for a transport is pulled off before it gets one, because it looks like it is doing nothing. In the worst case, processes could never get their transport because they are always pulled off before receiving it.

Summary below (this used to be in the wiki, but is best kept here):

Background information

The AiiDA daemon runs processes by launching workers which each have a finite number of slots representing the number of simultaneous (in the coroutine sense) processes that they can run. Each worker maintains a single connection to the database.

This choice of implementation was guided by the following constraints:

  1. There is a database connection limit (typically 100 on Ubuntu) that should be respected
  2. Threads don't always play nicely with Django or SQLAlchemy, and indeed Django uses a separate connection per thread.

Problems

There are a number of problems with the current solution, some of which have been discussed in #1397. Specifically:

  1. AiiDA workflows are allowed to block for lengthy periods of time, e.g. while carrying out a complex local calculation. This means that other processes on the same worker are blocked (and cannot even receive messages), which makes it undesirable to have too many on the same worker.
  2. Workchains may spawn child processes that they wait on; however, if there are no slots left, the child will never run and the parent will wait indefinitely whilst blocking a slot (see the sketch below this list).
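
To make problem 2 concrete, here is a minimal parent workchain shape that deadlocks when every slot is occupied. `ChildWorkChain` is a hypothetical process class; the `submit`/`ToContext` pattern itself is standard AiiDA:

```python
from aiida.engine import ToContext, WorkChain


class ParentWorkChain(WorkChain):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.outline(cls.submit_child, cls.inspect_child)

    def submit_child(self):
        # The parent occupies a slot while it waits; if no free slot exists
        # on any worker, the child never starts and the parent waits forever.
        future = self.submit(ChildWorkChain)  # hypothetical child process
        return ToContext(child=future)

    def inspect_child(self):
        self.report(f'child finished with pk {self.ctx.child.pk}')
```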

Possible solutions

  1. It is tempting to try to pull a waiting parent process off a worker to allow the child to run, then re-create it once the child finishes so it can continue. The problem with this is that the parent would then not be able to receive messages and would be unable to be killed, paused, etc.

muhrin avatar Dec 13 '19 08:12 muhrin