rlberry
Multiprocessing is slow
Running `ltest_dqn_vs_mdqn_acrobot.py` with a budget of 10000: fitting with `n_fit=4` takes more than twice as long as `n_fit=2` when using `parallelization="process"`.
TODO: add a regression test checking that one run with `n_fit=2` is faster than two runs with `n_fit=1` (with multiprocessing).
From the tests I did, it seems to be a conflict between Python multiprocessing and PyTorch multiprocessing.
I just tried replacing every use of multiprocessing in AgentManager with joblib, and the problem disappears: `n_fit=4` is now faster than two runs with `n_fit=2`.
@omardrwch: why did you choose not to use joblib before? It is a lot simpler to code with, and I don't see why you would need multiprocessing instead.
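For reference, a joblib-based version of "fit several agents in parallel" can be sketched roughly as below. This is not the actual rlberry code; `fit_agent` and its workload are made-up stand-ins for one training run, and `n_jobs` plays the role of `n_fit`.

```python
from joblib import Parallel, delayed


def fit_agent(seed):
    # Stand-in for one agent training run; `seed` is a hypothetical
    # parameter identifying the run. The loop just burns some CPU.
    total = 0
    for i in range(1000):
        total += (seed + i) % 7
    return total


# joblib's default "loky" backend runs each call in a separate process,
# similar in spirit to parallelization="process" in AgentManager.
results = Parallel(n_jobs=2)(delayed(fit_agent)(s) for s in range(4))
print(results)
```

One practical advantage of joblib's loky backend is that it serializes callables with cloudpickle, which tolerates functions defined outside an importable module better than the standard library's pickling.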
Hello! Actually, in the very first implementation of AgentManager, I was using joblib. But, at least at that time, there was a problem with jobs that themselves create subprocesses (i.e., when an Agent created by an AgentManager spawns new processes). If I remember correctly, I got the error `daemonic processes are not allowed to have children`.
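That error comes from the standard library itself: `multiprocessing` refuses to start a subprocess from inside a daemonic worker, and some pool backends mark their workers as daemonic. A minimal stdlib-only reproduction (nothing rlberry-specific; the "fork" start method is used here only so the sketch stays self-contained, and it is unavailable on Windows):

```python
import multiprocessing as mp

# "fork" lets us avoid the `if __name__ == "__main__":` guard that
# the "spawn" method would require; it is the traditional default on Linux.
ctx = mp.get_context("fork")


def try_to_spawn_child(queue):
    # This function runs inside a *daemonic* worker. Starting a subprocess
    # from here is exactly what raises the error quoted above.
    try:
        grandchild = ctx.Process(target=print, args=("hello",))
        grandchild.start()
        grandchild.join()
        queue.put("child started fine")
    except AssertionError as err:
        queue.put(str(err))


queue = ctx.Queue()
worker = ctx.Process(target=try_to_spawn_child, args=(queue,), daemon=True)
worker.start()
worker.join()
message = queue.get()
print(message)  # daemonic processes are not allowed to have children
```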
Another advantage of multiprocessing is the possibility of using the `spawn` start method, which is more robust (each agent basically gets its own interpreter); see e.g. https://stackoverflow.com/a/66113051.
We could maybe add a `parallelization="joblib"` option to AgentManager, but I think it's important to keep Python's multiprocessing as an option for those reasons.
Another suggestion would be to use PyTorch's multiprocessing subpackage (https://pytorch.org/docs/stable/multiprocessing.html#module-torch.multiprocessing) instead of the standard one.
A short document on multiprocessing best practices in PyTorch: https://pytorch.org/docs/stable/notes/multiprocessing.html
Hi @omardrwch. I have been using rlberry for some time and have run into an issue that may be related to the multiprocessing problem discussed here. I was running 20 simple bandit experiments with a horizon of 250 and 1000 workers (simulations). The simulations took longer and longer: in the first experiment one simulation took 7s to complete, while in the 20th experiment a single run took 137s.
From the `time_elapsed` data I recorded in my local database, there are noticeable gaps within a single simulation run. For example, `time_elapsed` can jump by several seconds instead of increasing steadily as expected. Does this have something to do with the conflict between multiprocessing modules from different packages mentioned by @TimotheeMathieu?
To help illustrate this, I have uploaded a snapshot of the data I recorded: snapshot.csv. Thanks in advance.