Multiprocessing is slow

Open JulienT01 opened this issue 2 years ago • 4 comments

Running "ltest_dqn_vs_mdqn_acrobot.py" with a budget of 10000.

With parallelization="process", running n_fit=4 takes longer than twice the time of n_fit=2.

TODO: add a regression test checking that 2 fits run faster than 2 × 1 fit (with multiprocessing).
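A minimal sketch of what such a regression check could look like, using only the standard library. `dummy_fit`, `time_parallel_fits` and `scales_acceptably` are hypothetical names introduced for illustration; a real test would time the actual fit calls with parallelization="process" instead of the CPU-bound stand-in used here.

```python
import multiprocessing as mp
import time


def dummy_fit(budget):
    """Hypothetical stand-in for one agent fit: CPU-bound work that grows with budget."""
    total = 0
    for i in range(budget):
        total += i * i
    return total


def time_parallel_fits(n_fit, budget, start_method="spawn"):
    """Wall-clock seconds needed to run n_fit dummy fits, one process each."""
    ctx = mp.get_context(start_method)
    procs = [ctx.Process(target=dummy_fit, args=(budget,)) for _ in range(n_fit)]
    start = time.perf_counter()
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return time.perf_counter() - start


def scales_acceptably(t_one_fit, t_two_fits, factor=2.0):
    """Regression condition: 2 parallel fits must take less than factor * 1 fit."""
    return t_two_fits < factor * t_one_fit


if __name__ == "__main__":
    budget = 2_000_000  # arbitrary workload size, chosen only for the demo
    t_one = time_parallel_fits(1, budget)
    t_two = time_parallel_fits(2, budget)
    print(f"1 fit: {t_one:.2f}s, 2 fits: {t_two:.2f}s")
    print("regression check passed:", scales_acceptably(t_one, t_two))
```

Timing-based checks like this tend to be noisy on shared CI machines, so a tolerance below the strict factor of 2 or a few repeated measurements may be needed in practice.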

JulienT01 avatar Jul 07 '23 08:07 JulienT01

From the tests I did, it seems to be a conflict between Python multiprocessing and PyTorch multiprocessing.

I just tried replacing every use of multiprocessing in AgentManager with joblib, and the problem goes away: n_fit=4 is faster than 2 times n_fit=2.

@omardrwch: why did you choose not to use joblib before? It is a lot simpler to code, and I don't see why you would need multiprocessing instead.

TimotheeMathieu avatar Jul 07 '23 14:07 TimotheeMathieu

Hello! Actually, in the very first implementation of AgentManager, I was using joblib. But - at least at that time - there was a problem with jobs that were creating subprocesses themselves (i.e., if an Agent created by an AgentManager creates new processes). If I remember correctly, I got the error "daemonic processes are not allowed to have children".
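For context, that error can be reproduced with the standard library alone: a process marked as daemonic is refused when it tries to start a child of its own. The sketch below is an illustration only; `agent_like_worker`, `can_start_children` and `noop` are hypothetical names, and it does not reproduce the original joblib setup.

```python
import multiprocessing as mp


def can_start_children():
    """True if the current process is allowed to start child processes."""
    return not mp.current_process().daemon


def noop():
    """Trivial target for the nested process."""
    pass


def agent_like_worker():
    """Hypothetical stand-in for an agent that creates its own subprocess."""
    print("daemon:", mp.current_process().daemon,
          "| can start children:", can_start_children())
    try:
        child = mp.Process(target=noop)
        child.start()
        child.join()
        print("nested process started fine")
    except AssertionError as err:
        # multiprocessing refuses to start a child from a daemonic process
        print("nested process refused:", err)


if __name__ == "__main__":
    # Run the same worker first as a regular process, then as a daemonic one.
    for daemon in (False, True):
        outer = mp.Process(target=agent_like_worker, daemon=daemon)
        outer.start()
        outer.join()
```

The non-daemonic run should start its nested process normally, while the daemonic run should report the error quoted above. Workers of `multiprocessing.Pool` are daemonic, which is one way a parallel backend can run into this.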

Another advantage of multiprocessing is the possibility of using spawn, which is more robust (each agent basically has its own interpreter); see e.g. https://stackoverflow.com/a/66113051.

We could maybe add a parallelization = "joblib" option in AgentManager, but I think it's important to keep Python's multiprocessing as an option for those reasons.

omardrwch avatar Jul 10 '23 07:07 omardrwch

Another suggestion would be to use the multiprocessing subpackage of PyTorch (https://pytorch.org/docs/stable/multiprocessing.html#module-torch.multiprocessing) instead of the standard library one.

A short document on multiprocessing best practices in PyTorch: https://pytorch.org/docs/stable/notes/multiprocessing.html

riiswa avatar Jul 10 '23 13:07 riiswa

Hi @omardrwch. I have been using rlberry for some time, and I have encountered a problem that could be related to the multiprocessing module discussed in this issue. I was running 20 simple bandit experiments, each with a horizon of 250 and 1000 workers (simulations). The simulations took longer and longer: one simulation took 7s in the first experiment but 137s in the 20th.

From the time_elapsed data I recorded in my local database, there are noticeable gaps within one simulation run. For example, time_elapsed can jump by several seconds instead of increasing steadily as expected. Does this have something to do with the conflict between multiprocessing modules from different packages, as mentioned by @TimotheeMathieu?
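One way to locate such jumps, sketched under the assumption that the time_elapsed values of a single run are available as a list of seconds. `find_time_gaps` is a hypothetical helper, and the sample values are made up for illustration; they are not taken from the recorded data.

```python
def find_time_gaps(time_elapsed, max_step):
    """Return (index, step) pairs where consecutive records are more than
    max_step seconds apart."""
    gaps = []
    for i in range(1, len(time_elapsed)):
        step = time_elapsed[i] - time_elapsed[i - 1]
        if step > max_step:
            gaps.append((i, step))
    return gaps


# Made-up values for illustration only, not real measurements.
example = [0.0, 0.5, 1.0, 4.0, 4.5, 9.5]
print(find_time_gaps(example, max_step=1.0))  # [(3, 3.0), (5, 5.0)]
```

Comparing where the gaps fall across runs could help tell a steady slowdown apart from occasional stalls.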

To help illustrate this, I have uploaded a snapshot of the data I recorded. Thanks in advance. snapshot.csv

RockmanZheng avatar Oct 24 '23 14:10 RockmanZheng