Parallelization Improvement Suggestion
In Coach.py, move the MCTS construction in learn inside executeEpisode. Then we can easily parallelize the executeEpisode operation using something like joblib?
This sounds like a good idea. Would you have to seed each MCTS to prevent them all playing the same game?
Yes, you have to seed them differently, within each iteration at least.
Could you implement this and make a PR?
Perhaps I should leave it up to the author to decide, as he stated that he wanted asynchronous MCTS as described in the paper, although parallelizing episodes is much easier.
Open to your suggestion as long as the change to the code is minimal and does not hamper the readability/ease of understanding of the code!
Any preference for a parallelization framework? I prefer joblib but can probably use Python multiprocessing as well.
Python multiprocessing would be best: no further dependencies.
I think the most important part is the batching of neural network input when predicting p and v. When I ran the Othello training, I measured that ~85-90% of the time in mcts.search was consumed by the prediction call of the nnet. Another experiment on nnet prediction speed vs. batch size on a CNN (super resolution, 32-layer CNN, 64 channels, PyTorch) gives these prediction times (16000 predictions):
batch size 16: [00:50<00:00, 19.79it/s]
batch size 8: [00:57<00:00, 34.93it/s]
batch size 4: [01:54<00:00, 35.01it/s]
batch size 2: [02:49<00:00, 47.08it/s]
batch size 1: [04:57<00:00, 53.70it/s]
Therefore I think the ordering is: batching GPU input > parallelizing everything > parallelizing on the CPU.
Interesting find, I did not think of that! Looks like we probably need to hold the queries in a buffer and, when it reaches some batch size, do the forward computation. We probably need to hold such changes in another branch as it makes the code much more complex?
Unfortunately this requires major code changes; I can't think of any way to do it without rewriting huge chunks of code?
That is true... Is it possible to run multiple threads for multiple games, plus one thread which calls the prediction on the batched states? All MCTS threads do their work and then call nnet.predict, which has a kind of sync block. Then, when a reasonable number of queries have been made, the nnet thread does the prediction and unblocks the waiting threads. I'm not really familiar with threads and blocking and stuff, but I guess this would work.
I've used the Ray library to parallelize stuff before. It allows you to split the GPU as well, which can be useful for predicting on the GPU.
Some pseudocode, fleshed out here as a runnable Python sketch using a condition variable (the names follow the pseudocode, not the repo; predict_batch is assumed to exist on the wrapped nnet):

```python
import threading

class BatchingNNet:
    def __init__(self, nnet, min_queries):
        self.nnet = nnet
        self.min_queries = min_queries
        self.cond = threading.Condition()
        self.boards = []    # pending (id, board) queries for the next batch
        self.results = {}   # query id -> prediction, filled by compute()
        self.next_id = 0

    def predict(self, board):            # MCTS threads call this
        with self.cond:
            qid, self.next_id = self.next_id, self.next_id + 1
            self.boards.append((qid, board))
            self.cond.notify_all()       # maybe wake the nnet thread
            while qid not in self.results:
                self.cond.wait()         # block until our batch has run
            return self.results.pop(qid)

    def compute(self):                   # the nnet thread calls this in a loop
        with self.cond:
            while len(self.boards) < self.min_queries:
                self.cond.wait()         # wait until enough queries are queued
            ids, boards = zip(*self.boards)
            self.boards = []
            preds = self.nnet.predict_batch(list(boards))  # one batched call
            self.results.update(zip(ids, preds))
            self.cond.notify_all()       # unblock the waiting MCTS threads
```

This could be implemented in a minimally invasive way: just check at each call of nnet.predict whether parallelism is active and decide which function is used.
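A hedged sketch of how the two halves could be wired together (BatchingNNet is from the sketch above; run_mcts_episode is a made-up stand-in for a self-play loop, not the repo's API):

```python
import threading

batcher = BatchingNNet(nnet, min_queries=8)

def nnet_loop():                       # dedicated prediction thread
    while True:
        batcher.compute()

threading.Thread(target=nnet_loop, daemon=True).start()

# each self-play thread drives MCTS and calls batcher.predict(board) inside
workers = [threading.Thread(target=run_mcts_episode, args=(batcher,))
           for _ in range(8)]
for w in workers:
    w.start()
```

One caveat of this design: min_queries must not exceed the number of MCTS threads still running, or compute() will wait forever for queries that never arrive.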
I don't know if this would work, but would it make sense to have two branches: one for the vanilla version of the code, and one that focuses on performance with multi-threading etc. implemented? This would mean that people would still have a fully functioning, easy-to-understand version of the code, and also one that is optimized for performance but not necessarily as easy to read.
In https://github.com/suragnair/alpha-zero-general/pull/82 there was a version that used multiprocessing, as far as I remember.
I myself also used the Ray library to do the parallelization, which allows a clean, async implementation of the algorithm that can be parallelized across CPU cores/GPUs and can also be scaled across multiple machines (haven't tried this yet, though). I actually found it quite educational to write the async version. Have a look here if you like.
I also agree that Ray is the best way to parallelize this project. It allows for easy control when making multiple predictions at once on a GPU, allowing you to simply split it into, say, 8 equal parts, and would allow the project to scale effectively to better hardware, which I think is one of its problems.
I wouldn't mind taking a stab at providing a general implementation. I'm not sure though that I understand the higher level bit. From reading the code when MCTS performs a search() each iteration depends on the previous one. Is this incorrect?
I think this line resets the MCTS, so it can be parallelized. Each episode is like playing the whole game from the initial board state, so executeEpisode can be done in parallel. https://github.com/suragnair/alpha-zero-general/blob/master/Coach.py#L88
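As a hedged illustration of that idea (assuming the episode runner is made a top-level function; run_episode, execute_episode, and num_eps are hypothetical names, while MCTS(game, nnet, args) mirrors the repo's constructor):

```python
from multiprocessing import Pool

import numpy as np

def run_episode(seed):
    np.random.seed(seed)             # distinct seed so the games diverge
    mcts = MCTS(game, nnet, args)    # fresh tree per episode, as in learn()
    return execute_episode(mcts)     # stand-in for Coach.executeEpisode

if __name__ == "__main__":
    with Pool() as pool:             # one worker per CPU core by default
        train_examples = pool.map(run_episode, range(num_eps))
```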
Oh interesting... that's a much higher-level parallelization than I was thinking of. Do you think this is better than doing a parallel loop inside of MCTS? I don't really see a downside to doing it in Coach like you've suggested, except that during actual play (outside of training) Coach isn't used, and MCTS would not be parallel.
I think parallelizing MCTS itself is much harder to implement, and depending on the number of simulations it may not be faster. But as said above, batching inputs into the GPU is the fastest option, although simple games like tic-tac-toe/Connect 4 don't really need very deep CNNs and can be faster to run entirely on the CPU.
Oh, I see. My problem is that the slowest part of my particular game is the game simulation, not the CNN learning. The CNN libraries do a pretty good job of utilizing the GPU already.
Well... I actually can't proceed. There's some obscure difference between Linux and Mac Python (I'm using 3.8.1) that prevents me from properly using the multiprocessing library. I can get simple examples to work, but as soon as I work with the MCTS library all code fails with TypeError: 'NoneType' object is not callable 🤦
Docker to the rescue?!
Proof of concept here: #221. Can I get an opinion from someone who is willing to try it?
I gave it a try and it seemed to work with my CPU on PopOS. However, I think I have some suggestions/problems:

- GPU doesn't work. I get the error RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method (see the sketch after this list).
- Is each process seeded separately? Otherwise they will all play the same games.
- Would it make sense to create a number of sub-processes equal to the number of cores available, as opposed to the number of episodes to play, and have each process play multiple games? I can see that there might be problems with GPU memory if you need to run 100 episodes per iteration, because these would all run at the same time.
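A hedged aside on the first point: PyTorch's multiprocessing wrapper lets you select the 'spawn' start method that the error message asks for, roughly like this (self_play_worker is a hypothetical worker entry point):

```python
import torch.multiprocessing as mp

def self_play_worker(rank):
    ...   # build the model and run this worker's episodes here

if __name__ == "__main__":
    # CUDA cannot be re-initialized in a forked child, hence 'spawn'
    mp.set_start_method("spawn")
    procs = [mp.Process(target=self_play_worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```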
Thanks, @goshawk22, yeah, all valid points. As I said, this is just a proof of concept. My Mac doesn't have CUDA so I have to test remotely. Adding different seeding will be no problem; I think I can use the process id to make sure all seeds are unique. And in my testing the multiprocessing library was throttling the number of spawned processes by CPU count. I'll have to confirm in the documentation, but a simple print loop shows that the iterations are semi-ordered.
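For the seeding, a minimal sketch of the PID-based idea (the seed_worker and base_seed names are made up, not part of the repo):

```python
import os
import random

import numpy as np

def seed_worker(base_seed=0):
    # the PID is unique per live process, so each worker gets its own seed
    seed = (base_seed + os.getpid()) % 2**32   # numpy needs a 32-bit seed
    random.seed(seed)
    np.random.seed(seed)
```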
@suragnair can we get a feature branch for this work?
If anyone has time to help, what would be super useful is a set of tests that ought to pass when executeEpisode is parallelized. Right now, if this code runs but produces wrong results, it's impossible to tell.
@mikhail parallel branch now available
@mikhail, I'd like to share the implementation of parallelization as it was done earlier in the forked project: https://github.com/evg-tyurin/alpha-nagibator/blob/48b2ebd3ca272f388c13277297edbb60d98eb64b/SelfPlay_MP.py#L27
The main idea is that the number of processes is defined in the config, each process gets an episode to play from a queue, and each process reports whether the episode completed successfully. All processes contact the single NN model, which is held in a separate thread. So we save GPU memory but force the processes to wait a few moments until the model is ready for the next batch prediction.
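A minimal sketch of that queue pattern, under the assumption that one self-play episode can be run by a standalone play_one_episode function (a hypothetical name; the NN-serving thread is omitted here):

```python
import multiprocessing as mp

def worker(episode_queue, result_queue):
    while True:
        episode_id = episode_queue.get()
        if episode_id is None:                 # poison pill: shut down
            return
        examples = play_one_episode()          # one full self-play game
        result_queue.put((episode_id, True, examples))

if __name__ == "__main__":
    episode_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(episode_q, result_q))
             for _ in range(num_processes)]    # num_processes from the config
    for p in procs:
        p.start()
    for ep in range(num_episodes):
        episode_q.put(ep)
    for _ in procs:
        episode_q.put(None)                    # one poison pill per worker
```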
Thanks, Surag! And thanks, Evgeny, I'll use that as a guideline. Right now I'm most worried that particular libraries, like PyTorch, might not be configured properly for memory management of the neural net. PyTorch has a share_memory() method that needs to be invoked, and scarily, nothing breaks if you don't invoke it. So I'm afraid you'd just get wrong results (as if only 1 episode was run, or even 0!). If library-specific functions need to be invoked, then a generalized approach is kind of dangerous.
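For reference, a hedged sketch of the PyTorch-specific step being described (model and self_play_worker are placeholders; model stands for the underlying torch.nn.Module of the repo's wrapper):

```python
import torch.multiprocessing as mp

model.share_memory()   # torch.nn.Module.share_memory(): params go to shared memory
p = mp.Process(target=self_play_worker, args=(model,))
p.start()
```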
Welp... sorry, friends. I think I've put about 20 hours into trying to get this working, but there are too many nuances. Between various libraries, incompatibilities between versions of CUDA and ML libraries, incorrect pickling/unpickling, memory limits, and Python's GIL... I'm spent. I might give it another shot at another time, but not anytime soon.
I might give this a go using Ray. Would it be alright to use an external library?
@goshawk22 If you can get something generic working, I think it'll be great regardless of the implementation. I was trying to use the native multiprocessing.
I'll give it a try at some point. I'm a bit busy with school and mock GCSEs at the moment so if anyone else has more time feel free to give it a go - don't wait for me!
Think about it, @goshawk22, you could solve this for me 😭😭😭
I've got an initial version working using Ray. It is by no means finished and I intend to continue working on it. So far, I have successfully implemented multi-threaded self-play using the CPU only; it does not yet work with a GPU. I think that to get it working with a GPU, I need a separate thread that runs the model predictions, with each self-play thread requesting a prediction from the GPU thread. Currently I just get OOM errors if I try to split the GPU even between 4 processes. I'll push the code soon. Feel free to make suggestions. Code: https://github.com/goshawk22/alpha-zero-general/tree/ray
I'm not really sure how to have a separate process for getting model predictions; nothing I tried worked. I'll keep trying, but maybe @evg-tyurin could implement something, seeing as he has already done it? I think the difficult part is simultaneous model predictions; I'm not really sure how to go about it! At this point, the only advantage of my version is that it can scale to use multiple GPUs and even multiple computers if needed.
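For reference, the CPU-only self-play part can be sketched in a few lines of Ray (play_episode, run_self_play_game, and num_eps are illustrative names, not the repo's API):

```python
import random

import ray

ray.init()                            # uses all local CPU cores by default

@ray.remote
def play_episode(seed):
    random.seed(seed)                 # per-task seed so the games diverge
    return run_self_play_game()       # placeholder for one self-play episode

futures = [play_episode.remote(i) for i in range(num_eps)]
examples = ray.get(futures)           # blocks until every episode finishes
```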
Another, drastically different approach would be to parallelize the part in Coach that compares a new network to the previous one (https://github.com/suragnair/alpha-zero-general/blob/master/Coach.py#L111).
Here we could, in parallel, create multiple contenders, each playing against the same previous champion; whichever wins by more becomes the new winner. The advantage here is that nothing needs to be shared. This could be run as different processes wrapped in a bash script, or even on different machines. I don't think this violates any AlphaZero learning rules.
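A minimal sketch of that idea, assuming each contender checkpoint is already on disk (Arena and playGames follow the repo's classes, but evaluate, the player objects, and the checkpoint handling here are made up):

```python
from multiprocessing import Pool

def evaluate(checkpoint_file):
    # each process loads one contender and pits it against the champion
    contender.load_checkpoint(folder, checkpoint_file)
    arena = Arena(champion_player, contender_player, game)
    wins, losses, draws = arena.playGames(num_games)
    return checkpoint_file, wins - losses

if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(evaluate, contender_checkpoints)
    best, margin = max(results, key=lambda r: r[1])   # biggest win margin
```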