
Parallelization Improvement Suggestion

Open garyongguanjie opened this issue 4 years ago • 39 comments

In Coach.py, move the MCTS construction inside executeEpisode in learn. Then we could easily parallelize executeEpisode using something like joblib?
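For illustration, a rough sketch of what this could look like with joblib. OthelloGame, NNetWrapper, MCTS and args are names from the repo; a standalone execute_episode function and the checkpoint handling are assumptions:

from joblib import Parallel, delayed

def run_episode(ep):
    # Build a fresh game, net and MCTS per episode so episodes are independent.
    game = OthelloGame(6)
    nnet = NNetWrapper(game)
    nnet.load_checkpoint(args.checkpoint, 'best.pth.tar')
    mcts = MCTS(game, nnet, args)       # MCTS now lives inside the episode
    return execute_episode(game, mcts)  # assumed standalone executeEpisode

# Run all self-play episodes of one iteration in parallel.
examples = Parallel(n_jobs=-1)(delayed(run_episode)(ep) for ep in range(args.numEps))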

garyongguanjie avatar Aug 14 '20 09:08 garyongguanjie

This sounds like a good idea. Would you have to seed each MCTS to prevent them all playing the same game?

goshawk22 avatar Aug 14 '20 11:08 goshawk22

Yes, you have to seed them differently, at least within each iteration.
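For example, one simple scheme (just a sketch; any scheme that gives distinct seeds works) is to mix the process id and the time into the seed:

import os
import time
import numpy as np

def seed_worker():
    # pid distinguishes parallel workers, time distinguishes iterations
    np.random.seed((os.getpid() * 1_000_003 + int(time.time())) % 2**32)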

garyongguanjie avatar Aug 14 '20 13:08 garyongguanjie

Could you implement this and make a PR?

goshawk22 avatar Aug 14 '20 16:08 goshawk22

Perhaps I should leave it up to the author to decide, as he stated that he wanted asynchronous MCTS as described in the paper, although parallelizing episodes is much easier.

garyongguanjie avatar Aug 14 '20 17:08 garyongguanjie

Open to your suggestion as long as the change to code is minimal and does not hamper the readability/ease of understanding the code!

suragnair avatar Aug 15 '20 04:08 suragnair

Any preference for the parallelization framework? I prefer joblib but can probably use Python multiprocessing as well.

garyongguanjie avatar Aug 15 '20 07:08 garyongguanjie

Python multiprocessing would be best; no further dependencies.

suragnair avatar Aug 15 '20 07:08 suragnair

I think the most important part is the batching of neural network input when predicting p and v. When I ran the Othello training, I measured that ~85-90% of the time of mcts.search was consumed by the nnet's prediction. Another experiment on nnet prediction speed vs. batch size on a CNN (super resolution, 32-layer CNN, 64 channels, PyTorch) gives these prediction times (16000 predictions):

batch size 16: [00:50<00:00, 19.79it/s]
batch size  8: [00:57<00:00, 34.93it/s]
batch size  4: [01:54<00:00, 35.01it/s]
batch size  2: [02:49<00:00, 47.08it/s]
batch size  1: [04:57<00:00, 53.70it/s]

Therefore I think the priority is: batching GPU input > parallelizing everything > parallelizing on the CPU.

mha-py avatar Aug 15 '20 22:08 mha-py

Interesting find, I did not think of that! Looks like we'd probably need to hold the queries in a buffer and, once it reaches some batch size, do the forward computation. We'd probably need to hold such changes in another branch, as it makes the code much more complex?

garyongguanjie avatar Aug 16 '20 04:08 garyongguanjie

> I think the most important part is the batching of neural network input when predicting p and v. [...] Therefore I think the priority is: batching GPU input > parallelizing everything > parallelizing on the CPU.

Unfortunately this requires major code changes; I can't think of any way to do it without rewriting huge chunks of code?

garyongguanjie avatar Aug 16 '20 09:08 garyongguanjie

That is true... Is it possible to run multiple threads for multiple games, plus one thread which runs the prediction on the batched states? All MCTS threads do their work and then call nnet.predict, which has a kind of sync block. Then, once a reasonable number of queries have been made, the nnet thread does the prediction and unblocks the waiting threads? I'm not really familiar with threads and blocking and stuff, but I guess this would work.

mha-py avatar Aug 16 '20 15:08 mha-py

I've used the Ray library to parallelize stuff before. It allows you to split the GPU as well which can be useful for predicting on the GPU.

goshawk22 avatar Aug 16 '20 15:08 goshawk22

Some pseudo code:

# MCTS calls:
def nnet.predict(board):
    id = nnet.numQueries
    nnet.boards[id] = board
    nnet.numQueries += 1

    await_nnet_thread_to_do_prediction()

    return nnet.predictions[id]

# NNet thread call:
def nnet.compute():

    block_threads()
    while nnet.numQueries < nnet.minQueries:
        sleep()

    nnet.predictions = nnet.predict_batch(nnet.boards)
    nnet.numQueries = 0
    unblock_waiting_threads()

This could be implemented minimally invasively: at each call of nnet.predict, just check whether parallelism is active and pick the right function.
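To make this concrete, here is a minimal runnable sketch of the same idea using threading.Condition. All names are illustrative, and predict_batch is an assumed batched variant of the wrapper's predict:

import threading

class BatchedNNet:
    # Collects predict() calls from many MCTS threads into one batch.

    def __init__(self, nnet, min_queries):
        self.nnet = nnet
        self.min_queries = min_queries
        self.boards = {}        # query id -> board waiting for prediction
        self.predictions = {}   # query id -> (pi, v) ready to be picked up
        self.next_id = 0
        self.cond = threading.Condition()

    def predict(self, board):
        # Called from the MCTS threads.
        with self.cond:
            qid = self.next_id
            self.next_id += 1
            self.boards[qid] = board
            self.cond.notify_all()          # wake the nnet thread
            while qid not in self.predictions:
                self.cond.wait()            # block until our batch has run
            return self.predictions.pop(qid)

    def compute(self):
        # The dedicated nnet thread runs this in a loop.
        with self.cond:
            while len(self.boards) < self.min_queries:
                self.cond.wait()
            batch, self.boards = self.boards, {}
        ids = sorted(batch)
        preds = self.nnet.predict_batch([batch[i] for i in ids])  # assumed API
        with self.cond:
            self.predictions.update(zip(ids, preds))
            self.cond.notify_all()          # unblock the waiting MCTS threads

A real version would also have to flush a partial batch after a timeout, otherwise the last few queries of an iteration would wait forever for a batch that never fills.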

mha-py avatar Aug 16 '20 15:08 mha-py

I don't know if this would work, but would it make sense to have two branches: one for the vanilla version of the code, and one that focuses on performance, with multi-threading etc. implemented? This would mean that people would still have a fully functioning, easy-to-understand version of the code, plus one that is optimized for performance but is not necessarily as easy to read.

goshawk22 avatar Aug 25 '20 12:08 goshawk22

In https://github.com/suragnair/alpha-zero-general/pull/82 there was a version that used multiprocessing, as far as I remember.

I myself also used the ray library to do the parallelization, which allows a clean, async implementation of the algorithm that can be parallelized across CPU cores/GPUs and can also be scaled across multiple machines (haven't tried this yet, though). I actually found it quite educational to write the async version. Have a look here if you like.

peldszus avatar Aug 30 '20 22:08 peldszus

I also agree that ray is the best way to parallelize this project. It allows easy control when making multiple predictions at once on a GPU, letting you simply split it into, say, 8 equal parts, and it would allow the project to scale effectively to better hardware, which I think is one of its current weaknesses.
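For what it's worth, a sketch of what that could look like with ray actors. The 0.125 GPU fraction and the repo names (OthelloGame, NNetWrapper, MCTS, args) are used for illustration; execute_episode is an assumed standalone version of executeEpisode:

import numpy as np
import ray

ray.init()

@ray.remote(num_gpus=0.125)          # split one GPU between 8 self-play actors
class SelfPlayActor:
    def __init__(self, folder, filename):
        self.game = OthelloGame(6)
        self.nnet = NNetWrapper(self.game)
        self.nnet.load_checkpoint(folder, filename)

    def play_episode(self, seed):
        np.random.seed(seed)                    # distinct games per episode
        mcts = MCTS(self.game, self.nnet, args)
        return execute_episode(self.game, mcts)

actors = [SelfPlayActor.remote(args.checkpoint, 'best.pth.tar') for _ in range(8)]
futures = [actors[i % 8].play_episode.remote(i) for i in range(args.numEps)]
examples = ray.get(futures)                     # gather all training examples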

goshawk22 avatar Aug 31 '20 06:08 goshawk22

I wouldn't mind taking a stab at providing a general implementation. I'm not sure, though, that I understand the higher-level bit. From reading the code, when MCTS performs a search(), each iteration depends on the previous one. Is this incorrect?

mikhail avatar Sep 22 '20 16:09 mikhail

> From reading the code, when MCTS performs a search(), each iteration depends on the previous one. Is this incorrect?

I think this line resets the MCTS, so it can be parallelized. Each episode plays the whole game from the initial board state, so executeEpisode can be done in parallel. https://github.com/suragnair/alpha-zero-general/blob/master/Coach.py#L88

garyongguanjie avatar Sep 23 '20 06:09 garyongguanjie

Oh interesting... that's a much higher-level parallelization than I was thinking of. Do you think this is better than doing a parallel loop inside MCTS? I don't really see a downside to doing it in Coach as you've suggested, except that during actual play (outside of training) Coach isn't used, so MCTS would not be parallel.

mikhail avatar Sep 23 '20 15:09 mikhail

I think parallelizing MCTS itself is much harder to implement, and depending on the number of simulations it may not even be faster. As said above, batching inputs into the GPU is the fastest approach, although simple games like tic-tac-toe/Connect 4 don't really need very deep CNNs, so it can be faster to run everything on the CPU.

garyongguanjie avatar Sep 24 '20 06:09 garyongguanjie

Oh, I see. My problem is that the slowest part of my particular game is the game simulation, not the CNN. The CNN libraries already do a pretty good job of utilizing the GPU.

mikhail avatar Sep 24 '20 14:09 mikhail

Well... I actually can't proceed. There's some obscure difference between Linux and macOS Python (I'm using 3.8.1) that prevents me from properly using the multiprocessing library. I can get simple examples to work, but as soon as I work with the MCTS code, all of it fails with TypeError: 'NoneType' object is not callable 🤦

mikhail avatar Sep 25 '20 22:09 mikhail

Docker to the rescue?!


dsjoerg avatar Sep 25 '20 23:09 dsjoerg

Proof of concept here: #221. Can I get an opinion from someone who is willing to try it?

mikhail avatar Sep 26 '20 02:09 mikhail

I gave it a try and it seemed to work with my CPU on Pop!_OS. However, I have some suggestions/problems.

  1. GPU doesn't work. I get the error RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method (see the snippet after this list).

  2. Is each process seeded separately? Otherwise they will all play the same games.

  3. Would it make sense to create a number of sub-processes equal to the number of available cores, rather than the number of episodes to play, and have each process play multiple games? I can see that there might be GPU memory problems if you need to run 100 episodes per iteration, because these would all run at the same time.
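Regarding points 1 and 3, a sketch of what the fix could look like, assuming a top-level run_episode function as in the earlier examples:

import os
import torch.multiprocessing as mp

if __name__ == '__main__':
    # CUDA cannot be re-initialized in a forked child, so use 'spawn'.
    mp.set_start_method('spawn')
    # One worker per core; the pool hands each worker multiple episodes.
    with mp.Pool(processes=os.cpu_count()) as pool:
        examples = pool.map(run_episode, range(args.numEps))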

goshawk22 avatar Sep 26 '20 08:09 goshawk22

Thanks, @goshawk22, yeah, all valid points. As I said, this is just a proof of concept. My Mac doesn't have CUDA so I have to test remotely. Adding different seeding will be no problem; I think I can use the process id to make sure all seeds are unique. And in my testing the multiprocessing library was throttling the number of spawned processes by CPU count. I'll have to confirm in the documentation, but a simple print loop shows that the iterations are semi-ordered.

mikhail avatar Sep 26 '20 14:09 mikhail

@suragnair can we get a feature branch for this work?

mikhail avatar Sep 26 '20 19:09 mikhail

If anyone has time to help, what would be super useful is a set of tests that ought to pass when executeEpisode is parallelized. Right now, if this code runs but produces wrong results, it's impossible to tell.

mikhail avatar Sep 27 '20 00:09 mikhail

@mikhail parallel branch now available

suragnair avatar Sep 27 '20 07:09 suragnair

@mikhail, I'd like to share the implementation of parallelization as it was done earlier in the forked project. https://github.com/evg-tyurin/alpha-nagibator/blob/48b2ebd3ca272f388c13277297edbb60d98eb64b/SelfPlay_MP.py#L27

The main idea is that the number of processes is defined in the config, each process gets an episode to play from a queue, and each process reports whether its episode completed successfully. All processes contact the single NN model, which is held in a separate thread. So we save GPU memory but force the processes to wait a few moments until the model is ready for the next batch prediction.
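A sketch of the queue part of that design using the standard library (play_one_episode and args.numWorkers are placeholders):

from multiprocessing import Process, JoinableQueue

def worker(queue):
    # Each process pulls episode numbers until the queue is drained.
    while True:
        ep = queue.get()
        try:
            play_one_episode(ep)
        finally:
            queue.task_done()

if __name__ == '__main__':
    queue = JoinableQueue()
    for ep in range(args.numEps):
        queue.put(ep)
    procs = [Process(target=worker, args=(queue,), daemon=True)
             for _ in range(args.numWorkers)]
    for p in procs:
        p.start()
    queue.join()     # returns once every episode is reported done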

evg-tyurin avatar Sep 27 '20 08:09 evg-tyurin

Thanks, Surag! And thanks, Evgeny -- I'll use that as a guideline. Right now I'm most worried that particular libraries, like PyTorch, might not be configured properly for memory management of the neural net. PyTorch has a share_memory() method that needs to be invoked. Scarily, nothing breaks if you don't invoke it, so I'm afraid you'd just get wrong results (as if only 1 episode was run, or even 0!).

If library-specific functions need to be invoked, then a generalized approach is kind of dangerous.
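For reference, the call in question would look roughly like this (self_play_worker and args.numWorkers are hypothetical; share_memory() is the real PyTorch method):

import torch.multiprocessing as mp

nnet.model.share_memory()            # move parameters into shared memory first
procs = [mp.Process(target=self_play_worker, args=(nnet.model, rank))
         for rank in range(args.numWorkers)]
for p in procs:
    p.start()
for p in procs:
    p.join()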

mikhail avatar Sep 27 '20 14:09 mikhail

Welp... sorry, friends. I think I've put about 20 hours into trying to get this working, but there are too many nuances. Between the various libraries, incompatibilities between versions of CUDA and the ML libraries, incorrect pickling/unpickling, memory limits and Python's GIL... I'm spent. I might give it another shot at another time, but not anytime soon.

mikhail avatar Sep 28 '20 02:09 mikhail

I might give this a go using ray. Would it be alright to use an external library?

goshawk22 avatar Sep 30 '20 15:09 goshawk22

@goshawk22 If you can get something generic working, I think it'll be great regardless of the implementation. I was trying using the native multiprocessing.

mikhail avatar Oct 03 '20 19:10 mikhail

I'll give it a try at some point. I'm a bit busy with school and mock GCSEs at the moment so if anyone else has more time feel free to give it a go - don't wait for me!


goshawk22 avatar Oct 03 '20 20:10 goshawk22

Think about it, @goshawk22, you could solve this for me

[screenshot attachment]

😭😭😭

mikhail avatar Oct 16 '20 21:10 mikhail

I've got an initial version working using ray. It is by no means finished, and I intend to continue working on it. So far I have successfully implemented multi-threaded self-play using the CPU only; it does not yet work with a GPU. I think that to get it working with a GPU, I need a separate thread that runs the model predictions, with each self-play thread requesting a prediction from the GPU thread. Currently I just get OOM errors if I try to split the GPU even between 4 processes. I'll push the code soon. Feel free to make suggestions. Code: https://github.com/goshawk22/alpha-zero-general/tree/ray

goshawk22 avatar Oct 17 '20 19:10 goshawk22

I'm not really sure how to have a separate process for getting a model prediction - nothing I tried worked. I'll keep trying, but maybe @evg-tyurin could implement something, seeing as he has already done it? I think the difficult part is the simultaneous model predictions - I'm not really sure how to go about them! At this point, the only advantage of my version is that it can scale to use multiple GPUs and even multiple computers if needed.

goshawk22 avatar Oct 18 '20 08:10 goshawk22

Another drastically different approach would be to parallelize the part in Coach that compares a new network to the previous one ( https://github.com/suragnair/alpha-zero-general/blob/master/Coach.py#L111 )

Here we could, in parallel, create multiple contenders, each playing against the same previous champion. Whichever wins by the larger margin becomes the new champion. The advantage here is that nothing needs to be shared: this could be run in different processes wrapped in a bash script, or even on different machines. I don't think this violates any AlphaZero learning rules.
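A sketch of how that could look, using the repo's Arena and a hypothetical make_player that wraps a net in an MCTS-based move picker (game, champion, contender_files and args are assumed to exist):

from multiprocessing import Pool

def evaluate_contender(checkpoint_file):
    # Each worker pits one contender checkpoint against the same champion.
    contender = NNetWrapper(game)
    contender.load_checkpoint(args.checkpoint, checkpoint_file)
    arena = Arena(make_player(champion), make_player(contender), game)
    champ_wins, cont_wins, draws = arena.playGames(args.arenaCompare)
    return checkpoint_file, cont_wins - champ_wins   # margin from contender's view

if __name__ == '__main__':
    with Pool(len(contender_files)) as pool:
        results = pool.map(evaluate_contender, contender_files)
    best = max(results, key=lambda r: r[1])          # biggest margin becomes champion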

mikhail avatar Oct 28 '20 16:10 mikhail