
Accelerate parallel environment interactions with GPU


findmyway avatar Apr 16 '21 06:04 findmyway

@findmyway This would be my first time working with GPU programming (or any form of concurrent programming, for that matter). So I have a few questions:

  1. I've heard that there is a cost incurred when moving memory from the CPU to the GPU. The present CPU performance of the environments (see benchmark.md) appears to be quite good relative to the time it would take to train a neural network. Can you please explain how we decide whether using a GPU is worth it?
  2. In relation to 1., how do we decide between multithreading on the CPU and using a GPU for acceleration?
  3. If we are going to use a GPU, what exactly does it entail? Does this mean storing the BitArray{3} on the GPU so that the state can directly be taken to the neural network weight stored on the GPU? Where does environment logic get executed - CPU or GPU?

Sid-Bhatia-0 avatar Apr 16 '21 10:04 Sid-Bhatia-0

Actually, Q3 answers Q1: we don't need to transfer data between the CPU and GPU in most cases.

Does this mean storing the BitArray{3} on the GPU so that the state can directly be taken to the neural network weight stored on the GPU?

Yes

Where does environment logic get executed - CPU or GPU?

GPU
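
Roughly something like this (just a sketch assuming CUDA.jl; `GPUEnvState`, `fake_step!`, and the 4×8×8 layout are made up for illustration, none of it is actual GridWorlds.jl code):

```julia
using CUDA

# State layout mirroring BitArray{3}: objects × height × width, but stored as a
# CuArray{Bool,3} since BitArray has no direct GPU counterpart.
struct GPUEnvState
    grid::CuArray{Bool,3}
end

# Hypothetical environment update: flip some cells in-place with a broadcast,
# so the logic itself runs as GPU kernels, not on the CPU.
function fake_step!(state::GPUEnvState)
    state.grid .= .!state.grid   # placeholder for real env logic
    return state
end

# The "neural network" here is just a weight matrix that already lives on the GPU,
# so the state never needs to be copied back to the CPU.
state = GPUEnvState(CUDA.zeros(Bool, 4, 8, 8))
W = CUDA.rand(Float32, 16, 4 * 8 * 8)

fake_step!(state)
logits = W * Float32.(reshape(state.grid, :))   # state feeds the network on-device
```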

findmyway avatar Apr 16 '21 10:04 findmyway

I have thought more deeply about it.

From what I understand, if we are to use the GPU, then the env instance would sit on the GPU and all environment-related computations would happen there, so that the neural network can easily access the state.

I want to know whether it would be worth doing the env logic, like taking actions (which doesn't have much parallelism in it), on the GPU, vs. doing everything on the CPU and moving data between the CPU and GPU at each step. Let:

  - n = total number of steps required to be executed in the env in order to train a policy from scratch
  - x = avg. cost of env logic per step on the GPU
  - y = avg. cost of env logic per step on the CPU (potentially implemented in a multithreaded fashion, which would give even better performance than presently shown in benchmark.md) + avg. cost of moving the state from the CPU to the GPU + avg. cost of moving the computed action from the GPU back to the CPU to be executed in the env
  - z = avg. total cost of fully training a policy in the env on the GPU from scratch, excluding env logic

Ideally, we would want to use the GPU if `(n*y)/z > (n*x)/z`. If the LHS is significantly greater than the RHS, then we can justify this feature. Correct me if I am wrong, but it is not obvious to me that this inequality holds.

Even more importantly, if `(n*y)/z << 1`, that is, if the total cost of env logic on the CPU is much less than the total cost of training on the GPU excluding env logic, then we won't be gaining much by incorporating GPU support. My initial hunch is that this would hold true, because the env logic is quite simple and should cost far less than training a neural network. I have left out reset!, assuming that it gets amortized over the total number of steps and isn't significantly costly overall.
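
As a back-of-envelope illustration of the criterion (all numbers below are made-up placeholders, not measurements):

```julia
# Hypothetical numbers purely to illustrate the decision criterion above;
# none of these are measured values for GridWorlds.jl.
n = 10^7         # total env steps needed to train a policy from scratch
x = 2e-6         # avg. cost (s) of env logic per step on the GPU
y = 1e-6 + 5e-6  # avg. cost (s) per step on the CPU + CPU<->GPU transfers per step
z = 3600.0       # total training cost (s) on the GPU, excluding env logic

lhs = n * y / z   # relative cost of the CPU + transfer route
rhs = n * x / z   # relative cost of the all-on-GPU route

println("use the GPU for env logic? ", lhs > rhs)
println("is env logic negligible anyway? ", lhs < 0.01)
```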

What do you think?

Sid-Bhatia-0 avatar Apr 16 '21 14:04 Sid-Bhatia-0

  1. You may have underestimated the cost of moving data between the CPU and the GPU.
  2. I think you are only talking about ONE environment here. What I mean is rolling out multiple environment instances simultaneously, usually hundreds of environments (see the sketch below).
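
A rough sketch of what batched GPU rollouts could look like (assuming CUDA.jl; the 4×8×8×num_envs layout, `step_kernel!`, and `step_all!` are hypothetical, not part of GridWorlds.jl):

```julia
using CUDA

# Batched state: one extra trailing dimension over the single-env BitArray{3},
# i.e. objects × height × width × num_envs, stored as a CuArray{Bool,4}.
num_envs = 512
states  = CUDA.zeros(Bool, 4, 8, 8, num_envs)
actions = CuArray(rand(1:4, num_envs))   # one action per environment

# Hypothetical per-environment transition: each GPU thread handles one env.
function step_kernel!(states, actions)
    env = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if env <= size(states, 4)
        # placeholder for the real per-environment transition logic
        states[1, 1, 1, env] = actions[env] == 1
    end
    return nothing
end

# A single kernel launch steps all environments at once instead of looping
# over them one by one on the CPU.
function step_all!(states, actions)
    threads = 256
    blocks = cld(size(states, 4), threads)
    @cuda threads=threads blocks=blocks step_kernel!(states, actions)
    return states
end

step_all!(states, actions)
```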

findmyway avatar Apr 16 '21 15:04 findmyway

It seems someone has already done something related:

https://discourse.julialang.org/t/alphagpu-an-alphazero-implementation-wholly-on-gpu/60030

findmyway avatar Apr 27 '21 10:04 findmyway

Thanks for pointing it out! By the way, I won't be able to work on GPUization anytime soon.

Sid-Bhatia-0 avatar Apr 27 '21 20:04 Sid-Bhatia-0