Olatunji Ruwase
@barius, @drcege, since these memory issues occur with larger batch sizes, I believe they are due to the increased activation memory footprint. Unfortunately, ZeRO does not help with that memory...
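To illustrate why a larger batch size can trigger OOM even with ZeRO enabled, here is a rough back-of-the-envelope sketch. It assumes mixed-precision Adam training, where model states cost about 16 bytes per parameter and ZeRO stage 3 partitions them across GPUs. The activation estimate is a deliberately crude assumption: the function names and the `factor` constant are hypothetical, not DeepSpeed APIs or measured values, and real activation footprints depend on the model and on whether activation checkpointing is used.

```python
def zero3_model_state_bytes(num_params: int, num_gpus: int) -> float:
    """Per-GPU model-state memory under ZeRO stage 3 with mixed-precision Adam.

    16 bytes/param = 2 (fp16 weights) + 2 (fp16 grads)
                   + 12 (fp32 weights, momentum, variance).
    """
    return 16 * num_params / num_gpus


def activation_bytes(batch_size: int, seq_len: int, hidden: int, layers: int,
                     factor: int = 16, bytes_per_elem: int = 2) -> int:
    """Crude per-GPU activation estimate (hypothetical formula).

    `factor` is an assumed count of hidden-sized tensors kept per layer;
    it is illustrative only. Note there is no num_gpus term: each GPU
    holds the activations for its own micro-batch.
    """
    return batch_size * seq_len * hidden * layers * factor * bytes_per_elem


if __name__ == "__main__":
    gib = 1024 ** 3
    states = zero3_model_state_bytes(num_params=7_000_000_000, num_gpus=8)
    print(f"model states per GPU: {states / gib:.1f} GiB")
    for bs in (1, 2, 4, 8):
        act = activation_bytes(bs, seq_len=2048, hidden=4096, layers=32)
        print(f"micro-batch {bs}: activations ~{act / gib:.1f} GiB")
```

Under these assumptions the model-state term shrinks as GPUs are added, while the activation term grows linearly with the per-GPU batch size and is not reduced by adding GPUs.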
Closing this as out of scope for ZeRO-Infinity.
What is the output of running `which c++` in your terminal?
@alibabadoufu, do you see this issue with a single node run?
Thanks @s-isaev.
@qinqinqaq, this looks like an OOM. Can you share your GPU memory size?
> Assuming that diagnosis is correct, I'm not sure what the recommended fix would be. Should `get_inactive_params` include `INFLIGHT` params?

@adammoody, thanks for the detailed analysis of this bug. To...
@adammoody, can you please try this PR? Thanks!
> The problematic params seem to be from the first few layers of the actor_model, which have been prefetched due to a forward step of the critic_model. I thought maybe...
@adammoody, by the way, I was not able to repro your error on my 4xV100-16GB setup. This makes it harder to resolve.