Olatunji Ruwase

648 comments by Olatunji Ruwase

@barius, @drcege, since these memory issues occur with larger batch sizes, I believe they are due to the increased activation memory footprint. Unfortunately, ZeRO does not help with that memory...
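To make the batch-size point concrete, here is a rough back-of-the-envelope sketch, not a measurement from these issues. It assumes mixed-precision Adam at about 16 bytes of model state per parameter, with that state fully partitioned across GPUs (ZeRO stage 3). The per-element activation factor and the model shape are illustrative placeholders, and both function names are hypothetical.

```python
def zero3_model_state_bytes(num_params, num_gpus, bytes_per_param=16):
    """Per-GPU model-state bytes when the states are partitioned across GPUs.

    bytes_per_param=16 is an assumption: fp16 weights (2) + fp16 grads (2)
    + fp32 weights, momentum, and variance for Adam (12).
    """
    return num_params * bytes_per_param / num_gpus


def activation_bytes(batch_size, seq_len, hidden, num_layers, factor=34):
    """Rough per-GPU activation bytes for a transformer stack.

    factor is an illustrative placeholder for bytes stored per
    (token, hidden unit, layer); real values depend on the model,
    the kernels, and whether activation checkpointing is enabled.
    """
    return batch_size * seq_len * hidden * num_layers * factor


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    num_params, num_gpus = 7_000_000_000, 4
    seq_len, hidden, num_layers = 2048, 4096, 32

    states = zero3_model_state_bytes(num_params, num_gpus)
    for batch_size in (1, 2, 4, 8):
        acts = activation_bytes(batch_size, seq_len, hidden, num_layers)
        # Model states stay constant; activations grow linearly with batch size.
        print(f"batch={batch_size}: states={states / 2**30:.1f} GiB, "
              f"activations={acts / 2**30:.1f} GiB")
```

Under these assumptions the model-state term does not change with batch size, while the activation term doubles each time the batch doubles, which is why partitioning model states alone does not relieve this kind of OOM.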

Closing this as out of scope for ZeRO-Infinity.

@alibabadoufu, do you see this issue with a single-node run?

@qinqinqaq, this looks like an OOM. Can you share your GPU memory size?

> Assuming that diagnosis is correct, I'm not sure what the recommended fix would be. Should `get_inactive_params` include `INFLIGHT` params?

@adammoody, thanks for the detailed analysis of this bug. To...

> The problematic params seem to be from the first few layers of the actor_model, which have been prefetched due to a forward step of the critic_model. I thought maybe...

@adammoody, by the way, I was not able to repro your error on my 4xV100-16GB setup. This makes it harder to resolve.