Olatunji Ruwase comments

Results: 648 comments of Olatunji Ruwase

[BUG] CUDA illegal memory access on large batch with ZeRO-infinity

@barius, @drcege, since these memory issues occur with larger batch sizes, I believe they are due to the increased activation memory footprint. Unfortunately, ZeRO does not help with that memory...
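The scaling argument in the comment above can be sketched with some rough arithmetic. This is only an illustration: the function name, the shape parameters, and the `tensors_per_layer` factor are hypothetical, and real activation footprints depend on the model architecture and on features such as activation checkpointing. The only point it makes is that the estimate grows linearly with batch size.

```python
def estimate_activation_bytes(batch_size, seq_len, hidden_size, num_layers,
                              bytes_per_element=2, tensors_per_layer=1):
    """Rough, hypothetical estimate of activation memory in bytes.

    Assumes each layer keeps `tensors_per_layer` activation tensors of shape
    (batch_size, seq_len, hidden_size). Real models keep more (and differently
    shaped) tensors, so treat the result as a scaling illustration only.
    """
    elements_per_tensor = batch_size * seq_len * hidden_size
    return elements_per_tensor * tensors_per_layer * num_layers * bytes_per_element


if __name__ == "__main__":
    # Illustrative shapes, not taken from the issue.
    for batch_size in (4, 8, 16):
        size = estimate_activation_bytes(batch_size, seq_len=2048,
                                         hidden_size=4096, num_layers=32)
        print(f"batch_size={batch_size}: ~{size / 2**30:.1f} GiB of activations")
```

Under these assumptions, doubling the batch size doubles the estimate, regardless of how optimizer states, gradients, and parameters are partitioned.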

[BUG] CUDA illegal memory access on large batch with ZeRO-infinity

Closing this as out of scope for zero-infinity.

subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.

What is the output of running `which c++` in your terminal?
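The same check can be done from Python, which may be handy when the failure shows up inside a build script. This is a minimal sketch using the standard library's `shutil.which`; the helper name `find_compiler` and the extra candidate names `g++` and `clang++` are assumptions for illustration, since the error above only mentions `c++`.

```python
import shutil


def find_compiler(names=("c++", "g++", "clang++")):
    """Return (name, path) for the first candidate found on PATH, else None.

    The candidate list is an assumption; only `c++` appears in the error.
    """
    for name in names:
        path = shutil.which(name)
        if path is not None:
            return name, path
    return None


if __name__ == "__main__":
    found = find_compiler()
    if found is None:
        print("No C++ compiler found on PATH")
    else:
        print(f"{found[0]} -> {found[1]}")
```

If this reports that nothing is found, the `which c++` call in the error would be expected to fail in the same environment.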

Running multinode training and received unclear error for stage 2 training

@alibabadoufu, do you see this issue with a single node run?

Bug in model save with Zero stage 3

Thanks @s-isaev.

[BUG]Step1 RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

@qinqinqaq, this looks like an OOM. Can you share your GPU memory size?

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

> Assuming that diagnosis is correct, I'm not sure what the recommended fix would be. Should `get_inactive_params` include `INFLIGHT` params?

@adammoody, thanks for the detailed analysis of this bug. To...

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

@adammoody, can you please try this PR? Thanks!

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

> The problematic params seem to be from the first few layers of the actor_model, which have been prefetched due to a forward step of the critic_model. I thought maybe...

DeepSpeed-Chat: prefetch of layers during reward model forward pass leads to error during sample generation

@adammoody, by the way, I was not able to repro your error on my 4xV100-16GB setup. This makes it harder to resolve.