Olatunji Ruwase
Closing for lack of response. Please re-open if needed.
@lxd551326, it seems you are seeing two different issues. 1. CUDA OOM using DeepSpeed for a model that works with pure PyTorch is very strange and should be investigated. Can you...
@lhyscau, @DavidYanAnDe, and @lxd551326 are you able to provide repro steps?
@lqniunjunlper, are you able to share repro steps for this issue? Thanks
@Taiinguyenn139, thanks for helping to resolve this issue. Closing this issue.
@exnx, thanks for debugging this issue. Your analysis is correct. The purpose of that assertion is to confirm the existence of at least one `layer_*` file when using pipeline parallelism....
@Looong01, it seems your `localhost` is not configured for password-less ssh, which is a requirement for DeepSpeed. Please see https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node Although you are using a single node, the `--autotuning` option operates as...
Closing for lack of activity. Please re-open if needed.
@gawain000000, can you clarify your goals? There are two different solutions for latency-oriented and throughput-oriented (and low-budget) scenarios. I noticed the use of `deepspeed.init_inference` and zero stage 3...
@Xiang-cd, gradient accumulation in DeepSpeed works as follows: 1. Assume each training [iteration](https://www.deepspeed.ai/getting-started/#training) consists of fwd, bwd, step. 2. Increment the [micro-step counter](https://github.com/microsoft/DeepSpeed/blob/2a56f53395b2e0ef2ffe9947671fe153ba026328/deepspeed/runtime/engine.py#L2279) in step, and use the configured gradient accumulation steps...
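The micro-step mechanism described above can be sketched as follows. This is a minimal illustration, not DeepSpeed's actual implementation; the class, method, and attribute names (`EngineSketch`, `is_gradient_accumulation_boundary`, etc.) are assumptions chosen to mirror the idea of a micro-step counter gating the optimizer update:

```python
# Minimal sketch of gradient-accumulation gating via a micro-step counter.
# Not DeepSpeed's real engine; names here are illustrative assumptions.

class EngineSketch:
    def __init__(self, gradient_accumulation_steps):
        self.gas = gradient_accumulation_steps
        self.micro_steps = 0    # incremented on every engine.step() call
        self.global_steps = 0   # incremented only at accumulation boundaries

    def is_gradient_accumulation_boundary(self):
        # True on every gas-th micro-step (i.e., when gradients for a full
        # effective batch have been accumulated).
        return (self.micro_steps + 1) % self.gas == 0

    def step(self):
        # fwd/bwd have already accumulated gradients before step() is called.
        if self.is_gradient_accumulation_boundary():
            # A real engine would run optimizer.step() and zero the grads here.
            self.global_steps += 1
        self.micro_steps += 1


engine = EngineSketch(gradient_accumulation_steps=4)
for _ in range(8):  # 8 micro-steps with gas=4 -> 2 optimizer updates
    engine.step()
print(engine.global_steps)  # 2
```

So with `gradient_accumulation_steps=4`, only every fourth call to `step()` performs an actual optimizer update; the other calls simply accumulate gradients and advance the micro-step counter.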