Olatunji Ruwase comments

Results: 648 comments of Olatunji Ruwase

Why ZeRO-2 use more CUDA Memory than ZeRO-1？

@dancingpipi, apologies for the delayed response. Hope the answers below are still helpful. 1. ZeRO is designed to reduce the memory overheads of very large models, with billions of parameters....

'gamma', 'theta' not found in progressive layer drop

@FatCockHu, can you please open a separate ticket for your error? Thanks!

'gamma', 'theta' not found in progressive layer drop

@marchen00, the PLD implementation is split between the DeepSpeed engine and the client. In particular, DeepSpeed maintains the theta and gamma values [here](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/progressive_layer_drop.py), and with [this logic](https://github.com/microsoft/DeepSpeed/blob/9bf1e9af3a3a958fc74b5d5d57e56b72559f5458/deepspeed/runtime/engine.py#L1530-L1531) makes them available...
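
For readers unfamiliar with what theta and gamma control, here is a minimal sketch of a progressive layer drop schedule. It assumes the exponential decay form commonly used for PLD, where the keep probability starts at 1.0 and decays toward a floor. The function name `pld_theta` and the parameter name `theta_bar` are hypothetical, chosen for illustration; this is not DeepSpeed's API, and the linked source file is the reference for the actual behavior.

```python
import math


def pld_theta(global_step: int, theta_bar: float = 0.5, gamma: float = 0.001) -> float:
    """Hypothetical helper: layer keep probability at a given training step.

    Assumed schedule: theta(t) = (1 - theta_bar) * exp(-gamma * t) + theta_bar
    - theta_bar: floor that the keep probability decays toward
    - gamma: decay rate; larger values reach the floor sooner
    """
    return (1.0 - theta_bar) * math.exp(-gamma * global_step) + theta_bar


if __name__ == "__main__":
    # Print the schedule at a few steps to show the decay from 1.0 toward theta_bar.
    for step in (0, 1000, 10000):
        print(step, pld_theta(step))
```

Under this assumed schedule, the client model would read the current theta at each step and use it to decide how aggressively to drop layers.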

bing_bert script error

@jeyblu, apologies for the delayed response. Is this still a problem?

bing_bert script error

Can you please share the output of running `ds_report` in your shell?

Bing BERT

Thanks for trying out DeepSpeed. Unfortunately, these datasets are not yet publicly available. We are working on resolving this. Apologies for the inconvenience.

Bing BERT

@piyushghai We are pleased to announce support for training Bing BERT with the Nvidia dataset, #27. Please give it a try.

Bing BERT

@sriramsrao, @oliverhu, @tomekrut We have added support for training with the Nvidia dataset. Thanks for your patience. We would really appreciate feedback on your experience trying it out. Thanks!

Bing BERT

@liuyq47 Thanks for trying out the new dataset. Can you be more specific on the timer names and values showing the spikes? The highlighted section of the screenshot seems fine...

Bing BERT

Thanks for the clarification. So to confirm, you are observing occasional spikes of allreduce time from ~229 to ~415. Yes, that does look odd. To help repro for a quick...