DeepSpeedExamples

Example models using DeepSpeed

Results 274 DeepSpeedExamples issues

It seems there is a bug in our DeepSpeed SQuAD fine-tuning code: the model configuration file contains duplicate keys for the dropout probability settings. With the bug, it is...
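Duplicate keys are easy to miss because Python's `json` module keeps the last value for a repeated key without any warning. A minimal sketch of a check that rejects such files, assuming the configuration is JSON; the key name `hidden_dropout_prob` and the helper name are illustrative and not taken from the issue:

```python
import json


def load_json_rejecting_duplicates(text):
    """Parse JSON, raising ValueError if any object repeats a key."""

    def check_pairs(pairs):
        # json calls this hook once per JSON object with its (key, value) pairs.
        seen = {}
        for key, value in pairs:
            if key in seen:
                raise ValueError(f"duplicate key in config: {key!r}")
            seen[key] = value
        return seen

    return json.loads(text, object_pairs_hook=check_pairs)


# Hypothetical config fragment with the same dropout key given twice.
sample = '{"hidden_dropout_prob": 0.1, "hidden_dropout_prob": 0.0}'

# The default parser silently keeps the last value.
print(json.loads(sample))

try:
    load_json_rejecting_duplicates(sample)
except ValueError as err:
    print(err)
```

Running a check like this when the config is loaded turns a silent override into an explicit error.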

In the pre-layernorm version of BERT, the application of layernorm on the embeddings is redundant since it is applied by the first transformer layer as well. For reference: https://github.com/NVIDIA/Megatron-LM/blob/19301985dd31c8b612095cbad15bd903e8ddd497/megatron/model/language_model.py#L165
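The redundancy argument can be illustrated with a small sketch: normalizing an already normalized vector changes it only by an amount on the order of `eps`. This assumes a layernorm without learnable scale and bias and looks only at the input to the first sublayer, not at the residual path; the vector values are made up.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


embedding = [0.5, -1.2, 3.0, 0.7]  # made-up embedding vector

once = layer_norm(embedding)   # layernorm applied on the embeddings
twice = layer_norm(once)       # layernorm applied again by the first layer

# Largest element-wise change caused by the second normalization.
max_diff = max(abs(a - b) for a, b in zip(once, twice))
print(max_diff)
```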

When I run "ds_train_bert_nvidia_data_bsz64k_seq128.sh", it stalls at the end of the first epoch. ![image](https://user-images.githubusercontent.com/73824384/128315268-62ff3cff-6e67-45c0-a80a-a6e42d916775.png)

Hi! Thank you for the tool and the example. I've been trying to reproduce 'progressive layer dropping' on RoBERTa and other pretraining methods, but I couldn't find where `gamma`...

The examples shown [here](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-3D_parallelism) and [here](https://www.deepspeed.ai/tutorials/megatron/) are based on versions from about half a year ago. Are there any examples aligned with recent Megatron? Or, is there still relatively obvious optimization...

I keep having this trouble with [Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh](https://github.com/microsoft/DeepSpeedExamples/blob/master/Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh) and I'm not sure what is causing it. The error is below:

```
python: /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:159: void deepspeed_aio_handle_t::_stop_threads(): Assertion `0 == _num_pending_ops' failed.
Killing...
```

```
Traceback (most recent call last):
  File "run_generation.py", line 350, in <module>
    main()
  File "run_generation.py", line 261, in main
    model = deepspeed.init_inference(model,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 274, in init_inference
    engine = InferenceEngine(model,...
```

Hi DeepSpeed community, I was trying to run the HelloDeepSpeed example on an AWS p3.16x instance (8 V100 GPUs). However, I hit this issue:

```
deepspeed train_bert_ds.py --checkpoint_dir ....
```