DeepSpeedExamples
Example models using DeepSpeed
It seems there is a bug in our DeepSpeed SQuAD fine-tuning code. There are duplicated keys for the dropout probability settings in the model configuration file. With the bug, it is...
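A short sketch of why duplicated keys in a JSON configuration file are dangerous: Python's `json` module silently keeps only the last occurrence of a repeated key, so the earlier value is discarded without warning. The key name below is hypothetical, not the actual key from the DeepSpeed config.

```python
import json

# Hypothetical config fragment mirroring the reported bug: the same dropout
# key appears twice. json.loads() keeps only the LAST occurrence, so the
# first value (0.1) is silently dropped.
config_text = """
{
    "hidden_dropout_prob": 0.1,
    "hidden_dropout_prob": 0.3
}
"""

config = json.loads(config_text)
print(config["hidden_dropout_prob"])  # the later value, 0.3, wins
```

This is standard behavior of Python's JSON decoder; a linter or a `json.loads(..., object_pairs_hook=...)` check is needed to catch the duplicate.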
In the pre-layernorm version of BERT, the application of layernorm on the embeddings is redundant since it is applied by the first transformer layer as well. For reference: https://github.com/NVIDIA/Megatron-LM/blob/19301985dd31c8b612095cbad15bd903e8ddd497/megatron/model/language_model.py#L165
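A minimal plain-Python sketch of the redundancy claim (ignoring the learnable scale/shift parameters): once a vector has been layer-normalized, normalizing it again is a no-op up to the epsilon term, so an embedding LayerNorm followed immediately by the first pre-LN block's input LayerNorm does essentially nothing extra.

```python
import math

def layer_norm(xs, eps=1e-5):
    # LayerNorm without the affine (gamma/beta) parameters.
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

emb = [0.5, -1.2, 3.0, 0.1]          # toy "embedding" vector
once = layer_norm(emb)                # hypothetical embedding LayerNorm
twice = layer_norm(once)              # first pre-LN block normalizes again

# The second normalization changes almost nothing: after the first pass the
# vector already has zero mean and (near-)unit variance.
print(max(abs(a - b) for a, b in zip(once, twice)))  # on the order of 1e-6
```

With the affine parameters included the two LayerNorms are not literally identical, but the embedding LayerNorm's scale and shift can be absorbed by the first block's LayerNorm, which is the sense in which it is redundant.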
When I run "ds_train_bert_nvidia_data_bsz64k_seq128.sh", it stalls at the end of the first epoch. ![image](https://user-images.githubusercontent.com/73824384/128315268-62ff3cff-6e67-45c0-a80a-a6e42d916775.png)
Hi! Thank you for the tool and the example. I've been trying to reproduce 'progressive layer dropping' on RoBERTa and other pretraining methods, but I couldn't find where `gamma`...
The examples shown [here](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-3D_parallelism) or [here](https://www.deepspeed.ai/tutorials/megatron/) are based on versions from about half a year ago. Are there any examples aligned with recent Megatron? Or is there still relatively obvious optimization...
I keep hitting this problem with [Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh](https://github.com/microsoft/DeepSpeedExamples/blob/master/Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh) and I'm not sure what is causing it. The error is below:
```
python: /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:159: void deepspeed_aio_handle_t::_stop_threads(): Assertion `0 == _num_pending_ops' failed.
Killing...
```
```
Traceback (most recent call last):
  File "run_generation.py", line 350, in <module>
    main()
  File "run_generation.py", line 261, in main
    model = deepspeed.init_inference(model,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 274, in init_inference
    engine = InferenceEngine(model,...
```
Hi DeepSpeed community, I was trying to run the HelloDeepSpeed example on an AWS p3.16xlarge instance (8 V100 GPUs). However, I was hitting this issue:
```
deepspeed train_bert_ds.py --checkpoint_dir ....
```