DeepSpeedExamples
Example models using DeepSpeed
It seems there is a bug in our DeepSpeed SQuAD fine-tuning code. There are duplicated keys for the dropout probability settings in the model configuration file. With the bug, it is...
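A short sketch of why duplicated keys in a JSON configuration file are dangerous: Python's `json` module silently keeps only the last occurrence of a repeated key, so the earlier value is discarded without warning. The key name below is hypothetical, not the actual key from the DeepSpeed config.

```python
import json

# Hypothetical config fragment mirroring the reported bug: the same dropout
# key appears twice. json.loads() keeps only the LAST occurrence, so the
# first value (0.1) is silently dropped.
config_text = """
{
    "hidden_dropout_prob": 0.1,
    "hidden_dropout_prob": 0.3
}
"""

config = json.loads(config_text)
print(config["hidden_dropout_prob"])  # the later value, 0.3, wins
```

This is standard behavior of Python's JSON decoder; a linter or a `json.loads(..., object_pairs_hook=...)` check is needed to catch the duplicate.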
In the pre-layernorm version of BERT, the application of layernorm on the embeddings is redundant since it is applied by the first transformer layer as well. For reference: https://github.com/NVIDIA/Megatron-LM/blob/19301985dd31c8b612095cbad15bd903e8ddd497/megatron/model/language_model.py#L165
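A minimal plain-Python sketch of the redundancy claim (ignoring the learnable scale/shift parameters): once a vector has been layer-normalized, normalizing it again is a no-op up to the epsilon term, so an embedding LayerNorm followed immediately by the first pre-LN block's input LayerNorm does essentially nothing extra.

```python
import math

def layer_norm(xs, eps=1e-5):
    # LayerNorm without the affine (gamma/beta) parameters.
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

emb = [0.5, -1.2, 3.0, 0.1]          # toy "embedding" vector
once = layer_norm(emb)                # hypothetical embedding LayerNorm
twice = layer_norm(once)              # first pre-LN block normalizes again

# The second normalization changes almost nothing: after the first pass the
# vector already has zero mean and (near-)unit variance.
print(max(abs(a - b) for a, b in zip(once, twice)))  # on the order of 1e-6
```

With the affine parameters included the two LayerNorms are not literally identical, but the embedding LayerNorm's scale and shift can be absorbed by the first block's LayerNorm, which is the sense in which it is redundant.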
When I run "ds_train_bert_nvidia_data_bsz64k_seq128.sh", it stalls at the end of the first epoch. ![image](https://user-images.githubusercontent.com/73824384/128315268-62ff3cff-6e67-45c0-a80a-a6e42d916775.png)
Hi! Thank you for the tool and the example. I've been trying to reproduce 'progressive layer dropping' on RoBERTa and other pretraining methods, but I couldn't find where `gamma`...
The examples shown [here](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-3D_parallelism) or [here](https://www.deepspeed.ai/tutorials/megatron/) are based on versions from about half a year ago. Are there any examples aligned with recent Megatron? Or is there still relatively obvious optimization...
I keep hitting this problem with [Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh](https://github.com/microsoft/DeepSpeedExamples/blob/master/Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero3.sh) and I'm not sure what is causing it. The error is below:
```
python: /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:159: void deepspeed_aio_handle_t::_stop_threads(): Assertion `0 == _num_pending_ops' failed.
Killing...
```
```
Traceback (most recent call last):
  File "run_generation.py", line 350, in <module>
    main()
  File "run_generation.py", line 261, in main
    model = deepspeed.init_inference(model,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 274, in init_inference
    engine = InferenceEngine(model,...
```
Hi DeepSpeed community, I was trying to run the HelloDeepSpeed example on an AWS p3.16xlarge instance (8 V100 GPUs). However, I was hitting this issue:
```
deepspeed train_bert_ds.py --checkpoint_dir ....
```