DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Fix issue #2090 by converting microseconds to seconds
I got an error using MoQ in a ROCm environment on an AMD GPU. Error: Traceback (most recent call last): File "train_deepspeed.py", line 694, in main(args) File "train_deepspeed.py", line 554, in...
This bug prevents running the Megatron-LM 10B offload training example
**Note** I found a bug and a fix. However, rather than directly submitting a fix and PR, I'm reporting the issue. If I have time I'll also submit a PR....
## Describe the bug I tried to run inference on a GPT-2 model with the code below. The code uses the DeepSpeed inference optimization. When I repeated model inference continuously, `floating point exception(core dump)`...
**Describe the bug** https://github.com/microsoft/DeepSpeed/pull/1705 added a line in DeepSpeedSelfAttentionFunction that overwrites the input_mask (attention_mask) with a dummy attention mask. Because of this code, the `attention_mask` input is ignored in the forward pass of all transformer models....
**Describe the bug** The usual model.eval() call does not seem to work with the DeepSpeed Transformer Kernel. The training flag changes, but the randomness is still there. **To Reproduce** I've made...
Hello, I am a new user of DeepSpeed (DS) and I successfully trained checkpoints using DS. However, I ran into an issue when trying to use the checkpoint for inference. I want to...
Hi, I am trying to fine-tune Meta's OPT-66B; however, our system always reports that there is not enough memory. >Max vmem = 434.289G Max rss = 315.860G failed...
**Is your feature request related to a problem? Please describe.** What is the best way to run DeepSpeed inference in C++? **Describe the solution you'd like** Documenting if it is...