pytorch-lightning
BUG when Trainer.test() with deepspeed stage 3
🐛 Bug
Hi, I am hitting some bugs when combining Lightning with DeepSpeed, following https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#trainer-class-api.
I find that:
- Trainer.fit() works well with deepspeed stage 2 and 3.
- Trainer.test() is compatible with deepspeed stage 2.
- There are bugs when combining Trainer.test() with deepspeed stage 3:
When I initialize DeepSpeedStrategy with a DeepSpeed config and then call test, I hit this bug. I have tried changing the config, but the bug persists regardless.
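For reference, a minimal sketch of the config-based setup that fails for me. The model, dataloader, and the exact config values below are placeholders, not the real ones from my project; my real config fails in the same way.

```python
# Sketch of the config-based DeepSpeedStrategy setup (placeholder values).
# fit() completes with ZeRO stage 3, but the subsequent test() call errors out.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
    },
    "train_micro_batch_size_per_gpu": 1,
}

if __name__ == "__main__":
    import pytorch_lightning as pl
    from pytorch_lightning.strategies import DeepSpeedStrategy

    model = ...        # my LightningModule (omitted)
    test_loader = ...  # my test DataLoader (omitted)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,
        precision=16,
        strategy=DeepSpeedStrategy(config=ds_config),
    )
    trainer.fit(model)                             # stage 3 training works
    trainer.test(model, dataloaders=test_loader)   # this call raises the error
```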

When I initialize DeepSpeedStrategy with Lightning arguments instead, I run into two problems: generation is much slower (I use Hugging Face Transformers and their generate() method) than the DeepSpeed inference script at https://www.deepspeed.ai/tutorials/inference-tutorial/, and I also hit an "index out of range" error caused by self.__step of the DeepSpeed engine.
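For comparison, the argument-based initialization I tried looks roughly like this (again a sketch; the model and data are placeholders):

```python
# Sketch of the argument-based DeepSpeedStrategy setup: test() runs here,
# but generate() is very slow and eventually hits the index-out-of-range error.
STAGE = 3  # ZeRO stage passed directly instead of via a config dict

if __name__ == "__main__":
    import pytorch_lightning as pl
    from pytorch_lightning.strategies import DeepSpeedStrategy

    model = ...        # LightningModule wrapping a Hugging Face model;
                       # its test_step calls model.generate() (omitted)
    test_loader = ...  # test DataLoader (omitted)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,
        precision=16,
        strategy=DeepSpeedStrategy(stage=STAGE, offload_parameters=True),
    )
    trainer.test(model, dataloaders=test_loader)
```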
Hoping for better compatibility between lightning and deepspeed!
cc @awaelchli @rohitgr7 @akihironitta
I've also run into slow inference speeds when using deepspeed stage 3 with Trainer.predict(). I'm seeing around 3.7 batches/second on a V100 (batch_size=1, BigBird base model, IMDB dataset, sequence length of 4096).
The inference speed is too slow compared with the deepspeed script in https://www.deepspeed.ai/tutorials/inference-tutorial/.