
Bug when calling Trainer.test() with DeepSpeed stage 3

Open SihengLi99 opened this issue 3 years ago • 2 comments

🐛 Bug

Hi, I've run into some bugs when combining Lightning with DeepSpeed, following https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#trainer-class-api.

I find that:

  • Trainer.fit() works well with DeepSpeed stage 2 and stage 3.
  • Trainer.test() is compatible with DeepSpeed stage 2.
  • There are bugs when combining Trainer.test() with DeepSpeed stage 3:

When I initialize DeepSpeedStrategy with a DeepSpeed config and then run test, I hit the error below. I have tried changing the config, but the error persists no matter what.

[screenshot of the error traceback]
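
For context, this is roughly the setup that triggers the error (a minimal sketch; MyModel, MyDataModule, and ds_config.json are placeholder names, not my actual code):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

model = MyModel()            # a LightningModule wrapping the model under test
datamodule = MyDataModule()  # a LightningDataModule providing the test set

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,
    # stage 3 is set inside the DeepSpeed JSON config
    strategy=DeepSpeedStrategy(config="ds_config.json"),
)

# fit() works with this setup; test() raises the error shown above
trainer.test(model, datamodule=datamodule)
```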

When I initialize DeepSpeedStrategy with Lightning arguments instead, I run into two problems. First, generation is much slower (I use Hugging Face Transformers and its generate() method) than the DeepSpeed inference script in https://www.deepspeed.ai/tutorials/inference-tutorial/. Second, I also hit an index-out-of-range error related to self.__step in the DeepSpeed engine.
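
Roughly, this is the second setup (a sketch only; the model name and test_step body are placeholders rather than my exact code):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


class GenerationModule(pl.LightningModule):
    def __init__(self, model_name="t5-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def test_step(self, batch, batch_idx):
        # the generate() call below is what runs much slower under stage 3
        outputs = self.model.generate(batch["input_ids"], max_length=128)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)


trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,
    # stage 3 configured through Lightning arguments instead of a JSON config
    strategy=DeepSpeedStrategy(stage=3),
)
# trainer.test(GenerationModule(), datamodule=...)
```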

Hoping for better compatibility between Lightning and DeepSpeed!

cc @awaelchli @rohitgr7 @akihironitta

SihengLi99 avatar Aug 08 '22 14:08 SihengLi99

I've also run into slow inference speeds when using DeepSpeed stage 3 with Trainer.predict(). I'm seeing around 3.7 batches/second on a V100 (batch_size=1, BigBird base model, IMDB dataset, sequence length of 4096).

jessecambon avatar Aug 08 '22 16:08 jessecambon

The inference speed is far slower than the DeepSpeed script in https://www.deepspeed.ai/tutorials/inference-tutorial/.
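
For reference, that tutorial wraps the Hugging Face model with deepspeed.init_inference() before generating, roughly like this (a paraphrased sketch of the tutorial, not the exact script):

```python
import torch
import deepspeed
from transformers import pipeline

# the tutorial runs a text-generation pipeline and lets DeepSpeed inject optimized kernels
generator = pipeline("text-generation", model="gpt2", device=0)
generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=1,                        # model-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # swap in DeepSpeed inference kernels
)
print(generator("DeepSpeed is", max_length=30, do_sample=False))
```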

SihengLi99 avatar Aug 09 '22 01:08 SihengLi99