Reza Yazdani
Hi @xiejw, Can you try this [PR](https://github.com/microsoft/DeepSpeed/pull/2916) using the kernels as well as mp>1 and see if it works for you? Thanks, Reza
Hi @xiejw, Thanks for trying this out. Let me try it on my side again and see if I can reproduce the same issue. Thanks, Reza
@xiejw, are you trying this the same way I described in the PR?
@xiejw, can you please try this again, passing `--replace_method 'auto'` when running with `inference-test.py`?
Hi @sakogan, Thanks for the PR. Sorry for the delay in getting back to this. The issue that you raised here is valid and definitely needs to be addressed....
Thanks, this now makes sense 👍
Hi @slundberg, The cache is controlled internally by DS-Inference. As long as you pass the initial context (input prompt) to DS-Inference, it will treat the previous context as the KV cache...
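To illustrate the idea (this is a toy sketch, not DeepSpeed's actual implementation — `fake_kv` and `KVCache` are hypothetical names): a KV cache stores the per-token key/value projections computed so far, so that extending a context only requires computing entries for the new tokens rather than re-running the whole prompt.

```python
# Toy illustration of KV caching (NOT DeepSpeed's code): cached key/value
# entries for a prefix are reused, and only new tokens are projected.

def fake_kv(token):
    """Stand-in for the expensive key/value projection of one token."""
    h = hash(token)
    return (h & 0xFF, (h >> 8) & 0xFF)

class KVCache:
    def __init__(self):
        self.tokens = []    # context tokens seen so far
        self.kv = []        # cached (key, value) pair per token
        self.computed = 0   # how many projections we actually ran

    def extend(self, context):
        # The new context must extend the cached prefix; only the
        # tokens beyond that prefix get (re)computed.
        assert context[:len(self.tokens)] == self.tokens, \
            "context must extend the cached prefix"
        for tok in context[len(self.tokens):]:
            self.kv.append(fake_kv(tok))
            self.computed += 1
        self.tokens = list(context)
        return self.kv

cache = KVCache()
cache.extend(["Hello", "world"])           # computes 2 entries
cache.extend(["Hello", "world", "again"])  # computes only 1 more
print(cache.computed)  # → 3
```

The same principle applies at inference time: passing the full context (original prompt plus generated tokens) lets the engine recognize the cached prefix and attend over stored keys/values instead of recomputing them.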
Hi @mallorbc, Thanks for reporting this issue. I will see if I can reproduce it on my end. Thanks, Reza
Also, can I ask which PyTorch version you are using here?
Hi, I have fixed some bugs regarding the checkpoint loading for these model architectures. Could you please retry using this PR? You can also try our updated test-suite [here](https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py). Thanks,...