Blake
@PanQiWei >Have you tried to disable KV-cache kernel injection and did it solved the problem when using num_beams>1? Won't this remove the speed-up benefit of using DeepSpeed? I guess you...
I am currently unable to run the model as well, though I could be having a separate, unrelated issue.
> Hi @mallorbc, > > We have added a test-suite [here ](https://github.com/microsoft/DeepSpeedExamples/pull/223)that measure the memory consumption after `init_inference` and also the pipeline creation. Can you please try it to see...
Today I also discovered this issue for a GPTJ model when doing greedy decoding for batch sizes of 8 vs 16. I am glad to have confirmation that this is...
I looked into it further: even without using int8, different batch sizes give different results.
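For anyone wondering how different batch sizes can change greedy-decoding output at all: this is not DeepSpeed-specific code, just a minimal illustration of the usual suspect. Floating-point addition is not associative, and a GPU kernel may choose a different reduction/tiling order for batch size 8 than for 16, so the same input row can get slightly different logits — enough to flip an argmax when two tokens are nearly tied.

```python
import numpy as np

# Float addition is not associative: the same three numbers summed in a
# different order give different results in float32.
x = np.float32([1e8, 1.0, -1e8])

left_to_right = (x[0] + x[1]) + x[2]  # the 1.0 is absorbed into 1e8, leaving 0.0
reordered = (x[0] + x[2]) + x[1]      # cancellation happens first, leaving 1.0

print(left_to_right, reordered)
```

A kernel that changes its summation order with batch size perturbs logits in exactly this way, which is why greedy results can diverge between batch sizes even without quantization.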
> I do not know what is expected behavior after seeing this occur without using int8. When I was doing batch processing for GPTJ, I was using bfloat16, which is...
I have noticed this issue on the latest main release with a 2022 Mazda CX-5 as well.
This could be done somewhat well with open-source resources if we had the data, but the data would be very expensive to get.
Did anyone do this? Running this as a service would be great.
Is there a reason why llama.cpp supports 4-bit quantization on x86 processors but GPTJ does not work with 4-bit on x86? Edit: Looking at some of the commits...
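For context on what "4-bit quantization" means here: llama.cpp-style Q4 formats store weights in small blocks, each with one float scale and two 4-bit values packed per byte. The sketch below shows the general idea only — the function names and the exact rounding/packing details are my own simplification, not llama.cpp's actual on-disk format.

```python
import numpy as np

def quantize_q4(block):
    """Symmetric 4-bit quantization of a float block with one shared scale.

    A simplified sketch of the idea behind llama.cpp-style Q4 formats,
    not the exact ggml layout.
    """
    scale = float(np.abs(block).max()) / 7.0
    if scale == 0.0:
        scale = 1.0  # avoid dividing by zero on an all-zero block
    q = np.round(block / scale).astype(np.int8)  # values in [-7, 7]
    # Pack two 4-bit values per byte (offset by 8 so they are unsigned nibbles).
    u = (q + 8).astype(np.uint8)
    packed = (u[0::2] | (u[1::2] << 4)).astype(np.uint8)
    return scale, packed

def dequantize_q4(scale, packed):
    """Unpack the nibbles and rescale back to float32."""
    q = np.empty(packed.size * 2, dtype=np.int8)
    q[0::2] = (packed & 0x0F).astype(np.int8) - 8  # low nibble
    q[1::2] = (packed >> 4).astype(np.int8) - 8    # high nibble
    return q.astype(np.float32) * scale
```

Each weight costs 4 bits plus a shared per-block scale, and the round-trip error is bounded by half the scale, which is why these formats work well on CPU: dequantization is just a shift, a mask, and a multiply.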