Blake
@PanQiWei >Have you tried to disable KV-cache kernel injection and did it solved the problem when using num_beams>1? Won't this remove the speed-up benefit of using DeepSpeed? I guess you...
I am currently unable to run the model as well, though I could be having a separate, unrelated issue.
> Hi @mallorbc, > > We have added a test-suite [here ](https://github.com/microsoft/DeepSpeedExamples/pull/223)that measure the memory consumption after `init_inference` and also the pipeline creation. Can you please try it to see...
Today I also discovered this issue for a GPTJ model when doing greedy decoding for batch sizes of 8 vs 16. I am glad to have confirmation that this is...
I looked into it further: even without using int8, different batch sizes give different results.
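For anyone wondering how different batch sizes can change greedy-decoding output at all: this is not DeepSpeed-specific code, just a minimal illustration of the usual suspect. Floating-point addition is not associative, and a GPU kernel may choose a different reduction/tiling order for batch size 8 than for 16, so the same input row can get slightly different logits — enough to flip an argmax when two tokens are nearly tied.

```python
import numpy as np

# Float addition is not associative: the same three numbers summed in a
# different order give different results in float32.
x = np.float32([1e8, 1.0, -1e8])

left_to_right = (x[0] + x[1]) + x[2]  # the 1.0 is absorbed into 1e8, leaving 0.0
reordered = (x[0] + x[2]) + x[1]      # cancellation happens first, leaving 1.0

print(left_to_right, reordered)
```

A kernel that changes its summation order with batch size perturbs logits in exactly this way, which is why greedy results can diverge between batch sizes even without quantization.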
> I do not know what is expected behavior after seeing this occur without using int8. When I was doing batch processing for GPTJ, I was using bfloat16, which is...
I have noticed this issue on the latest main release with a 2022 Mazda CX-5 as well.
This could be done somewhat well with open-source resources if we had the data, but the data would be very expensive to get.
Did anyone do this? Running this as a service would be great.
Is there a reason why llama.cpp supports 4-bit quantization on x86 processors but GPTJ does not work with 4-bit on x86? Edit: Looking at some of the commits...
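For context on what "4-bit quantization" means here: llama.cpp-style Q4 formats store weights in small blocks, each with one float scale and two 4-bit values packed per byte. The sketch below shows the general idea only — the function names and the exact rounding/packing details are my own simplification, not llama.cpp's actual on-disk format.

```python
import numpy as np

def quantize_q4(block):
    """Symmetric 4-bit quantization of a float block with one shared scale.

    A simplified sketch of the idea behind llama.cpp-style Q4 formats,
    not the exact ggml layout.
    """
    scale = float(np.abs(block).max()) / 7.0
    if scale == 0.0:
        scale = 1.0  # avoid dividing by zero on an all-zero block
    q = np.round(block / scale).astype(np.int8)  # values in [-7, 7]
    # Pack two 4-bit values per byte (offset by 8 so they are unsigned nibbles).
    u = (q + 8).astype(np.uint8)
    packed = (u[0::2] | (u[1::2] << 4)).astype(np.uint8)
    return scale, packed

def dequantize_q4(scale, packed):
    """Unpack the nibbles and rescale back to float32."""
    q = np.empty(packed.size * 2, dtype=np.int8)
    q[0::2] = (packed & 0x0F).astype(np.int8) - 8  # low nibble
    q[1::2] = (packed >> 4).astype(np.int8) - 8    # high nibble
    return q.astype(np.float32) * scale
```

Each weight costs 4 bits plus a shared per-block scale, and the round-trip error is bounded by half the scale, which is why these formats work well on CPU: dequantization is just a shift, a mask, and a multiply.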