OlivierDehaene

149 comments by OlivierDehaene

We will run some benchmarks and see if it makes sense to add it to TGI.

> I have tried vLLM with Starcoder on A100, and in many cases, it actually performs worse than vanilla HF. Have you tried running starcoder with TGI? You should see...

Which chat UI? You can find each token being streamed in the `token.text` field.
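
For reference, a minimal sketch of reading the streamed tokens with TGI's Python client; the local server URL is an assumption:

```python
# Minimal sketch, assuming a TGI server running at http://127.0.0.1:8080
# (hypothetical local deployment).
from text_generation import Client

client = Client("http://127.0.0.1:8080")

# Each streamed response carries a `token` object; its `text` field
# holds the decoded token as it is generated.
for response in client.generate_stream("What is deep learning?", max_new_tokens=20):
    if not response.token.special:
        print(response.token.text, end="", flush=True)
```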

Any chance that this convo can happen on runpod instead of here?

We will fork and add it to the flash attention CUDA kernels ourselves.

If I'm not mistaken, this change requires at least fine-tuning the model, right?

This is one of our priorities for the next release.

Have you read the stack trace? In 0.8 your deployment would have OOMed at high throughput with your current settings. 0.9 tells you this from the beginning and asks you...

1.0 is not a valid value. `top_p` must be > 0.0 and < 1.0.
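
A minimal sketch of a valid request against TGI's `/generate` endpoint; the local URL is an assumption. To disable nucleus sampling entirely, omit `top_p` rather than passing 1.0:

```python
# Minimal sketch, assuming a TGI server at http://127.0.0.1:8080
# (hypothetical local deployment).
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is deep learning?",
        # top_p must satisfy 0.0 < top_p < 1.0; 1.0 is rejected by validation.
        "parameters": {"top_p": 0.95, "max_new_tokens": 20},
    },
)
print(resp.json()["generated_text"])
```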

Yes, we will add TE kernels at some point. This is not a high priority for now and will have to wait for a later release.