OlivierDehaene

149 comments by OlivierDehaene

We will run some benchmarks and see if it makes sense to add it to TGI.

> I have tried vLLM with Starcoder on A100, and in many cases, it actually performs worse than vanilla HF. Have you tried running starcoder with TGI? You should see...

Which chat UI? You can find each token being streamed in the `token.text` field.
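
For reference, a minimal sketch of reading the streamed tokens with TGI's Python client; the local server URL is an assumption:

```python
# Minimal sketch, assuming a TGI server running at http://127.0.0.1:8080
# (hypothetical local deployment).
from text_generation import Client

client = Client("http://127.0.0.1:8080")

# Each streamed response carries a `token` object; its `text` field
# holds the decoded token as it is generated.
for response in client.generate_stream("What is deep learning?", max_new_tokens=20):
    if not response.token.special:
        print(response.token.text, end="", flush=True)
```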

Any chance that this convo can happen on runpod instead of here?

We will fork and add it to the flash attention CUDA kernels ourselves.

If I'm not mistaken, this change requires at least fine-tuning the model, right?

This is one of our priorities for the next release.

Have you read the stack trace? In 0.8 your deployment would have OOMed at high throughput with your current settings. 0.9 tells you this from the beginning and asks you...

1.0 is not a valid value. `top_p` must be > 0.0 and < 1.0.
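
A minimal sketch of a valid request against TGI's `/generate` endpoint; the local URL is an assumption. To disable nucleus sampling entirely, omit `top_p` rather than passing 1.0:

```python
# Minimal sketch, assuming a TGI server at http://127.0.0.1:8080
# (hypothetical local deployment).
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "What is deep learning?",
        # top_p must satisfy 0.0 < top_p < 1.0; 1.0 is rejected by validation.
        "parameters": {"top_p": 0.95, "max_new_tokens": 20},
    },
)
print(resp.json()["generated_text"])
```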

Yes, we will add TE kernels at some point. This is not a high priority for now and will have to wait for a later release.