OlivierDehaene

Results: 119 comments by OlivierDehaene

@calvintwr, yes, this CPU bottleneck is why we often rewrite the modelling code in TGI. Speculative decoding is our main priority for the next release.
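For readers unfamiliar with the idea, here is a toy sketch of greedy speculative decoding (not TGI's implementation): a cheap draft model proposes `k` tokens, the target model verifies them, and the output is guaranteed to match what the target alone would have produced greedily. The `target`/`draft` lambdas below are made-up stand-ins for real models.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """target_next/draft_next map a token list to the next greedy token."""
    seq = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1. The draft proposes k tokens autoregressively (cheap, sequential).
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target verifies the proposal; in a real engine this is one
        #    batched forward pass rather than k sequential ones.
        accepted = []
        ctx = list(seq)
        for tok in proposal:
            expected = target_next(ctx)
            if tok == expected:
                accepted.append(tok)
                ctx.append(tok)
            else:
                accepted.append(expected)  # take the target's token and stop
                break
        take = accepted[:max_new_tokens - produced]
        seq.extend(take)
        produced += len(take)
    return seq

# Toy "models": the target counts mod 7; the draft mostly agrees but drifts after a 3.
target = lambda ctx: (ctx[-1] + 1) % 7
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 7

out = speculative_decode(target, draft, [0], max_new_tokens=10)
```

The key property is that verification makes the speedup lossless: however wrong the draft is, the output is identical to a target-only greedy decode, and a good draft lets several tokens land per target pass.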

Thanks! Can you open a PR?

Can you clean the cache and retry? The file may have been corrupted.
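A sketch of what "clean the cache" looks like, assuming the default Hugging Face hub cache location (TGI's Docker image points this at `/data` via `HUGGINGFACE_HUB_CACHE`); the model id below is a placeholder for whichever repo failed to load:

```shell
# Default hub cache unless HUGGINGFACE_HUB_CACHE overrides it.
CACHE="${HUGGINGFACE_HUB_CACHE:-$HOME/.cache/huggingface/hub}"

# Cached repos live in "models--<org>--<name>" directories.
ls "$CACHE" 2>/dev/null || true

# Remove only the suspect model's directory (placeholder id), then re-launch
# so the weights are downloaded fresh.
rm -rf "$CACHE/models--some-org--some-model"
```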

Yes: use the `--master-port` arg or the `MASTER_PORT` env var.
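For example (the launcher invocations are shown as comments so the snippet stays side-effect free; `29501` and the model id are placeholders):

```shell
# Either of these overrides the default shard rendezvous port:
#   text-generation-launcher --model-id <model> --master-port 29501
#   MASTER_PORT=29501 text-generation-launcher --model-id <model>

# The flag takes precedence over the env var, which takes precedence over
# the default; sketched here with plain shell expansion:
MASTER_PORT=29501
PORT="${MASTER_PORT:-29500}"
echo "shards will rendezvous on port $PORT"
```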

Do you have examples of such models?

@TheBloke, TGI seems to have issues with H100s; I'm not sure why yet. Any chance you could test on another device? I was able to launch the model on 1xA10...

@ssmi153, this warning is a bit dismissive. If you don't see import errors and your architecture is one of the optimized architectures (as displayed in the README), you are using...

> The Llama 30B model has num_heads = 52, and it cannot be divided by 8. Therefore, it naturally cannot use shard = 8 for parallel inference.

Thanks for the...
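The constraint quoted above comes from tensor parallelism splitting the attention heads evenly across GPUs, so the shard count must divide `num_heads`. A quick illustrative check (the helper below is hypothetical, not TGI code):

```python
def valid_shard_counts(num_heads, max_gpus=8):
    """Shard counts up to max_gpus that divide the heads evenly."""
    return [n for n in range(1, max_gpus + 1) if num_heads % n == 0]

# Llama 30B has 52 attention heads; 52 = 2 * 2 * 13,
# so only 1, 2, and 4 GPUs give an even split.
print(valid_shard_counts(52))
```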

> and there is no nv-driver compatible for both 12.1/12.2

[From the page you linked:](https://docs.nvidia.com/deploy/cuda-compatibility/index.html)

> If you are upgrading the driver to 525.60.13 which is the minimum required driver...