OlivierDehaene

Results: 119 comments by OlivierDehaene

@calvintwr, yes, this CPU bottleneck is why we often rewrite the modelling code in TGI. Speculative decoding is our main priority for the next release.
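For readers unfamiliar with the idea, here is a toy sketch of greedy speculative decoding (not TGI's implementation): a cheap draft model proposes `k` tokens, the target model verifies them, and the output is guaranteed to match what the target alone would have produced greedily. The `target`/`draft` lambdas below are made-up stand-ins for real models.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """target_next/draft_next map a token list to the next greedy token."""
    seq = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1. The draft proposes k tokens autoregressively (cheap, sequential).
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target verifies the proposal; in a real engine this is one
        #    batched forward pass rather than k sequential ones.
        accepted = []
        ctx = list(seq)
        for tok in proposal:
            expected = target_next(ctx)
            if tok == expected:
                accepted.append(tok)
                ctx.append(tok)
            else:
                accepted.append(expected)  # take the target's token and stop
                break
        take = accepted[:max_new_tokens - produced]
        seq.extend(take)
        produced += len(take)
    return seq

# Toy "models": the target counts mod 7; the draft mostly agrees but drifts after a 3.
target = lambda ctx: (ctx[-1] + 1) % 7
draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 7

out = speculative_decode(target, draft, [0], max_new_tokens=10)
```

The key property is that verification makes the speedup lossless: however wrong the draft is, the output is identical to a target-only greedy decode, and a good draft lets several tokens land per target pass.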

Thanks! Can you open a PR?

Can you clean the cache and retry? The file may have been corrupted.
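A sketch of what "clean the cache" looks like, assuming the default Hugging Face hub cache location (TGI's Docker image points this at `/data` via `HUGGINGFACE_HUB_CACHE`); the model id below is a placeholder for whichever repo failed to load:

```shell
# Default hub cache unless HUGGINGFACE_HUB_CACHE overrides it.
CACHE="${HUGGINGFACE_HUB_CACHE:-$HOME/.cache/huggingface/hub}"

# Cached repos live in "models--<org>--<name>" directories.
ls "$CACHE" 2>/dev/null || true

# Remove only the suspect model's directory (placeholder id), then re-launch
# so the weights are downloaded fresh.
rm -rf "$CACHE/models--some-org--some-model"
```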

Yes: use the `--master-port` arg or the `MASTER_PORT` env var.
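For example (the launcher invocations are shown as comments so the snippet stays side-effect free; `29501` and the model id are placeholders):

```shell
# Either of these overrides the default shard rendezvous port:
#   text-generation-launcher --model-id <model> --master-port 29501
#   MASTER_PORT=29501 text-generation-launcher --model-id <model>

# The flag takes precedence over the env var, which takes precedence over
# the default; sketched here with plain shell expansion:
MASTER_PORT=29501
PORT="${MASTER_PORT:-29500}"
echo "shards will rendezvous on port $PORT"
```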

Do you have examples of such models?

@TheBloke, TGI seems to have issues with H100s; I'm not sure why yet. Any chance you could test on another device? I was able to launch the model on 1xA10...

@ssmi153, this warning is a bit dismissive. If you don't see import errors and your architecture is one of the optimized architectures (as displayed in the README), you are using...

> The Llama 30B model has num_heads = 52, and it cannot be divided by 8. Therefore, it naturally cannot use shard = 8 for parallel inference.

Thanks for the...
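The constraint quoted above comes from tensor parallelism splitting the attention heads evenly across GPUs, so the shard count must divide `num_heads`. A quick illustrative check (the helper below is hypothetical, not TGI code):

```python
def valid_shard_counts(num_heads, max_gpus=8):
    """Shard counts up to max_gpus that divide the heads evenly."""
    return [n for n in range(1, max_gpus + 1) if num_heads % n == 0]

# Llama 30B has 52 attention heads; 52 = 2 * 2 * 13,
# so only 1, 2, and 4 GPUs give an even split.
print(valid_shard_counts(52))
```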

> and there is no nv-driver compatible for both 12.1/12.2

[From the page you linked:](https://docs.nvidia.com/deploy/cuda-compatibility/index.html)

> If you are upgrading the driver to 525.60.13 which is the minimum required driver...