OlivierDehaene
@sapountzis, we had an issue in the past with CUDA graphs + quantization. Were you using that by any chance? It should be fixed in v2.0.0.
@sapountzis, we are deprecating bnb in favour of eetq as it is way faster. @mhillebrand, the "Too many open files" error is happening in the client, so tweaking the container ulimit...
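Since the "Too many open files" error comes from the client process, one option is to raise that process's own open-file soft limit (RLIMIT_NOFILE). A minimal sketch using Python's standard, Unix-only `resource` module; the helper name `raise_nofile_soft_limit` and the target value are hypothetical and not part of TGI or its client:

```python
import resource


def raise_nofile_soft_limit(target):
    """Hypothetical helper: raise this process's open-file soft limit toward `target`.

    An unprivileged process may raise its soft limit, but never above the hard limit.
    Returns the (soft, hard) pair in effect afterwards; the limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Cap the request at the hard limit unless the hard limit is unlimited.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft, hard
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # Some platforms reject values below the reported hard limit (e.g. macOS).
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_nofile_soft_limit(65536))
```

Raising the limit only inside the client process leaves the container's ulimit untouched; the equivalent shell-level knob is `ulimit -n` in the environment that launches the client.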
Can you run --env?
That seems to be an issue with Triton. Maybe this comment can help? https://github.com/pytorch/pytorch/issues/107960#issuecomment-1783432552
Weird. We run all our deployments in k8s and don't run into this issue. I will have a look.
- Do you guys deploy in EKS, GKE, something else? - Do you use the Nvidia Operator, Nvidia device plugin? - Do you overwrite the NVIDIA_DRIVER_CAPABILITIES, or LD_LIBRARY_PATH env var...
> We aren't using Kubernetes; we convert the public docker image to singularity to run on our HPC system. I think the original issue came from trouble running the image...
It's possible that your driver is too old. As you can see, it supports CUDA versions up to 11.7, while TGI uses 11.8.
@Ichigo3766, what hardware are you running TGI on? And are you on a version > v1?
Nice, thank you! I will update the version of torch in the image as soon as 2.1 is out. Cheers!