Pravin Gadakh
So I upgraded the torch CUDA version to 11.8 with `conda install --force pytorch==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia`, and the Python interpreter now correctly shows 11.8. However, that did not help...
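For anyone following along, the upgrade-and-verify sequence above can be sketched as below; the `conda` command mirrors the one in this thread, and the `python -c` check is just one way to confirm which CUDA toolkit the installed torch build targets (the expected output assumes the 11.8 build actually gets installed):

```shell
# Force-reinstall the CUDA 11.8 build of pytorch 2.0.0 (command from this thread).
conda install --force pytorch==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia

# Verify what the interpreter actually reports for the torch CUDA build.
python -c "import torch; print(torch.version.cuda)"   # should print 11.8 if the upgrade took effect
```

Note that `torch.version.cuda` reports the CUDA version torch was *built* against, which is what matters for the custom-kernel compatibility issue discussed here; it can differ from the driver's CUDA version shown by `nvidia-smi`.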
@Narsil I tried with the 0.9.4 image; `torch.version.cuda` shows only 11.7. However, even after upgrading the CUDA version it did not help. Disabling custom kernels helped, but I would prefer not...
Even with [sha-7766fee](https://github.com/orgs/huggingface/packages/container/text-generation-inference/114363477?tag=sha-7766fee) I am seeing the same issue.
@Narsil Were you able to figure out the issue? Also, I have a llama2 model deployed on 8 A100 (40 GB) GPUs and had a couple of quick questions around that. Can...
@Narsil With the [sha-f91e9d2](https://github.com/orgs/huggingface/packages/container/text-generation-inference/115569670?tag=sha-f91e9d2) image I now see the torch CUDA version as 11.8. However, I still need to add `--disable-custom-kernels` in order to deploy the llama model. @marioluan You can try...
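For reference, the workaround above amounts to passing `--disable-custom-kernels` to the text-generation-inference launcher. A minimal sketch of the docker invocation follows; the model id, port, and volume path are illustrative, not from this thread:

```shell
# Sketch: run the TGI image from this thread with custom kernels disabled.
# meta-llama/Llama-2-7b-hf, the port mapping, and /data volume are assumptions.
docker run --gpus all -p 8080:80 \
  -v "$PWD/data:/data" \
  ghcr.io/huggingface/text-generation-inference:sha-f91e9d2 \
  --model-id meta-llama/Llama-2-7b-hf \
  --disable-custom-kernels
```

Disabling custom kernels falls back to slower non-fused code paths, so it is a workaround for the CUDA-version mismatch rather than a fix.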
@c21 I see you worked on the original distributed example; would you be able to help me figure out what I am missing here?
@c21 Apologies for the delay in responding; I got occupied with other work. Our RayCluster setup has 6 worker nodes, each with 2 A100 80 GB GPUs (18 CPUs)...
@davidhyun We are stuck with the same issue as well. May I know what approach you took to resolve it?