John Jawed (JJ)

20 comments by John Jawed (JJ)

Never seen that before, not sure why it happens. Can you please output a run with `bash -x`?

Great suggestion, I’ll add this.

Haven’t had a chance to test these in depth yet. Due to the nature of the process, sometimes a failure is OK. I suspect I need to add a flag...

Was this 7Server or desktop?

Can you please provide some output from the script when it attempts to do a yum install?

Testing environment: Ubuntu 22.04 and no MIG setup (A100). Command:

```
podman run --network host --shm-size 1g --rm --security-opt=label=disable --device=nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES="all" ghcr.io/huggingface/text-generation-inference:latest --model-id bigscience/bloom-560m
```

Hi @OlivierDehaene, the lack of the env var in my comment is a copy/paste error. Good catch. Without `CUDA_VISIBLE_DEVICES=all`, this works fine, although only with CPU support and 1 shard...

`CUDA_VISIBLE_DEVICES=all` could be the problem; however, it is currently (mis)used, especially in container setups [1]. Here is how I got to supporting `CUDA_VISIBLE_DEVICES=all`. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#environment-variables-oci-spec `all` is a supported value for...

> For the doc you linked the env variable is `NVIDIA_VISIBLE_DEVICES` not `CUDA_VISIBLE_DEVICES`. Maybe that explains it?

Yeah, it feels like there is a lot of ambiguity between what...
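To make the ambiguity concrete, here is a rough Python sketch of one way a launcher could tolerate `all` in a `CUDA_VISIBLE_DEVICES`-style value. This is not how text-generation-inference or CUDA handles it: `resolve_visible_devices` is a hypothetical helper, treating `all` like an unset variable is an assumption borrowed from the `NVIDIA_VISIBLE_DEVICES` convention, and GPU UUIDs are deliberately not handled.

```python
from typing import List, Optional


def resolve_visible_devices(value: Optional[str], device_count: int) -> List[int]:
    """Hypothetical helper: map a CUDA_VISIBLE_DEVICES-style string to GPU indices.

    Assumptions (not taken from any library):
      * unset (None) means every device is visible
      * "all" is accepted as an alias for "unset"
      * an empty string hides every device
      * anything else must be a comma-separated list of integer indices
    """
    if value is None:
        return list(range(device_count))

    value = value.strip()
    if value.lower() == "all":
        # Container-style keyword; treated here the same as an unset variable.
        return list(range(device_count))
    if value == "":
        return []

    indices = []
    for token in value.split(","):
        token = token.strip()
        if not token.isdigit():
            # This sketch rejects UUIDs and other non-numeric tokens outright.
            raise ValueError(f"unsupported device token: {token!r}")
        index = int(token)
        if index >= device_count:
            raise ValueError(f"device index {index} out of range")
        indices.append(index)
    return indices
```

In a launcher this would be called with something like `resolve_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"), n_gpus)`, where `n_gpus` comes from whatever device query the launcher already performs; the number of shards could then default to the length of the returned list.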