John Jawed (JJ) comments

Results: 20 comments of John Jawed (JJ)

any idea why this happens?

Never seen that before, not sure why it happens. Can you please output a run with `bash -x`?

any idea why this happens?

Haven’t had a chance to test these in depth yet. Due to the nature of the process, sometimes a failure is OK. I suspect I need to add a flag...

I have run this script, but my OS version is still showing 7 only

Can you please provide some output from the script when it attempts to do a yum install?

Improve num_shard support with CUDA_VISIBLE_DEVICES=all

Testing environment: Ubuntu 22.04 and no MIG setup (A100). Command:

```
podman run --network host --shm-size 1g --rm \
  --security-opt=label=disable \
  --device=nvidia.com/gpu=all \
  -e CUDA_VISIBLE_DEVICES="all" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id bigscience/bloom-560m
```

Improve num_shard support with CUDA_VISIBLE_DEVICES=all

Hi @OlivierDehaene, the lack of the env var in my comment is a copy/paste error. Good catch. Without `CUDA_VISIBLE_DEVICES=all`, this works fine, although only with CPU support and 1 shard...

Improve num_shard support with CUDA_VISIBLE_DEVICES=all

`CUDA_VISIBLE_DEVICES=all` could be the problem; however, it is currently (mis)used, especially in container setups [1]. Here is how I got to supporting `CUDA_VISIBLE_DEVICES=all`. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#environment-variables-oci-spec `all` is a supported value for...

Improve num_shard support with CUDA_VISIBLE_DEVICES=all

> For the doc you linked, the env variable is `NVIDIA_VISIBLE_DEVICES`, not `CUDA_VISIBLE_DEVICES`. Maybe that explains it?

Yeah, it feels like there is a lot of ambiguity between what...
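To make the `all` question in this thread concrete, here is a minimal sketch of one way a launcher could derive a shard count from `CUDA_VISIBLE_DEVICES` while tolerating `all`. It is not the text-generation-inference implementation: `resolve_num_shard` and its `detected_gpus` parameter are hypothetical names, and treating `all` as "use every detected GPU" is an assumption borrowed from how `NVIDIA_VISIBLE_DEVICES` is documented.

```python
from typing import Optional


def resolve_num_shard(cuda_visible_devices: Optional[str], detected_gpus: int) -> int:
    """Hypothetical helper: guess how many shards to launch.

    cuda_visible_devices: raw value of the env var, or None if it is unset.
    detected_gpus: GPU count reported by some other probe (assumed input).
    """
    # Unset variable: assume every detected GPU is usable.
    if cuda_visible_devices is None:
        return max(detected_gpus, 1)

    value = cuda_visible_devices.strip()

    # Assumption: treat "all" like NVIDIA_VISIBLE_DEVICES=all,
    # i.e. fall back to the detected GPU count instead of counting
    # "all" as a single device entry.
    if value.lower() == "all":
        return max(detected_gpus, 1)

    # Otherwise count the comma-separated device entries (indices or UUIDs).
    devices = [entry for entry in value.split(",") if entry.strip()]

    # Always launch at least one shard, even with no visible GPU (CPU-only run).
    return max(len(devices), 1)
```

For example, `resolve_num_shard("all", 4)` returns 4 and `resolve_num_shard("0,1", 4)` returns 2; with no GPUs detected, the helper falls back to a single shard.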