llama-stack

pytorch CUDA not found in host that has CUDA with working pytorch

Open nikolaydubina opened this issue 1 year ago • 4 comments

I am getting this error.

ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/parallel_utils.py", line 285, in launch_dist_group
    elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
worker_process_entrypoint FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-16_10:06:03
  host      : llama-stack-llama3-2-11b-vision-54cf7f9bfd-rz58g
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 18)
  error_file: /tmp/torchelastic_5pr8utde/018126eb-03bf-42ad-add7-00c1e0e4ec6a_dp_sgnfv/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
      return f(*args, **kwargs)
    File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/parallel_utils.py", line 240, in worker_process_entrypoint
      model = init_model_cb()
    File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/model_parallel.py", line 39, in init_model_cb
      llama = Llama.build(config)
    File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/generation.py", line 83, in build
      torch.distributed.init_process_group("nccl")
    File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
      return func(*args, **kwargs)
    File "/usr/local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
      func_return = func(*args, **kwargs)
    File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1368, in init_process_group
      default_pg, _ = _new_process_group_helper(
    File "/usr/local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1594, in _new_process_group_helper
      backend_class = ProcessGroupNCCL(
    ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
============================================================
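As a quick sanity check inside a failing container like this one, you can verify whether the runtime exposed any NVIDIA device nodes at all, without even importing torch. This is a stdlib-only sketch; the `/dev/nvidia*` heuristic is an assumption about the NVIDIA container runtime (which bind-mounts those device nodes into GPU-enabled containers) and is not part of llama-stack:

```python
import glob


def nvidia_devices_visible() -> list[str]:
    """Return the NVIDIA device nodes exposed inside this container.

    If the list is empty, PyTorch will also see no GPUs, and
    ProcessGroupNCCL fails exactly as in the traceback above.
    """
    return sorted(glob.glob("/dev/nvidia*"))


if __name__ == "__main__":
    devices = nvidia_devices_visible()
    if devices:
        print(f"GPU device nodes visible: {devices}")
    else:
        print("No /dev/nvidia* nodes: container was started without GPU access")
```

If this prints no device nodes but nvidia-smi works on the host, the container was started without GPU access, which is exactly the situation the NCCL error describes.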

Context

I built the image with llama-stack like this:

  1. clone the repo at master
  2. add --platform linux/amd64 to the docker command
  3. build llama-stack into a venv
  4. ./env/bin/llama stack build --template local --image-type docker --name llama-stack

CUDA environment

I confirmed that the CUDA drivers are present using test CUDA images.

+-----------------------------------------------------------------------------------------+                                                                                                                       
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |                                                                                                                       
|-----------------------------------------+------------------------+----------------------+                                                                                                                       
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |                                                                                                                       
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   41C    P8             17W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                          
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

and sample CUDA Pod works too

$ kubectl -n ml logs vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

pods

apiVersion: v1
kind: Pod
metadata:
  name: cuda-info
  namespace: ml
spec:
  restartPolicy: OnFailure
  containers:
    - name: main
      image: cuda:12.4.1-cudnn-devel-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: vector-add
  namespace: ml
spec:
  restartPolicy: OnFailure
  containers:
    - name: main
      image: cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1

CUDA + Pytorch

I confirmed it works on this host.

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-cuda
  namespace: ml
spec:
  containers:
    - name: main
      image: pytorch/pytorch:2.4.1-cuda12.4-cudnn9-devel
      command: ["/bin/sh", "-c", "sleep 1000000"]
      resources:
        limits:
          nvidia.com/gpu: 1

$ kubectl exec -n ml --stdin --tty pytorch-cuda -- /bin/bash
root@pytorch-cuda:/workspace# python3
Python 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.current_device()
0
>>> torch.cuda.device_count() 
1
>>> torch.cuda.get_device_name(0)
'NVIDIA L4'
>>> 
root@pytorch-cuda:/workspace# 

nikolaydubina avatar Oct 16 '24 10:10 nikolaydubina

I suspect something is wrong with the Docker images that llama stack builds; perhaps they do not include CUDA by default?

nikolaydubina avatar Oct 16 '24 10:10 nikolaydubina

Our llamastack-local-gpu docker image comes with CUDA, while llamastack-local-cpu does not. What command are you using to start up the llama stack distribution? You may need to add the --gpus=all flag.

docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus=all llamastack-local-gpu

yanxi0830 avatar Oct 17 '24 00:10 yanxi0830

how I start

I am starting the container in a K8S Pod without a command or arguments, relying on the default entrypoint in the container.
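For comparison, the K8S equivalent of the --gpus=all flag is the nvidia.com/gpu resource limit on the Pod, same as in the test pods above. A sketch of what that would look like for the llama-stack container (the image name and port here are assumptions based on this thread, not a verified manifest):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llama-stack
  namespace: ml
spec:
  containers:
    - name: main
      image: llama-stack        # whatever tag `llama stack build` produced
      ports:
        - containerPort: 5000
      resources:
        limits:
          nvidia.com/gpu: 1     # without this, the device plugin exposes no GPU to the pod
```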

llamastack-local-gpu

I don't think llamastack-local-gpu is used inside the image built with ./env/bin/llama stack build --template local --image-type docker --name llama-stack. Inspecting the Docker commands inside the image built by llama stack shows no references to CUDA or NVIDIA in the layers or commands. So far it appears to me that llama stack builds the image with no GPU support.
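The layer inspection described above can be scripted. A minimal sketch that scans `docker image inspect` JSON output for CUDA/NVIDIA references in the image's environment variables (the env vars in the sample are illustrative of typical GPU-enabled images, not taken from the actual llama-stack image):

```python
import json


def find_cuda_refs(inspect_json: str) -> list[str]:
    """Scan `docker image inspect` JSON output for CUDA/NVIDIA mentions
    in each image's configured environment variables."""
    hits = []
    for image in json.loads(inspect_json):
        for env in image.get("Config", {}).get("Env", []):
            if "CUDA" in env.upper() or "NVIDIA" in env.upper():
                hits.append(f"env: {env}")
    return hits


if __name__ == "__main__":
    # Illustrative sample; in practice, feed in the output of
    # `docker image inspect <image>`.
    sample = json.dumps([{"Config": {"Env": [
        "PATH=/usr/local/cuda/bin:/usr/bin",
        "NVIDIA_VISIBLE_DEVICES=all",
    ]}}])
    print(find_cuda_refs(sample))
```

An empty result on an image that is supposed to run on GPU is a strong hint the image was built without CUDA support.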

nikolaydubina avatar Oct 17 '24 04:10 nikolaydubina

a huggingface pytorch image like this does work with CUDA, so the host is ok, the model is ok, and pytorch+cuda is ok. clearly something is wrong with llama stack images.

here is a working CUDA + Pytorch + Llama 3.2 11B Vision server: https://github.com/nikolaydubina/basic-openai-pytorch-server

btw, it is just 3 files and 100 lines of code

nikolaydubina avatar Oct 17 '24 11:10 nikolaydubina

This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

github-actions[bot] avatar Mar 14 '25 00:03 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant!

github-actions[bot] avatar Apr 14 '25 00:04 github-actions[bot]