Error running example on 2 Nvidia A100 GPUs

Open bhanuc opened this issue 1 year ago • 3 comments

I'm trying to run the 65B model on a vast.ai machine but I'm hitting the error below. Can anyone help me figure out what could be going wrong?

Error log:

Traceback (most recent call last):
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 242, in _lazy_init
    queued_call()
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 125, in _check_capability
    capability = get_device_capability(d)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 375, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/llama-dl/llama/example.py", line 119, in <module>
    fire.Fire(main)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/root/llama-dl/llama/example.py", line 74, in main
    local_rank, world_size = setup_model_parallel()
  File "/root/llama-dl/llama/example.py", line 25, in setup_model_parallel
    torch.cuda.set_device(local_rank)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 246, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.

CUDA call was originally invoked at:

['  File "/root/llama-dl/llama/example.py", line 7, in <module>\n    import torch\n', '  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load\n', '  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked\n', '  File "<frozen importlib._bootst$
Traceback (most recent call last):
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 242, in _lazy_init
    queued_call()
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 125, in _check_capability
    capability = get_device_capability(d)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 375, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.

nvidia-smi output:

Sun Mar  5 15:49:22 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   29C    P0    70W / 400W |    353MiB / 81920MiB |      9%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   26C    P0    62W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
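
The assertion that fails in the log above, device >= 0 && device < num_gpus, is raised from torch.cuda.set_device(local_rank), and the nvidia-smi output lists two GPUs (indices 0 and 1). One possible cause, not confirmed in this thread, is a process started with a local rank outside that range. Below is a minimal sketch of a pre-flight check, assuming the launcher exports LOCAL_RANK for each process (as torchrun does). The helper names check_device_index and local_rank_from_env are hypothetical and not part of the llama repository.

```python
import os


def local_rank_from_env(default: int = -1) -> int:
    """Read the per-process rank from the environment.

    Assumption: the launcher (for example torchrun) sets LOCAL_RANK.
    """
    return int(os.environ.get("LOCAL_RANK", default))


def check_device_index(local_rank: int, num_gpus: int) -> None:
    """Raise a readable error if local_rank is not a valid GPU index.

    Mirrors the condition in the failed assert:
    device >= 0 && device < num_gpus
    """
    if not (0 <= local_rank < num_gpus):
        raise ValueError(
            f"LOCAL_RANK={local_rank} is not a valid device index; "
            f"only {num_gpus} GPU(s) are visible"
        )


# Hypothetical usage before selecting a device:
#   import torch
#   check_device_index(local_rank_from_env(), torch.cuda.device_count())
#   torch.cuda.set_device(local_rank_from_env())
```

With two visible GPUs, ranks 0 and 1 pass the check and any other rank raises ValueError with both numbers in the message, which is easier to read than the deferred CUDA assertion.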

bhanuc, Mar 05 '23 21:03