llama
Error running example on 2 Nvidia A100 GPUs
I'm trying to run the 65B model on a vast.ai machine but I'm hitting the error below. Can anyone help me figure out what could be going wrong?
Error log -
Traceback (most recent call last):
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 242, in _lazy_init
    queued_call()
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 125, in _check_capability
    capability = get_device_capability(d)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 375, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/root/llama-dl/llama/example.py", line 119, in <module>
    fire.Fire(main)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/root/llama-dl/llama/example.py", line 74, in main
    local_rank, world_size = setup_model_parallel()
  File "/root/llama-dl/llama/example.py", line 25, in setup_model_parallel
    torch.cuda.set_device(local_rank)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 246, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.
CUDA call was originally invoked at:
[' File "/root/llama-dl/llama/example.py", line 7, in <module>\n import torch\n', ' File "<frozen importlib._bootstrap>", line 1027, in _find_and_load\n', ' File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked\n', ' File "<frozen importlib._bootst$
Traceback (most recent call last):
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 242, in _lazy_init
    queued_call()
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 125, in _check_capability
    capability = get_device_capability(d)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 357, in get_device_capability
    prop = get_device_properties(device)
  File "/root/anaconda3/envs/ENVNAME/lib/python3.10/site-packages/torch/cuda/__init__.py", line 375, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.
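The assertion `device >= 0 && device < num_gpus` fires inside `torch.cuda.set_device(local_rank)`, which suggests the rank being set is outside the range of GPUs PyTorch can see. Here is a minimal sketch of a check for that, not a confirmed diagnosis. It assumes `local_rank` comes from the `LOCAL_RANK` environment variable that `torchrun` sets for each process; the helper name `rank_problem` and the `-1` default are hypothetical.

```python
import os
from typing import Optional


def rank_problem(local_rank: int, num_gpus: int) -> Optional[str]:
    """Describe why local_rank cannot be used as a device index, or return None."""
    if num_gpus <= 0:
        return "no CUDA devices are visible to this process"
    if not 0 <= local_rank < num_gpus:
        return (
            f"local_rank {local_rank} is outside the visible device range "
            f"0..{num_gpus - 1}"
        )
    return None


if __name__ == "__main__":
    try:
        # torch is imported only here so the helper above stays dependency-free.
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        # Assumption: the launcher exports LOCAL_RANK; -1 marks "not set".
        local_rank = int(os.environ.get("LOCAL_RANK", "-1"))
        problem = rank_problem(local_rank, torch.cuda.device_count())
        print(problem or f"local_rank {local_rank} is a valid device index")
```

Running this under the same launcher and process count as `example.py` would show whether any process gets a rank of 2 or higher on this 2-GPU machine, for instance if more processes were started than there are GPUs.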
nvidia-smi output -
Sun Mar  5 15:49:22 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   29C    P0    70W / 400W |    353MiB / 81920MiB |      9%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   26C    P0    62W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |