stanford_alpaca
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Python: 3.9.5, Ubuntu 20.04.6 LTS
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2023-05-08 06:00:12,749] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/opt/stanford_alpaca/train.py", line 222, in <module>
    train()
  File "/opt/stanford_alpaca/train.py", line 184, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/opt/venv/lib/python3.9/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 112, in __init__
  File "/opt/venv/lib/python3.9/site-packages/transformers/training_args.py", line 1259, in __post_init__
    and (self.device.type != "cuda")
  File "/opt/venv/lib/python3.9/site-packages/transformers/training_args.py", line 1694, in device
    return self._setup_devices
  File "/opt/venv/lib/python3.9/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/opt/venv/lib/python3.9/site-packages/transformers/training_args.py", line 1626, in _setup_devices
    deepspeed.init_distributed(timeout=timedelta(seconds=self.ddp_timeout))
  File "/opt/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 624, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/opt/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 60, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/opt/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 86, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/opt/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "/opt/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(backend_prefix_store, group_rank, group_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
(identical tracebacks from the other local ranks, interleaved in the original output, omitted)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9546) of binary: /opt/venv/bin/python3.9
Traceback (most recent call last):
File "/opt/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/opt/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
Why did I get this error, and how can I solve it? Please help me.
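For anyone hitting this on a CPU-only box: NCCL is the GPU-only collective backend, and the DeepSpeed path in the traceback initializes it by default, so `init_process_group` fails as soon as no CUDA device is found. A minimal sketch of the distinction, using gloo (the CPU backend shipped in every PyTorch wheel) in a single-process group; whether the Alpaca/DeepSpeed training script itself can be switched to gloo is a separate question:

```python
import os
import torch
import torch.distributed as dist

# NCCL needs at least one CUDA device; on a CPU-only machine this is False,
# which is exactly why ProcessGroupNCCL raises the error above.
print("CUDA available:", torch.cuda.is_available())

# gloo works without GPUs. MASTER_ADDR/MASTER_PORT are normally set by
# torchrun; we set them here only to make the sketch self-contained.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Sanity check: an all-reduce over CPU tensors succeeds under gloo.
t = torch.ones(3)
dist.all_reduce(t)
print(t)  # unchanged, since world_size == 1

dist.destroy_process_group()
```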
Same error here. Any help please? Can't we train this model on a high-end CPU machine?
Machine config: Ubuntu 18.04.6 LTS, CPU(s): 16, Thread(s) per core: 2, RAM: 58 GB
Python version: 3.9.16
We can run it on CPU as well by appending the following args to the training command:
--no_cuda --xpu_backend mpi
But now I'm facing an MPI-related issue:
RuntimeError: Distributed package doesn't have MPI built in. MPI is only included if you build PyTorch from source on a host that has MPI installed.
Any help on this?
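That error means what it says: the standard pip/conda PyTorch wheels are built without MPI, and the MPI backend only exists if you compile PyTorch from source on a host with an MPI installation. You can check which distributed backends your build actually includes before choosing one (`is_gloo_available` may require a reasonably recent PyTorch; the other two have been around for a long time):

```python
import torch.distributed as dist

# Each call reports whether that backend was compiled into this build.
print("mpi: ", dist.is_mpi_available())   # typically False for pip/conda wheels
print("nccl:", dist.is_nccl_available())  # True only on CUDA builds
print("gloo:", dist.is_gloo_available())  # CPU backend, normally True
```

If gloo is the only CPU backend available, that is the one to pass instead of mpi.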