
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

MyIcecream opened this issue 2 years ago · 2 comments

Python: 3.9.5, Ubuntu 20.04.6 LTS
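For context, a quick way to confirm whether PyTorch can actually see any CUDA devices in this environment (just a diagnostic sketch, not part of the training command itself):

```shell
# Does this PyTorch build/environment see any GPUs?
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# On a machine that triggers the error below, this should print: False 0
```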

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2023-05-08 06:00:12,749] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/opt/stanford_alpaca/train.py", line 222, in <module>
    train()
  File "/opt/stanford_alpaca/train.py", line 184, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/opt/venv/lib/python3.9/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 112, in __init__
  File "/opt/venv/lib/python3.9/site-packages/transformers/training_args.py", line 1259, in __post_init__
    and (self.device.type != "cuda")
  File "/opt/venv/lib/python3.9/site-packages/transformers/training_args.py", line 1694, in device
    return self._setup_devices
  File "/opt/venv/lib/python3.9/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/opt/venv/lib/python3.9/site-packages/transformers/training_args.py", line 1626, in _setup_devices
    deepspeed.init_distributed(timeout=timedelta(seconds=self.ddp_timeout))
  File "/opt/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 624, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/opt/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 60, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/opt/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 86, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/opt/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "/opt/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(backend_prefix_store, group_rank, group_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9546) of binary: /opt/venv/bin/python3.9
Traceback (most recent call last):
  File "/opt/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED

Why did I get this error, and how can I solve it? Please help me.

MyIcecream · May 08 '23 06:05

Same error. Any help please? Can't we train this model on a high-end CPU machine?

Machine config: Ubuntu 18.04.6 LTS, CPU(s): 16, Thread(s) per core: 2, RAM: 58 GB

Python version: Python 3.9.16

chintan-donda · May 08 '23 09:05

We can run it on CPU as well by appending the following args at the end of the training command: --no_cuda --xpu_backend mpi
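Roughly, the modified launch would look like the sketch below (the process count, port, and paths are placeholders standing in for whatever you already pass; only the last two flags are the actual change):

```shell
# Same training command as before, with the two CPU-related flags appended at the end.
# All other training args you normally pass go before them, unchanged.
torchrun --nproc_per_node=<num_procs> --master_port=<port> train.py \
    --model_name_or_path <path_to_converted_llama_checkpoint> \
    --data_path <path_to_training_data.json> \
    --output_dir <output_dir> \
    --no_cuda \
    --xpu_backend mpi
```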

But I'm facing issues related to MPI: RuntimeError: Distributed package doesn't have MPI built in. MPI is only included if you build PyTorch from source on a host that has MPI installed.
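To confirm, you can check which distributed backends your PyTorch build was compiled with (again just a quick diagnostic sketch; mpi will report False unless PyTorch was built from source with MPI support, which matches the error above for the standard pip/conda wheels):

```shell
# Which torch.distributed backends does this build support?
python3 -c "import torch.distributed as dist; print('mpi:', dist.is_mpi_available(), 'gloo:', dist.is_gloo_available(), 'nccl:', dist.is_nccl_available())"
```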

Any help on this?

chintan-donda · May 08 '23 11:05