Qwen3
I ran into an error while fine-tuning the model with sft/finetune.py.
The training log:
[2024-06-24 10:46:00,637] torch.distributed.run: [WARNING]
[2024-06-24 10:46:00,637] torch.distributed.run: [WARNING] *****************************************
[2024-06-24 10:46:00,637] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-06-24 10:46:00,637] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:6001 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:6001 (errno: 97 - Address family not supported by protocol).
[2024-06-24 10:46:02,999] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-06-24 10:46:03,850] [INFO] [comm.py:637:init_distributed] cdb=None
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:6001 (errno: 97 - Address family not supported by protocol).
Traceback (most recent call last):
File "/data1/caomy/Qwen_information/Qwen2/examples/sft/finetune.py", line 383, in <module>
train()
File "/data1/caomy/Qwen_information/Qwen2/examples/sft/finetune.py", line 274, in train
) = parser.parse_args_into_dataclasses()
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 128, in __init__
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 1605, in __post_init__
and (self.device.type != "cuda")
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 2094, in device
return self._setup_devices
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
cached = self.fget(obj)
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 2022, in _setup_devices
self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/accelerate/state.py", line 280, in __init__
self.set_device()
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/accelerate/state.py", line 790, in set_device
torch.cuda.set_device(self.device)
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/cuda/__init__.py", line 408, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2024-06-24 10:46:03,997] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "/data1/caomy/Qwen_information/Qwen2/examples/sft/finetune.py", line 383, in <module>
train()
File "/data1/caomy/Qwen_information/Qwen2/examples/sft/finetune.py", line 274, in train
) = parser.parse_args_into_dataclasses()
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 128, in __init__
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 1605, in __post_init__
and (self.device.type != "cuda")
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 2094, in device
return self._setup_devices
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
cached = self.fget(obj)
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 2022, in _setup_devices
self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/accelerate/state.py", line 292, in __init__
raise NotImplementedError(
NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
[2024-06-24 10:46:05,658] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2984145) of binary: /data2/caomy/envs/Qqwen2/bin/python
Traceback (most recent call last):
File "/data2/caomy/envs/Qqwen2/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2984146)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2984147)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 2984148)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 2984149)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 2984150)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 2984151)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 2984152)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2984145)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:6001 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:6001 (errno: 97 - Address family not supported by protocol).
...
RuntimeError: CUDA error: invalid device ordinal
...
NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
It appears that the network backend has failed. Please check that the DDP-related arguments are set properly, or report the problem to accelerate so they can help debug your environment.
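A minimal sketch of the workaround suggested by the two errors, assuming a single node with 8 local GPUs and master port 6001 as in the log (the launch command and flag values below are illustrative, not taken from the original thread):

```shell
# NotImplementedError for RTX 4000 series: disable P2P and InfiniBand
# transports in NCCL, as the accelerate error message suggests.
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

# "CUDA error: invalid device ordinal" means a rank asked for a GPU index
# that does not exist; make the visible devices match the number of
# processes you launch (assumption: 8 GPUs on this node).
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# errno 97 on [::]:6001 only means IPv6 is unavailable; it is a warning,
# not the failure. Pointing the rendezvous at an IPv4 address silences it.
# Launch (illustrative; substitute your own script path and training flags):
#   torchrun --nproc_per_node 8 --master_addr 127.0.0.1 --master_port 6001 \
#       examples/sft/finetune.py ...
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE NCCL_IB_DISABLE=$NCCL_IB_DISABLE"
```

Alternatively, launching with `accelerate launch` sets the two NCCL variables automatically, per the error message.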
Has this been resolved?
This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.