Qwen3
I ran into an error while fine-tuning the model with sft/finetune.py.
The training log:
[2024-06-24 10:46:00,637] torch.distributed.run: [WARNING]
[2024-06-24 10:46:00,637] torch.distributed.run: [WARNING] *****************************************
[2024-06-24 10:46:00,637] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-06-24 10:46:00,637] torch.distributed.run: [WARNING] *****************************************
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:6001 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:6001 (errno: 97 - Address family not supported by protocol).
[2024-06-24 10:46:02,999] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-06-24 10:46:03,850] [INFO] [comm.py:637:init_distributed] cdb=None
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:6001 (errno: 97 - Address family not supported by protocol).
Traceback (most recent call last):
File "/data1/caomy/Qwen_information/Qwen2/examples/sft/finetune.py", line 383, in <module>
train()
File "/data1/caomy/Qwen_information/Qwen2/examples/sft/finetune.py", line 274, in train
) = parser.parse_args_into_dataclasses()
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 128, in __init__
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 1605, in __post_init__
and (self.device.type != "cuda")
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 2094, in device
return self._setup_devices
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
cached = self.fget(obj)
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 2022, in _setup_devices
self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/accelerate/state.py", line 280, in __init__
self.set_device()
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/accelerate/state.py", line 790, in set_device
torch.cuda.set_device(self.device)
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/cuda/__init__.py", line 408, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2024-06-24 10:46:03,997] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "/data1/caomy/Qwen_information/Qwen2/examples/sft/finetune.py", line 383, in <module>
train()
File "/data1/caomy/Qwen_information/Qwen2/examples/sft/finetune.py", line 274, in train
) = parser.parse_args_into_dataclasses()
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 128, in __init__
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 1605, in __post_init__
and (self.device.type != "cuda")
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 2094, in device
return self._setup_devices
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/utils/generic.py", line 63, in __get__
cached = self.fget(obj)
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/transformers/training_args.py", line 2022, in _setup_devices
self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/accelerate/state.py", line 292, in __init__
raise NotImplementedError(
NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
[2024-06-24 10:46:05,658] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2984145) of binary: /data2/caomy/envs/Qqwen2/bin/python
Traceback (most recent call last):
File "/data2/caomy/envs/Qqwen2/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data2/caomy/envs/Qqwen2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2984146)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2984147)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 2984148)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 2984149)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 2984150)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 2984151)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 2984152)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-24_10:46:05
host : gpu-15.ld-hadoop.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2984145)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:6001 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:6001 (errno: 97 - Address family not supported by protocol).
...
RuntimeError: CUDA error: invalid device ordinal
...
NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1" or use `accelerate launch` which will do this automatically.
It appears that the network backend has failed. Please check that the DDP-related arguments are set properly, or report the problem to accelerate so they can help debug your environment.
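A minimal sketch of the workaround suggested by the two errors, assuming a single node with 8 local GPUs and master port 6001 as in the log (the launch command and flag values below are illustrative, not taken from the original thread):

```shell
# NotImplementedError for RTX 4000 series: disable P2P and InfiniBand
# transports in NCCL, as the accelerate error message suggests.
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

# "CUDA error: invalid device ordinal" means a rank asked for a GPU index
# that does not exist; make the visible devices match the number of
# processes you launch (assumption: 8 GPUs on this node).
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# errno 97 on [::]:6001 only means IPv6 is unavailable; it is a warning,
# not the failure. Pointing the rendezvous at an IPv4 address silences it.
# Launch (illustrative; substitute your own script path and training flags):
#   torchrun --nproc_per_node 8 --master_addr 127.0.0.1 --master_port 6001 \
#       examples/sft/finetune.py ...
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE NCCL_IB_DISABLE=$NCCL_IB_DISABLE"
```

Alternatively, launching with `accelerate launch` sets the two NCCL variables automatically, per the error message.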
Has this been resolved?
This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.