KG_RAG

OSError: [Errno 101] Network is unreachable

Open PetrichorHlacyon opened this issue 7 months ago • 0 comments

When I run the SFT script (sft/sft.py, used to fine-tune a large language model), the following error occurs. I don't know how to solve it and would appreciate any help!

[2025-03-29 11:31:28,625] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2025-03-29 11:31:28,625] [INFO] [comm.py:616:init_distributed] cdb=None
[2025-03-29 11:31:28,625] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
[rank0]:     conn = connection.create_connection(
[rank0]:   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 96, in create_connection
[rank0]:     raise err
[rank0]:   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 86, in create_connection
[rank0]:     sock.connect(sa)
[rank0]: OSError: [Errno 101] Network is unreachable

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 700, in urlopen
[rank0]:     httplib_response = self._make_request(
[rank0]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
[rank0]:     self._validate_conn(conn)
[rank0]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 1017, in _validate_conn
[rank0]:     conn.connect()
[rank0]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 353, in connect
[rank0]:     conn = self._new_conn()
[rank0]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 181, in _new_conn
[rank0]:     raise NewConnectionError(
[rank0]: urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7fdc363a0c70>: Failed to establish a new connection: [Errno 101] Network is unreachable

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 667, in send
[rank0]:     resp = conn.urlopen(
[rank0]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
[rank0]:     retries = retries.increment(
[rank0]:   File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 576, in increment
[rank0]:     raise MaxRetryError(_pool, url, error or ResponseError(cause))
[rank0]: urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /Qwen1.5-14B-Chat/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fdc363a0c70>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1376, in _get_metadata_or_catch_error
[rank0]:     metadata = get_hf_file_metadata(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1296, in get_hf_file_metadata
[rank0]:     r = _request_wrapper(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 280, in _request_wrapper
[rank0]:     response = _request_wrapper(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 303, in _request_wrapper
[rank0]:     response = get_session().request(method=method, url=url, **params)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 589, in request
[rank0]:     resp = self.send(prep, **send_kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 703, in send
[rank0]:     r = adapter.send(request, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_http.py", line 96, in send
[rank0]:     return super().send(request, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/requests/adapters.py", line 700, in send
[rank0]:     raise ConnectionError(e, request=request)
[rank0]: requests.exceptions.ConnectionError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /Qwen1.5-14B-Chat/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fdc363a0c70>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"), '(Request ID: 34b892b5-cb11-49f0-9b0e-6d5eedc96f0f)')

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 402, in cached_file
[rank0]:     resolved_file = hf_hub_download(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 862, in hf_hub_download
[rank0]:     return _hf_hub_download_to_cache_dir(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 969, in _hf_hub_download_to_cache_dir
[rank0]:     _raise_on_head_call_error(head_call_error, force_download, local_files_only)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1489, in _raise_on_head_call_error
[rank0]:     raise LocalEntryNotFoundError(
[rank0]: huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/LLMOPT/LLMOPT/./sft/sft.py", line 419, in <module>
[rank0]:     train()
[rank0]:   File "/usr/LLMOPT/LLMOPT/./sft/sft.py", line 338, in train
[rank0]:     config = transformers.AutoConfig.from_pretrained(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 965, in from_pretrained
[rank0]:     config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 632, in get_config_dict
[rank0]:     config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 689, in _get_config_dict
[rank0]:     resolved_config_file = cached_file(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 445, in cached_file
[rank0]:     raise EnvironmentError(
[rank0]: OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like Qwen1.5-14B-Chat is not the path to a directory containing a file named config.json.
[rank0]: Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
[rank0]:[W329 11:33:59.322648127 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0329 11:33:59.797000 1262 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1294) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/blueart/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/blueart/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/home/blueart/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/home/blueart/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/blueart/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
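
The final `OSError` names three conditions that all failed: the Hub was unreachable, the file was not in the local cache, and `Qwen1.5-14B-Chat` was not a directory containing `config.json`. On a machine without network access, the third condition is the one that can be checked locally. Below is a minimal sketch of such a check. `require_local_model_dir` is a hypothetical helper, not part of `sft.py` or `transformers`, and it assumes the model files have already been copied to the machine.

```python
from pathlib import Path


def require_local_model_dir(name_or_path: str) -> Path:
    """Return the resolved directory if it contains config.json.

    Hypothetical helper (not part of sft.py or transformers): it fails fast
    with a readable message instead of falling through to a Hub request.
    """
    candidate = Path(name_or_path).expanduser()
    if (candidate / "config.json").is_file():
        return candidate.resolve()
    raise FileNotFoundError(
        f"{name_or_path!r} is not a local directory containing config.json; "
        "pass the full path to the downloaded model folder, or pre-populate "
        "the Hugging Face cache on a machine that has network access."
    )


# Possible usage before the from_pretrained call (the path is a placeholder):
#   local_dir = require_local_model_dir("/path/to/Qwen1.5-14B-Chat")
#   config = transformers.AutoConfig.from_pretrained(local_dir, local_files_only=True)
```

With a verified local directory, passing `local_files_only=True` (or exporting `HF_HUB_OFFLINE=1` before launching `torchrun`) should keep `from_pretrained` from contacting `huggingface.co`, as described on the offline-mode page linked in the error message. Whether `sft.py` exposes the model path as a command-line argument is not shown in the traceback, so where to apply the check is left open.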

PetrichorHlacyon, Mar 29 '25 03:03