huggingface-cli: [Error 104] Connection reset by peer - crashes application

Open Galunid opened this issue 11 months ago • 6 comments

Describe the bug

When downloading large models (roughly 72B parameters and up) with the huggingface-cli command, the download is commonly interrupted by [Errno 104] Connection reset by peer. This crashes the application instead of retrying after a delay, and the command then has to be restarted manually.

Reproduction

Run the following command: huggingface-cli download --local-dir Qwen2.5-72B-Instruct Qwen/Qwen2.5-72B-Instruct

Logs

Traceback (most recent call last):
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
        conn,
    ...<10 lines>...
        **response_kw,
    )
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/urllib3/connection.py", line 516, in getresponse
    httplib_response = super().getresponse()
  File "/usr/lib/python3.13/http/client.py", line 1428, in getresponse
    response.begin()
    ~~~~~~~~~~~~~~^^
  File "/usr/lib/python3.13/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ~~~~~~~~~~~~~~~~~^^
  File "/usr/lib/python3.13/http/client.py", line 292, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/socket.py", line 719, in readinto
    return self._sock.recv_into(b)
           ~~~~~~~~~~~~~~~~~~~~^^^
  File "/usr/lib/python3.13/ssl.py", line 1304, in recv_into
    return self.read(nbytes, buffer)
           ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/ssl.py", line 1138, in read
    return self._sslobj.read(len, buffer)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
        method=request.method,
    ...<9 lines>...
        chunked=chunked,
    )
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
        method, url, error=new_e, _pool=self, _stacktrace=sys.exc_info()[2]
    )
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/urllib3/util/retry.py", line 474, in increment
    raise reraise(type(error), error, _stacktrace)
          ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
        conn,
    ...<10 lines>...
        **response_kw,
    )
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/urllib3/connection.py", line 516, in getresponse
    httplib_response = super().getresponse()
  File "/usr/lib/python3.13/http/client.py", line 1428, in getresponse
    response.begin()
    ~~~~~~~~~~~~~~^^
  File "/usr/lib/python3.13/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ~~~~~~~~~~~~~~~~~^^
  File "/usr/lib/python3.13/http/client.py", line 292, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/socket.py", line 719, in readinto
    return self._sock.recv_into(b)
           ~~~~~~~~~~~~~~~~~~~~^^^
  File "/usr/lib/python3.13/ssl.py", line 1304, in recv_into
    return self.read(nbytes, buffer)
           ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/ssl.py", line 1138, in read
    return self._sslobj.read(len, buffer)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/file_download.py", line 1374, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
        url=url, proxies=proxies, timeout=etag_timeout, headers=headers, token=token
    )
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/file_download.py", line 1294, in get_hf_file_metadata
    r = _request_wrapper(
        method="HEAD",
    ...<5 lines>...
        timeout=timeout,
    )
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/file_download.py", line 278, in _request_wrapper
    response = _request_wrapper(
        method=method,
    ...<2 lines>...
        **params,
    )
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/file_download.py", line 301, in _request_wrapper
    response = get_session().request(method=method, url=url, **params)
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/utils/_http.py", line 93, in send
    return super().send(request, *args, **kwargs)
           ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/requests/adapters.py", line 682, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), '(Request ID: d6845208-9703-4264-891a-4bfa9237024f)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kris/.local/bin/huggingface-cli", line 8, in <module>
    sys.exit(main())
             ~~~~^^
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/commands/huggingface_cli.py", line 57, in main
    service.run()
    ~~~~~~~~~~~^^
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/commands/download.py", line 153, in run
    print(self._download())  # Print path to downloaded files
          ~~~~~~~~~~~~~~^^
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/commands/download.py", line 187, in _download
    return snapshot_download(
        repo_id=self.repo_id,
    ...<10 lines>...
        max_workers=self.max_workers,
    )
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/_snapshot_download.py", line 296, in snapshot_download
    thread_map(
    ~~~~~~~~~~^
        _inner_hf_hub_download,
        ^^^^^^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
        tqdm_class=tqdm_class or hf_tqdm,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
               ^^^^^^^^
  File "/usr/lib/python3.13/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/usr/lib/python3.13/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ~~~~~~~~~~^^^^^^^^^
  File "/usr/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/usr/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.13/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/_snapshot_download.py", line 270, in _inner_hf_hub_download
    return hf_hub_download(
        repo_id,
    ...<15 lines>...
        headers=headers,
    )
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/file_download.py", line 840, in hf_hub_download
    return _hf_hub_download_to_local_dir(
        # Destination
    ...<15 lines>...
        local_files_only=local_files_only,
    )
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/file_download.py", line 1089, in _hf_hub_download_to_local_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kris/.local/pipx/venvs/huggingface-hub/lib/python3.13/site-packages/huggingface_hub/file_download.py", line 1485, in _raise_on_head_call_error
    raise LocalEntryNotFoundError(
    ...<3 lines>...
    ) from head_call_error
huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

System info

- huggingface_hub version: 0.28.1
- Platform: Linux-6.12.10-arch1-1-x86_64-with-glibc2.40
- Python version: 3.13.1
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /home/kris/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: Galunid
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: N/A
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: N/A
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: N/A
- pydantic: N/A
- aiohttp: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/kris/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/kris/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/kris/.cache/huggingface/token
- HF_STORED_TOKENS_PATH: /home/kris/.cache/huggingface/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

Galunid avatar Feb 02 '25 00:02 Galunid

Hi @Galunid, sorry for the inconvenience. These connection reset errors are usually non-deterministic and can have various causes, for example a temporary network outage or an unstable internet connection. Just in case, is your network behind a proxy or protected by a firewall? You can retry the download, but to be honest, it's hard to investigate and give more specific guidance on this 😕

hanouticelina avatar Feb 03 '25 09:02 hanouticelina

Hi, there's no proxy/firewall.

Galunid avatar Feb 05 '25 15:02 Galunid

@Galunid is your problem resolved? What was the root cause? I have the same issue.

manitadayon avatar Mar 23 '25 09:03 manitadayon

@manitadayon I think there's an issue in how huggingface-cli handles exceptions. I wrote a small patch to file_download.py:

def retry_download(local_dir, repo_id, repo_type, filename, revision, endpoint, etag_timeout, hf_headers, proxies, token, cache_dir, force_download, local_files_only):
    try:
        return _hf_hub_download_to_local_dir(
                # Destination
                local_dir=local_dir,
                # File info
                repo_id=repo_id,
                repo_type=repo_type,
                filename=filename,
                revision=revision,
                # HTTP info
                endpoint=endpoint,
                etag_timeout=etag_timeout,
                headers=hf_headers,
                proxies=proxies,
                token=token,
                # Additional options
                cache_dir=cache_dir,
                force_download=force_download,
                local_files_only=local_files_only,
            )
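    except Exception:  # not a bare except, which would also catch KeyboardInterrupt
        logger.info("Restarting download!")
        time.sleep(60)  # needs `import time` if file_download.py does not already import it
        # Return the retried result; without `return`, a successful retry would yield None
        return retry_download(local_dir, repo_id, repo_type, filename, revision, endpoint, etag_timeout, hf_headers, proxies, token, cache_dir, force_download, local_files_only)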

and changed hf_hub_download function like so:

    if local_dir is not None:
        if local_dir_use_symlinks != "auto":
            warnings.warn(
                "`local_dir_use_symlinks` parameter is deprecated and will be ignored. "
                "The process to download files to a local folder has been updated and do "
                "not rely on symlinks anymore. You only need to pass a destination folder "
                "as`local_dir`.\n"
                "For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder."
            )
        
        return retry_download(local_dir, repo_id, repo_type, filename, revision, endpoint, etag_timeout, hf_headers, proxies, token, cache_dir, force_download, local_files_only)
#         return _hf_hub_download_to_local_dir(
#             # Destination
#             local_dir=local_dir,
#             # File info
#             repo_id=repo_id,
#             repo_type=repo_type,
#             filename=filename,
#             revision=revision,
#             # HTTP info
#             endpoint=endpoint,
#             etag_timeout=etag_timeout,
#             headers=hf_headers,
#             proxies=proxies,
#             token=token,
#             # Additional options
#             cache_dir=cache_dir,
#             force_download=force_download,
#             local_files_only=local_files_only,
#         )

This solution is good enough for me, but not good enough to contribute. Rather than catching all exceptions you can probably just catch LocalEntryNotFoundError.

Galunid avatar Mar 23 '25 18:03 Galunid

@Galunid Thanks. Any idea why this is happening? It started happening to me a couple of days ago. The problem only occurs on some platforms (such as Databricks) but not on Google Colab using the same internet connection.

manitadayon avatar Mar 23 '25 18:03 manitadayon

@manitadayon It's hard to say. Either your network connection is unstable and the Hugging Face server decides it was dropped on your side and closes it, or there's some issue on the server side. I'd assume the problem is more likely with your connection, or in this case the Databricks -> Hugging Face route. It's unlikely you can do anything to avoid those errors. I think the sane solution is to simply wait a bit and retry, which is what my code does. Obviously it lacks certain "features" such as exiting after n retries, handling only specific exceptions, doing the same for the cache dir download, and so on.
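
The missing "features" listed above (a retry limit and handling only specific exceptions) can be sketched without patching the library, by wrapping the public call from user code. This is a rough illustration, not huggingface_hub's own behavior: `call_with_retries` and its parameters are hypothetical names, the doubling delay is an assumption added here, and the loop is iterative rather than recursive so it cannot hit the recursion limit.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying only on the listed exceptions, with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Possible usage (not executed here; needs huggingface_hub and network access):
#
# from huggingface_hub import snapshot_download
# from huggingface_hub.errors import LocalEntryNotFoundError
#
# path = call_with_retries(
#     lambda: snapshot_download(
#         repo_id="Qwen/Qwen2.5-72B-Instruct",
#         local_dir="Qwen2.5-72B-Instruct",
#     ),
#     retry_on=(LocalEntryNotFoundError,),
# )
```

Any exception not listed in `retry_on` propagates immediately, so a typo in the repo name or an authentication failure would not be retried for minutes before surfacing.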

Galunid avatar Mar 23 '25 22:03 Galunid