
Connection error to the Hugging Face dataset Hub due to SSLError with proxy

Open leemgs opened this issue 3 years ago • 13 comments

Describe the bug

It's weird. I cannot connect to the Hugging Face dataset Hub from my office because of an SSLError. Even when I set my company's proxy address (e.g., http_proxy and https_proxy), I still get the SSLError. What should I do to download a dataset stored on the Hugging Face Hub normally? I welcome any comments; I think they will be helpful to me.

  • Dataset address - https://huggingface.co/datasets/moyix/debian_csrc/viewer/moyix--debian_csrc
  • Log message
      ............ OMISSION ..............
Traceback (most recent call last):
  File "/data/home/geunsik-lim/qtlab/./transformers/examples/pytorch/language-modeling/run_clm.py", line 587, in <module>
    main()
  File "/data/home/geunsik-lim/qtlab/./transformers/examples/pytorch/language-modeling/run_clm.py", line 278, in main
    raw_datasets = load_dataset(
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1719, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1497, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1222, in dataset_module_factory
    raise e1 from None
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1179, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
ConnectionError: Couldn't reach 'moyix/debian_csrc' on the Hub (SSLError)
[2022-11-07 15:23:38,476] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 6760
[2022-11-07 15:23:38,476] [ERROR] [launch.py:324:sigkill_handler] ['/home/geunsik-lim/anaconda3/envs/deepspeed/bin/python', '-u', './transformers/examples/pytorch/language-modeling/run_clm.py', '--local_rank=0', '--model_name_or_path=Salesforce/codegen-350M-multi', '--per_device_train_batch_size=1', '--learning_rate', '2e-5', '--num_train_epochs', '1', '--output_dir=./codegen-350M-finetuned', '--overwrite_output_dir', '--dataset_name', 'moyix/debian_csrc', '--cache_dir', '/data/home/geunsik-lim/.cache', '--tokenizer_name', 'Salesforce/codegen-350M-multi', '--block_size', '2048', '--gradient_accumulation_steps', '32', '--do_train', '--fp16', '--deepspeed', 'ds_config_zero2.json'] exits with return code = 1

real    0m7.742s
user    0m4.930s

Steps to reproduce the bug

Steps to reproduce this behavior.

(deepspeed) geunsik-lim@ai02:~/qtlab$ ./test_debian_csrc_dataset.py
Traceback (most recent call last):
  File "/data/home/geunsik-lim/qtlab/./test_debian_csrc_dataset.py", line 6, in <module>
    dataset = load_dataset("moyix/debian_csrc")
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1719, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1497, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1222, in dataset_module_factory
    raise e1 from None
  File "/home/geunsik-lim/anaconda3/envs/deepspeed/lib/python3.10/site-packages/datasets/load.py", line 1179, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
ConnectionError: Couldn't reach 'moyix/debian_csrc' on the Hub (SSLError)
(deepspeed) geunsik-lim@ai02:~/qtlab$
(deepspeed) geunsik-lim@ai02:~/qtlab$
(deepspeed) geunsik-lim@ai02:~/qtlab$
(deepspeed) geunsik-lim@ai02:~/qtlab$ cat ./test_debian_csrc_dataset.py
#!/usr/bin/env python
from datasets import load_dataset
dataset = load_dataset("moyix/debian_csrc")

  1. Add the company proxy address in /etc/profile.
  2. Download the dataset with the load_dataset() function of the datasets package provided by Hugging Face.
  3. In this case, the dataset address is "moyix/debian_csrc".
  4. I get the "ConnectionError: Couldn't reach 'moyix/debian_csrc' on the Hub (SSLError)" error message.

Expected behavior

The dataset should download successfully. Instead, I get this error message:

  • ConnectionError: Couldn't reach 'moyix/debian_csrc' on the Hub (SSLError)

Environment info

  • software version information:
(deepspeed) geunsik-lim@ai02:~$
(deepspeed) geunsik-lim@ai02:~$ conda list -f pytorch
# packages in environment at /home/geunsik-lim/anaconda3/envs/deepspeed:
#
# Name                    Version                   Build  Channel
pytorch                   1.13.0          py3.10_cuda11.7_cudnn8.5.0_0    pytorch
(deepspeed) geunsik-lim@ai02:~$ conda list -f python
# packages in environment at /home/geunsik-lim/anaconda3/envs/deepspeed:
#
# Name                    Version                   Build  Channel
python                    3.10.6               haa1d7c7_1
(deepspeed) geunsik-lim@ai02:~$ conda list -f datasets
# packages in environment at /home/geunsik-lim/anaconda3/envs/deepspeed:
#
# Name                    Version                   Build  Channel
datasets                  2.6.1                      py_0    huggingface
(deepspeed) geunsik-lim@ai02:~$ uname -a
Linux ai02 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
(deepspeed) geunsik-lim@ai02:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"

leemgs avatar Nov 07 '22 06:11 leemgs

Hi! It looks like an issue with your Python environment. Can you make sure you're able to run GET requests to https://huggingface.co using requests in Python?

lhoestq avatar Nov 09 '22 13:11 lhoestq
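A minimal sketch of such a connectivity check might look like the following. The helper name hub_reachable and the metadata URL are illustrative, not part of the original thread; the Hub's public dataset API endpoint is used so a failure here points at the network/TLS layer rather than at datasets itself:

```python
import requests

# Illustrative URL: the Hub's public metadata endpoint for the dataset.
HUB_URL = "https://huggingface.co/api/datasets/moyix/debian_csrc"

def hub_reachable(session: requests.Session, url: str = HUB_URL,
                  timeout: float = 10.0) -> bool:
    """Return True if a GET to the Hub succeeds without an SSL/connection error."""
    try:
        return session.get(url, timeout=timeout).ok
    except (requests.exceptions.SSLError, requests.exceptions.ConnectionError):
        return False

# requests picks up http_proxy/https_proxy from the environment automatically,
# so running this behind the proxy exercises the same code path as datasets:
# print(hub_reachable(requests.Session()))
```

If this returns False with an SSLError, the problem is the proxy's TLS interception rather than the datasets library.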

Thanks for your reply. Does this mean that I have to use the load_dataset function and the requests package to download the dataset from behind the company's proxy?

  • Reference: https://huggingface.co/datasets/moyix/debian_csrc

### How to load this dataset directly with the [datasets](https://github.com/huggingface/datasets) library

from datasets import load_dataset
dataset = load_dataset("moyix/debian_csrc")

### Or just clone the dataset repo

git lfs install
git clone https://huggingface.co/datasets/moyix/debian_csrc
# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

leemgs avatar Nov 11 '22 05:11 leemgs

You can use requests to see if downloading a file from the Hugging Face Hub works. If so, then datasets should work as well. If not, then you have to find another way to get a working internet connection.

lhoestq avatar Nov 12 '22 15:11 lhoestq

I resolved this issue by requesting that https://huggingface.co be unblocked in our corporate network environment with a firewall.

leemgs avatar Aug 02 '23 05:08 leemgs

Hi! It looks like an issue with your Python environment. Can you make sure you're able to run GET requests to https://huggingface.co using requests in Python?

Yes, but it still doesn't work.

(screenshots attached)

lonngxiang avatar Nov 17 '23 12:11 lonngxiang

I read https://github.com/huggingface/datasets/blob/main/src/datasets/load.py: it fails when getting the dataset metadata, so download_config is never applied.

            hf_api = HfApi(config.HF_ENDPOINT)
            try:
                dataset_info = hf_api.dataset_info(
                    repo_id=path,
                    revision=revision,
                    token=download_config.token,
                    timeout=100.0,
                )
            except Exception as e:  # noqa catch any exception of hf_hub and consider that the dataset doesn't exist
                if isinstance(
                    e,
                    (
                        OfflineModeIsEnabled,
                        requests.exceptions.ConnectTimeout,
                        requests.exceptions.ConnectionError,
                    ),
                ):
                    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")

I configured the huggingface_hub API with configure_http_backend:

import requests
from huggingface_hub import configure_http_backend

def backend_factory() -> requests.Session:
    session = requests.Session()
    session.proxies = proxy  # your proxy mapping, e.g. {"http": ..., "https": ...}
    session.verify = False   # skip TLS verification for the intercepting proxy
    return session

configure_http_backend(backend_factory=backend_factory)

It works.

kuikuikuizzZ avatar Dec 08 '23 05:12 kuikuikuizzZ

Even though it does not look like a certificate error in the error message, I had the same error, and adding the following lines to my code solved my problem.

import os
os.environ['CURL_CA_BUNDLE'] = ''

DataScientistTX avatar Jan 25 '24 19:01 DataScientistTX
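For context on this workaround (a sketch, not part of the original comment): requests honors the CURL_CA_BUNDLE / REQUESTS_CA_BUNDLE environment variables, and an empty value effectively disables CA-bundle verification, which sidesteps a TLS-intercepting corporate proxy. The variables must be set before datasets makes any request; pointing them at your company's CA bundle is the safer alternative:

```python
import os

# An empty string disables certificate verification in requests-based
# downloads; set these BEFORE importing/using `datasets`.
# Safer alternative: point them at your corporate CA bundle file, e.g.
#   os.environ["REQUESTS_CA_BUNDLE"] = "/etc/ssl/certs/corp-ca.pem"  # hypothetical path
os.environ["CURL_CA_BUNDLE"] = ""
os.environ["REQUESTS_CA_BUNDLE"] = ""
```

Note that this disables TLS verification entirely, so it should be treated as a diagnostic step rather than a permanent fix.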

@kuikuikuizzZ Could you please explain where the configuration code is added?

NoviceStone avatar Feb 28 '24 02:02 NoviceStone

Even though it does not look like a certificate error in the error message, I had the same error, and adding the following lines to my code solved my problem.

import os
os.environ['CURL_CA_BUNDLE'] = ''

Worked for me as well! I faced the issue while submitting jobs through SLURM.

mahdibaghbanzadeh avatar Mar 20 '24 21:03 mahdibaghbanzadeh

Even though it does not look like a certificate error in the error message, I had the same error, and adding the following lines to my code solved my problem.

import os
os.environ['CURL_CA_BUNDLE'] = ''

It doesn't work for me. What does this code mean?

Joeland4 avatar May 12 '24 01:05 Joeland4

If you're working on a cluster, it may be that remote connections are disabled for security purposes. You will have to download the files on your local machine and then transfer them to the cluster through scp or some other transfer protocol. I know you've probably resolved the issue, but this is for anyone in the future who might stumble across this thread and needs help, because I struggled with this even after reading the thread.

marcv12 avatar Jul 14 '24 16:07 marcv12

Even though it does not look like a certificate error in the error message, I had the same error, and adding the following lines to my code solved my problem.

import os
os.environ['CURL_CA_BUNDLE'] = ''

If that doesn't work, try this:

export http_proxy="http://127.0.0.1:10810"
export https_proxy="http://127.0.0.1:10810"
git config --global http.proxy http://127.0.0.1:10810
git config --global https.proxy http://127.0.0.1:10810

jupyter notebook

Set the proxy environment variables first, then start the notebook in the same session.

shafferjohn avatar Jul 16 '24 14:07 shafferjohn
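One way to confirm the exported variables actually reach the notebook kernel (a sketch using only the standard library; the proxy address is the example value from the comment above):

```python
import urllib.request

# urllib (like requests) reads http_proxy/https_proxy from the environment;
# an empty dict here means the kernel never saw the exports above.
proxies = urllib.request.getproxies()
print(proxies)
```

If the dict is empty, the notebook was started in a session where the variables were not exported.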

If you're working on a cluster, it may be that remote connections are disabled for security purposes. You will have to download the files on your local machine and then transfer them to the cluster through scp or some other transfer protocol. I know you've probably resolved the issue, but this is for anyone in the future who might stumble across this thread and needs help, because I struggled with this even after reading the thread.

Thank you buddy!

Joeland4 avatar Jul 19 '24 01:07 Joeland4