Connection error partway through downloading metadata
Hello! I'm running the command:
python download_upstream.py --scale medium --data_dir medium --skip_shards
After downloading some files, it fails with the following error:
File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 94, in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 76, in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))
File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
yield fs.pop().result()
File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/_snapshot_download.py", line 211, in _inner_hf_hub_download
return hf_hub_download(
File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1291, in hf_hub_download
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.
As you can see, there is not much detail in the error message. Could this be caused by some files missing on the server, or is it just a connection problem? If the latter, how can I resume the download? The --overwrite_metadata flag doesn't seem suitable because it removes all the already-downloaded files.
I've reduced the download code to the following:
from huggingface_hub import snapshot_download
snapshot_download(repo_id='mlfoundations/datacomp_medium', allow_patterns='*.parquet',
                  local_dir='medium/metadata', cache_dir='medium/hf', local_dir_use_symlinks=False,
                  repo_type='dataset', resume_download=True)
Even with resume_download=True, it keeps re-downloading the same files every time after an error.
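For now I'm working around it with a crude retry loop around snapshot_download (rough sketch below; the max_retries/wait values are arbitrary, and whether already-complete files are actually skipped on the next attempt seems to depend on the huggingface_hub version, so treat it as a band-aid rather than a real fix):

import time

from huggingface_hub import snapshot_download
from huggingface_hub.utils import LocalEntryNotFoundError

def download_with_retries(max_retries=50, wait_seconds=30):
    for attempt in range(max_retries):
        try:
            return snapshot_download(
                repo_id='mlfoundations/datacomp_medium',
                allow_patterns='*.parquet',
                local_dir='medium/metadata',
                cache_dir='medium/hf',
                local_dir_use_symlinks=False,
                repo_type='dataset',
                resume_download=True,
            )
        except LocalEntryNotFoundError:
            # connection dropped mid-download; wait a bit and call snapshot_download again
            print(f"attempt {attempt + 1} failed, retrying in {wait_seconds}s")
            time.sleep(wait_seconds)
    raise RuntimeError('download did not finish after all retries')

download_with_retries()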
same here, any solution?
A temporary workaround is to capture the URLs that would be downloaded and then download them manually.
Change download_upstream.py
# add at the beginning (and make sure tqdm is imported)
from tqdm import tqdm

class QuietTqdm(tqdm):
    def __init__(self, *a, **kw):
        # force-disable the progress bar so only the printed URLs end up in stdout
        kw["disable"] = True
        super().__init__(*a, **kw)
# change hf_snapshot_args: add max_workers=1 and tqdm_class=QuietTqdm
hf_snapshot_args = dict(
repo_id=hf_repo,
allow_patterns=f"*.parquet",
local_dir=metadata_dir,
cache_dir=cache_dir,
local_dir_use_symlinks=False,
repo_type="dataset",
max_workers=1,
tqdm_class=QuietTqdm,
)
# delete this line: print(f"Downloading metadata to {metadata_dir}...")
Then find and edit the file site-packages/huggingface_hub/file_download.py.
Go to line 1245 and add a print and an early return right after it:
# find this line (1245)
url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)
# add these two lines so hf_hub_download prints the URL instead of downloading the file
print(url)
return "none"
Finally, call the downloader:
HF_HUB_DISABLE_PROGRESS_BARS=1 python download_upstream.py --scale xlarge --data_dir data/datacomp --skip_shards > urls.txt
This gives you a list of ~24K URLs to download manually. Now you just need a download utility that can batch-download URLs, and you have the metadata (a rough sketch follows below).
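In case it helps, here is a rough sketch of such a utility (nothing DataComp-specific; the output directory, the four workers, and the "http" filter on urls.txt lines are my own choices): it reads urls.txt, skips files that already exist, and streams the rest to disk in parallel with requests.

import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

import requests

OUT_DIR = 'medium/metadata'  # adjust to wherever you want the parquet files

def download_one(url):
    # name the local file after the last component of the URL path
    filename = os.path.basename(urlparse(url).path)
    out_path = os.path.join(OUT_DIR, filename)
    if os.path.exists(out_path):
        return out_path  # already downloaded, skip
    tmp_path = out_path + '.part'
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(tmp_path, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    os.rename(tmp_path, out_path)  # only keep fully downloaded files
    return out_path

if __name__ == '__main__':
    os.makedirs(OUT_DIR, exist_ok=True)
    with open('urls.txt') as f:
        # urls.txt may also contain stray script output, so keep only the URL lines
        urls = [line.strip() for line in f if line.strip().startswith('http')]
    with ThreadPoolExecutor(max_workers=4) as ex:
        for path in ex.map(download_one, urls):
            print(path)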