Connection error partway through downloading metadata
Hello! I'm running the command:
python download_upstream.py --scale medium --data_dir medium --skip_shards
After downloading some files, it fails with the following error:
File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 94, in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/contrib/concurrent.py", line 76, in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, **map_args), **kwargs))
File "/home/oleg/.local/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
yield fs.pop().result()
File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/home/oleg/miniconda3/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/_snapshot_download.py", line 211, in _inner_hf_hub_download
return hf_hub_download(
File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/oleg/.local/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1291, in hf_hub_download
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.
As you can see, there is not much detail in the error message. Could this be caused by some files missing on the server, or is it just a connection problem? If the latter, how can I resume the download? The --overwrite_metadata flag doesn't seem suitable because it removes all the already-downloaded files.
I've reduced the download code to the following:
from huggingface_hub import snapshot_download
snapshot_download(repo_id='mlfoundations/datacomp_medium', allow_patterns='*.parquet',
                  local_dir='medium/metadata', cache_dir='medium/hf', local_dir_use_symlinks=False,
                  repo_type='dataset', resume_download=True)
Even with resume_download=True, it keeps re-downloading the same files every time after an error.
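For now I'm working around it with a crude retry loop around snapshot_download (rough sketch below; the max_retries/wait values are arbitrary, and whether already-complete files are actually skipped on the next attempt seems to depend on the huggingface_hub version, so treat it as a band-aid rather than a real fix):

import time

from huggingface_hub import snapshot_download
from huggingface_hub.utils import LocalEntryNotFoundError

def download_with_retries(max_retries=50, wait_seconds=30):
    for attempt in range(max_retries):
        try:
            return snapshot_download(
                repo_id='mlfoundations/datacomp_medium',
                allow_patterns='*.parquet',
                local_dir='medium/metadata',
                cache_dir='medium/hf',
                local_dir_use_symlinks=False,
                repo_type='dataset',
                resume_download=True,
            )
        except LocalEntryNotFoundError:
            # connection dropped mid-download; wait a bit and call snapshot_download again
            print(f"attempt {attempt + 1} failed, retrying in {wait_seconds}s")
            time.sleep(wait_seconds)
    raise RuntimeError('download did not finish after all retries')

download_with_retries()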
same here, any solution?
A temporary workaround is to capture the URLs that would be downloaded and then download them manually.
Change download_upstream.py
# add at the beginning (and make sure tqdm is imported)
from tqdm import tqdm

class QuietTqdm(tqdm):
    def __init__(self, *a, **kw):
        # force-disable the progress bar so only the printed URLs end up in stdout
        kw["disable"] = True
        super().__init__(*a, **kw)
# change hf_snapshot_args: add max_workers=1 and tqdm_class=QuietTqdm
hf_snapshot_args = dict(
repo_id=hf_repo,
allow_patterns=f"*.parquet",
local_dir=metadata_dir,
cache_dir=cache_dir,
local_dir_use_symlinks=False,
repo_type="dataset",
max_workers=1,
tqdm_class=QuietTqdm,
)
# delete this line: print(f"Downloading metadata to {metadata_dir}...")
Then find and edit the file site-packages/huggingface_hub/file_download.py.
Go to line 1245 and add a print and an early return right after it:
# find this line (1245)
url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)
# add these two lines so hf_hub_download prints the URL instead of downloading the file
print(url)
return "none"
Finally, call the downloader:
HF_HUB_DISABLE_PROGRESS_BARS=1 python download_upstream.py --scale xlarge --data_dir data/datacomp --skip_shards > urls.txt
This gives you a list of ~24K URLs to download manually. Now you just need a download utility that can batch-download URLs, and you have the metadata (a rough sketch follows below).
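In case it helps, here is a rough sketch of such a utility (nothing DataComp-specific; the output directory, the four workers, and the "http" filter on urls.txt lines are my own choices): it reads urls.txt, skips files that already exist, and streams the rest to disk in parallel with requests.

import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

import requests

OUT_DIR = 'medium/metadata'  # adjust to wherever you want the parquet files

def download_one(url):
    # name the local file after the last component of the URL path
    filename = os.path.basename(urlparse(url).path)
    out_path = os.path.join(OUT_DIR, filename)
    if os.path.exists(out_path):
        return out_path  # already downloaded, skip
    tmp_path = out_path + '.part'
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(tmp_path, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    os.rename(tmp_path, out_path)  # only keep fully downloaded files
    return out_path

if __name__ == '__main__':
    os.makedirs(OUT_DIR, exist_ok=True)
    with open('urls.txt') as f:
        # urls.txt may also contain stray script output, so keep only the URL lines
        urls = [line.strip() for line in f if line.strip().startswith('http')]
    with ThreadPoolExecutor(max_workers=4) as ex:
        for path in ex.map(download_one, urls):
            print(path)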