Librispeech download fails
I'm not sure if this counts as a bug or feature request.
I'm trying:
tfds.load("librispeech")
It fails with:
ConnectionError: HTTPConnectionPool(host='www.openslr.org', port=80): Max retries exceeded with url: /resources/12/dev-other.tar.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdb6d122640>: Failed to establish a new connection: [Errno 111] Connection refused'))
Setting max_retries=10 or so on the HTTPAdapter of requests might solve the issue.
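Just to illustrate what I mean (this is plain requests, not the actual TFDS downloader internals, so where exactly the session would be set up inside TFDS is my assumption):

import requests
from requests.adapters import HTTPAdapter

# Retry failed connection attempts up to 10 times instead of giving up immediately.
session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=10))
session.mount("https://", HTTPAdapter(max_retries=10))

resp = session.get("http://www.openslr.org/resources/12/dev-other.tar.gz", stream=True)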
The default max_simultaneous_downloads=50 in the _Downloader might also be problematic here. I'm not sure that openslr.org allows that many simultaneous connections; if it doesn't, the download would always fail, which would make this a bug.
On the homepage, it says:
If you want to download things from this site, please download them one at a time, and please don't use any fancy software-- just download things from your browser or use 'wget'. We have a firewall rule to drop connections from hosts with more than 5 simultaneous connections, and certain types of download software may activate this rule.
So it should really be max_simultaneous_downloads=1 here. I think this is a bug then.
Note, for now I'm using this extremely ugly, hacky monkey patch:
import tensorflow_datasets as tfds

# Wrap the downloader factory so it defaults to a single simultaneous download.
orig_get_downloader = tfds.download.downloader.get_downloader

def _patched_get_downloader(*args, **kwargs):
    kwargs.setdefault("max_simultaneous_downloads", 1)
    return orig_get_downloader(*args, **kwargs)

tfds.download.downloader.get_downloader = _patched_get_downloader
This seems to work. But obviously this is not really a solution.
@albertz hi! I hope this commit: https://github.com/tensorflow/datasets/commit/bacae741bbbaed39ef2e5fca421aee29727ada59 has fixed your issue.
The default max_simultaneous_downloads for the librispeech dataset is now 5, but you can also override it with tfds.load("librispeech", download_and_prepare_kwargs={'download_config': tfds.download.DownloadConfig(override_max_simultaneous_downloads=...)}).
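For example (override_max_simultaneous_downloads=1 is chosen here just as an illustration):

import tensorflow_datasets as tfds

# Limit librispeech downloads to one simultaneous connection.
ds = tfds.load(
    "librispeech",
    download_and_prepare_kwargs={
        "download_config": tfds.download.DownloadConfig(
            override_max_simultaneous_downloads=1,
        ),
    },
)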