
Librispeech download fails

Open albertz opened this issue 3 years ago • 1 comments

I'm not sure if this counts as a bug or feature request.

I'm trying:

tfds.load("librispeech")

It fails with:

ConnectionError: HTTPConnectionPool(host='www.openslr.org', port=80): Max retries exceeded with url: /resources/12/dev-other.tar.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdb6d122640>: Failed to establish a new connection: [Errno 111] Connection refused'))

Setting max_retries=10 or so on the HTTPAdapter of requests might solve the issue.
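For reference, a minimal sketch of what raising the retry count looks like with plain requests. The helper name make_retrying_session and the backoff value are illustrative assumptions, and whether tfds offers a hook for passing in a custom session is not established here.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total_retries: int = 10,
                          backoff_factor: float = 1.0) -> requests.Session:
    """Return a Session whose adapters retry failed connection attempts.

    Hypothetical helper for illustration; not part of tfds.
    """
    retry = Retry(
        total=total_retries,            # overall cap on retries per request
        connect=total_retries,          # retries for connection errors (e.g. Errno 111)
        backoff_factor=backoff_factor,  # wait longer between successive attempts
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Mount the adapter for both schemes so every request made through
    # this session uses the retry policy.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = make_retrying_session()
```

Note that retries alone may not help if the server is refusing connections because too many are open at once.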

The default max_simultaneous_downloads=50 in the _Downloader might also be problematic here. I'm not sure openslr.org allows that many simultaneous connections; if it doesn't, the download will always fail, which would make this a bug.

The openslr.org homepage says:

If you want to download things from this site, please download them one at a time, and please don't use any fancy software-- just download things from your browser or use 'wget'. We have a firewall rule to drop connections from hosts with more than 5 simultaneous connections, and certain types of download software may activate this rule.

So max_simultaneous_downloads should be 1 here, since the site asks for one download at a time. I think this is a bug then.
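To make concrete what a cap on simultaneous downloads means, here is a small standard-library sketch. It is not how tfds implements its downloader; run_capped, fake_fetch, and the example.invalid URLs are all made up for illustration, and the stand-in fetch function never touches the network.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_capped(urls, fetch, max_simultaneous=1):
    """Call fetch(url) for every URL with at most max_simultaneous in flight."""
    with ThreadPoolExecutor(max_workers=max_simultaneous) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(fetch, urls))


# Stand-in for a real download: records the peak number of concurrent calls.
_lock = threading.Lock()
_active = 0
peak = 0


def fake_fetch(url):
    global _active, peak
    with _lock:
        _active += 1
        peak = max(peak, _active)
    time.sleep(0.01)  # pretend to transfer data
    with _lock:
        _active -= 1
    return url


# Hypothetical URLs; example.invalid is reserved and never resolves.
urls = [f"http://example.invalid/archive-{i}.tar.gz" for i in range(5)]
results = run_capped(urls, fake_fetch, max_simultaneous=1)
```

With max_simultaneous=1 the pool has a single worker, so the archives are fetched strictly one after another, which is what the openslr.org notice asks for.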

albertz, Apr 15 '22 23:04

Note, I'm now using this extremely ugly hacky monkey patch:

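import tensorflow_datasets as tfds

# Keep a reference to the original factory so the wrapper can delegate to it.
orig_get_downloader = tfds.download.downloader.get_downloader

def _patched_get_downloader(*args, **kwargs):
    # Default to one download at a time unless the caller passes its own value.
    kwargs.setdefault("max_simultaneous_downloads", 1)
    return orig_get_downloader(*args, **kwargs)

tfds.download.downloader.get_downloader = _patched_get_downloader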

This seems to work, but obviously it is not a real solution.

albertz, Apr 15 '22 23:04

@albertz hi! I hope this commit: https://github.com/tensorflow/datasets/commit/bacae741bbbaed39ef2e5fca421aee29727ada59 has fixed your issue.

The default max_simultaneous_downloads for the librispeech dataset is now 5, but you can also override it with tfds.load("librispeech", download_and_prepare_kwargs={'download_config': tfds.download.DownloadConfig(override_max_simultaneous_downloads=...)})

fineguy, Nov 22 '22 13:11