Checksum Failure when Downloading Datasets from Seqio Mixture
Short description
In a Google Research repository, several users are hitting checksum errors when downloading the TFDS datasets they need (e.g. see the bottom of this thread) because the checksum has changed. It isn't clear to me how this is intended to be avoided.
Environment information
- Operating System: Mac OS
- Python version: 3.7
- tensorflow-datasets/tfds-nightly version: 4.4.0.dev202108200109
- tensorflow/tf-nightly version: 2.6.0
- Does the issue still exist with the last tfds-nightly package (pip install --upgrade tfds-nightly)? Yes
Reproduction instructions
Clone this Google repository. Then, in flan/v2/run_example.py, uncomment line 86.
pip install --upgrade pip
pip install -r flan/v2/requirements.txt
PYTHONPATH=. python flan/v2/run_example.py
Link to logs
ERROR:absl:Failed to load task 'aeslc_template_0to10_no_opt_x_shot' as part of mixture 'flan2022_submix'
Traceback (most recent call last):
  File "/home/henry/flan/FLAN/flan/v2/run_example.py", line 93, in <module>
    dataset = selected_mixture.get_dataset(
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/dataset_providers.py", line 1730, in get_dataset
    ds = task.get_dataset(
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/dataset_providers.py", line 1375, in get_dataset
    ds = source.get_dataset(
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/experimental.py", line 369, in get_dataset
    train_ds = _get_maybe_sharded_dataset(
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/experimental.py", line 329, in _get_maybe_sharded_dataset
    num_shards = len(self._original_source.list_shards(split_))
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/dataset_providers.py", line 510, in list_shards
    return [_get_filename(info) for info in self.tfds_dataset.files(split)]
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/utils.py", line 159, in files
    split_info = self.builder.info.splits[split]
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/tensorflow_datasets/core/splits.py", line 391, in __getitem__
    raise KeyError(
KeyError: "Trying to access `splits['train']` but `splits` is empty. This likely indicate the dataset has not been generated yet."
Expected behavior
The TFDS dataset should download successfully.
The SeqIO get_dataset function in the TFDS data source calls tfds.load, which makes sure the dataset is downloaded and prepared. However, _get_maybe_sharded_dataset in seqio/experimental.py lists the shards before calling get_dataset, and listing shards requires the dataset to already be downloaded. I'll fix this in SeqIO.
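To illustrate the ordering problem described above, here is a minimal sketch of the "prepare before listing shards" idea. It is not the actual SeqIO fix. StubBuilder and list_shards_safely are hypothetical names used only for illustration; the stub mimics the two pieces of a TFDS builder that matter here (info.splits and download_and_prepare()). With real TFDS, one would presumably pass the object returned by tfds.builder(name) instead, assuming its info.splits is populated once preparation finishes.

```python
class _StubInfo:
    """Hypothetical stand-in for a builder's info object."""

    def __init__(self):
        # Empty until the dataset has been generated, as in the KeyError above.
        self.splits = {}


class StubBuilder:
    """Hypothetical stand-in for a TFDS builder, for illustration only."""

    def __init__(self):
        self.info = _StubInfo()
        self.prepare_calls = 0

    def download_and_prepare(self):
        # A real builder would download and generate the dataset here.
        self.prepare_calls += 1
        self.info.splits = {"train": ["shard-0", "shard-1"]}


def list_shards_safely(builder, split):
    """Prepare the dataset if the split is missing, then return its entry."""
    if split not in builder.info.splits:
        builder.download_and_prepare()
    return builder.info.splits[split]


builder = StubBuilder()
shards = list_shards_safely(builder, "train")
print(len(shards))
```

As a possible workaround until a fix lands, calling download_and_prepare() on the relevant TFDS builders before requesting the mixture might avoid the empty-splits KeyError, though this has not been verified against the FLAN example.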
Hi @tomvdw, could you let us know an estimated time for this fix?
I have submitted a fix in TFDS. Could you retry with tfds-nightly? If it's working, we'll release a new version of TFDS.
It seems that using tfds-nightly 4.8.3.dev202303300044 gives the same error.