Checksum Failure when Downloading Datasets from Seqio Mixture

Open shayne-longpre opened this issue 2 years ago • 4 comments

Short description

In a Google Research repository, several users are hitting checksum errors for the TFDS datasets they need to download (e.g. see the bottom of this thread) because the checksum has changed. It isn't clear to me how this is intended to be avoided.

Environment information

  • Operating System: Mac OS

  • Python version: 3.7

  • tensorflow-datasets/tfds-nightly version: 4.4.0.dev202108200109

  • tensorflow/tf-nightly version: 2.6.0

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?

  • Yes

Reproduction instructions

Clone this Google repository. Then, in flan/v2/run_example.py, uncomment line 86.

pip install --upgrade pip
pip install -r flan/v2/requirements.txt
PYTHONPATH=. python flan/v2/run_example.py

If you share a colab, make sure to update the permissions to share it.

Link to logs

ERROR:absl:Failed to load task 'aeslc_template_0to10_no_opt_x_shot' as part of mixture 'flan2022_submix'
Traceback (most recent call last):
  File "/home/henry/flan/FLAN/flan/v2/run_example.py", line 93, in <module>
    dataset = selected_mixture.get_dataset(
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/dataset_providers.py", line 1730, in get_dataset
    ds = task.get_dataset(
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/dataset_providers.py", line 1375, in get_dataset
    ds = source.get_dataset(
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/experimental.py", line 369, in get_dataset
    train_ds = _get_maybe_sharded_dataset(
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/experimental.py", line 329, in _get_maybe_sharded_dataset
    num_shards = len(self._original_source.list_shards(split_))
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/dataset_providers.py", line 510, in list_shards
    return [_get_filename(info) for info in self.tfds_dataset.files(split)]
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/seqio/utils.py", line 159, in files
    split_info = self.builder.info.splits[split]
  File "/home/henry/anaconda3/envs/palm/lib/python3.9/site-packages/tensorflow_datasets/core/splits.py", line 391, in __getitem__
    raise KeyError(
KeyError: "Trying to access `splits['train']` but `splits` is empty. This likely indicate the dataset has not been generated yet."

Expected behavior

The TFDS dataset downloads successfully.

shayne-longpre, Mar 17 '23 17:03

The SeqIO get_dataset function in the TFDS data source calls tfds.load, which makes sure the dataset is downloaded and prepared. However, _get_maybe_sharded_dataset in seqio/experimental.py lists the shards before calling get_dataset, and listing the shards requires the dataset to already be downloaded. I'll fix this in SeqIO.

tomvdw, Mar 22 '23 17:03
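
The ordering problem described in the comment above can be illustrated with a small toy model. This is only a sketch: `FakeBuilder`, `count_shards_unprepared`, and `count_shards_prepared` are hypothetical names invented for illustration and are not part of SeqIO or TFDS. It assumes nothing beyond what the comment states, namely that shards can only be listed once the dataset has been downloaded and prepared.

```python
class FakeBuilder:
    """Hypothetical stand-in for a dataset builder; not the real TFDS class."""

    def __init__(self):
        # No splits exist until the dataset has been prepared.
        self._splits = {}

    def download_and_prepare(self):
        # Pretend to download the data and write two training shards.
        self._splits = {
            "train": ["train-00000-of-00002", "train-00001-of-00002"],
        }

    def list_shards(self, split):
        if split not in self._splits:
            raise KeyError(
                f"Trying to access splits[{split!r}] but splits is empty."
            )
        return self._splits[split]


def count_shards_unprepared(builder, split):
    # Mirrors the reported ordering: shards are listed before anything
    # has triggered the download, so this fails on a fresh builder.
    return len(builder.list_shards(split))


def count_shards_prepared(builder, split):
    # Prepare the dataset first, then list its shards.
    builder.download_and_prepare()
    return len(builder.list_shards(split))
```

With the real library, the analogous step would presumably be calling `download_and_prepare()` on the builder returned by `tfds.builder(...)` before requesting the mixture. Whether that avoids the error in this particular setup is untested here.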

Hi @tomvdw, it would be great if you could give us an estimated time for this fix.

AadSah, Mar 23 '23 22:03

I have submitted a fix in TFDS. Could you retry with tfds-nightly? If it's working, we'll release a new version of TFDS.

tomvdw, Mar 27 '23 11:03

> I have submitted a fix in TFDS. Could you retry with tfds-nightly? If it's working, we'll release a new version of TFDS.

It seems that using tfds-nightly 4.8.3.dev202303300044 gives the same error.

cliang1453, Mar 30 '23 04:03