
Downloading to S3 with TFDS errors out because gfile doesn't create a virtual path on S3

Open · mathemakitten opened this issue 2 years ago · 1 comment

Short description The TFDS download manager uses the etils TensorFlow backend. On Google Cloud Storage, gfile.mkdir successfully creates virtual paths with no files within them, but on S3 it doesn't actually create a virtual directory unless there's an actual file to be written. This means that the check here fails even though gfile.makedirs(path) runs successfully, e.g.:

  File ".../lib/python3.8/site-packages/tensorflow_datasets/core/download/download_manager.py", line 39
8, in _download
    download_tmp_dir.mkdir()
  File ".../lib/python3.8/site-packages/etils/epath/gpath.py", line 180, in mkdir
    self._backend.mkdir(self._path_str, exist_ok=exist_ok)
  File ".../lib/python3.8/site-packages/etils/epath/backend.py", line 239, in mkdir
    raise FileExistsError(f'Cannot create dir. {path} is not a directory')
FileExistsError: Cannot create dir. s3://... is not a directory

Environment information

  • Operating System: Ubuntu 20.04.5 LTS

  • Python version: 3.8.10

  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets 4.8.2

  • tensorflow/tf-nightly version: tensorflow-cpu==2.11.0

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?

Reproduction instructions

You can verify that this is the issue by calling tf.io.gfile.mkdir(p) and then tf.io.gfile.isdir(p) for a new path on S3 and then for a new path on GCS: on GCS, isdir will return True, but on S3 it will return False.
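
To make this concrete, here's a minimal sketch of that check (my-bucket is a placeholder; running the S3 half requires the S3 filesystem plugin and valid credentials for the bucket):

    import tensorflow as tf

    for p in ('gs://my-bucket/brand-new-dir', 's3://my-bucket/brand-new-dir'):
        tf.io.gfile.mkdir(p)            # completes without error on both backends
        print(p, tf.io.gfile.isdir(p))  # GCS prints True; S3 prints False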

Expected behavior Unfortunately, since S3 doesn't create virtual paths when gfile.mkdir is called, checking with isdir will always fail; I'm not sure there's a nicer solution without handling the download to S3 differently or making the change in etils.
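
For illustration, one possible shape of an etils-side change (ensure_dir is a hypothetical helper, not actual etils code; the error message mirrors the one in the traceback above): since S3 has no directory objects, the backend could trust mkdir's success instead of verifying with isdir:

    import tensorflow as tf

    def ensure_dir(path: str) -> None:
        tf.io.gfile.makedirs(path)  # succeeds on both GCS and S3
        if path.startswith('s3://'):
            # No directory object exists on S3; the prefix only materializes
            # once a file is written under it, so skip the isdir verification.
            return
        if not tf.io.gfile.isdir(path):
            raise FileExistsError(f'Cannot create dir. {path} is not a directory')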

mathemakitten avatar Feb 12 '23 02:02 mathemakitten

Relatedly, the download manager will fail again when downloading to S3 for the same reason (the parent dir appears not to exist because it's an empty virtual dir), because of this line: https://github.com/tensorflow/datasets/blob/1ad5b6c8cfab11eee27031c2aac22e0beeefb0a7/tensorflow_datasets/core/download/download_manager.py#L518

I added if not path.parent.startswith('s3://'): for now to avoid dealing with this, but I'm sure there's a better way than special-casing S3.
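
Concretely, the workaround amounts to guarding the parent-existence check with something like this sketch (check_parent_exists is a hypothetical stand-in for the linked line, not the actual TFDS code; os.fspath is used because path.parent is a Path object rather than a string):

    import os

    def check_parent_exists(path) -> None:
        # S3 "directories" are virtual prefixes, so an empty parent
        # legitimately looks missing; skip the check there.
        if os.fspath(path.parent).startswith('s3://'):
            return
        if not path.parent.exists():
            raise FileNotFoundError(f'{path.parent} does not exist')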

mathemakitten avatar Feb 12 '23 03:02 mathemakitten