Downloading to S3 with TFDS errors out because gfile doesn't create virtual path on S3
Short description
The TFDS download manager uses the etils Tensorflow backend, but unlike in Google Cloud Storage which successfully creates virtual paths with no files within them, gfile.mkdir doesn't actually create a virtual directory on S3 unless there's an actual file to be written. This means that the check here fails even though gfile.makedirs(path) runs successfully. e.g.,
File ".../lib/python3.8/site-packages/tensorflow_datasets/core/download/download_manager.py", line 39
8, in _download
download_tmp_dir.mkdir()
File ".../lib/python3.8/site-packages/etils/epath/gpath.py", line 180, in mkdir
self._backend.mkdir(self._path_str, exist_ok=exist_ok)
File ".../lib/python3.8/site-packages/etils/epath/backend.py", line 239, in mkdir
raise FileExistsError(f'Cannot create dir. {path} is not a directory')
FileExistsError: Cannot create dir. s3://... is not a directory
Environment information
-
Operating System: Ubuntu 20.04.5 LTS
-
Python version: 3.8.10
-
tensorflow-datasets/tfds-nightlyversion: tensorflow-datasets 4.8.2 -
tensorflow/tf-nightlyversion: tensorflow-cpu==2.11.0 -
Does the issue still exists with the last
tfds-nightlypackage (pip install --upgrade tfds-nightly) ?
Reproduction instructions
You can verify that this is the issue by doing tf.io.gfile.mkdir(p) and then tf.io.gfile.isdir(p) for a new path on S3 and then a new path on GCS. On GCS isdir will return True, but on S3 it will return False.
Expected behavior
Unfortunately since S3 doesn't seem to create virtual paths when gfile.mkdir is called, checking with isdir will always fail; not sure if there's a nicer solution without handling the download to S3 differently or making the change in etils.
Relatedly, the download manager will fail again when downloading to S3 for the same reason (parent dir doesn't exist because it's an empty dir) because of this line: https://github.com/tensorflow/datasets/blob/1ad5b6c8cfab11eee27031c2aac22e0beeefb0a7/tensorflow_datasets/core/download/download_manager.py#L518
I added if not path.parent.startswith('s3://'): for now to avoid dealing with this, but I'm sure there's a better way to avoid special-casing for S3.