
`keras.utils.get_file` does not support gzip as advertised

Open jbischof opened this issue 1 year ago • 9 comments

System information.

  • Have I written custom code (as opposed to using a stock example script provided in Keras): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): glinux 5.17.11-1rodete2-amd64
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.9.1
  • Python version: 3.10.0
  • Bazel version (if compiling from source): n/a
  • GPU model and memory: n/a
  • Exact command to reproduce:
from tensorflow import keras

gzip_path = "https://storage.googleapis.com/tf_model_garden/nlp/bert/v3/uncased_L-12_H-768_A-12.tar.gz"
# After the call, `/content/bert_base_uncased` is still a `tar.gz` file
ungzip_file = keras.utils.get_file(
    "/content/bert_base_uncased",
    gzip_path,
    extract=True,
    archive_format="tar",  # bug occurs whether or not this argument is specified
)

Describe the problem.

The get_file documentation claims to support gzip in the archive_format argument docstring (see https://www.tensorflow.org/api_docs/python/tf/keras/utils/get_file). However, I have tried several tar.gz files, including the example above, and none of them are extracted.

Describe the current behavior. tar.gz files are downloaded but not extracted.

Describe the expected behavior. tar.gz files are downloaded and extracted. bert_base_uncased should be a folder with the following files:

tmp/temp_dir/raw/
tmp/temp_dir/raw/vocab.txt
tmp/temp_dir/raw/bert_model.ckpt.index
tmp/temp_dir/raw/bert_model.ckpt.data-00000-of-00001
tmp/temp_dir/raw/bert_config.json

Contributing.

  • Do you want to contribute a PR? (yes/no): No

Standalone code to reproduce the issue. Please see https://colab.research.google.com/drive/1OcIuIcii7CFhNudp9rIvNWNqU-VZg9SI?usp=sharing

Source code / logs. n/a see colab

jbischof avatar Aug 25 '22 17:08 jbischof

@gadagashwini, I was able to reproduce the issue on tensorflow v2.8, v2.9 and nightly. Kindly find the gist of it here.

tilakrayal avatar Aug 26 '22 09:08 tilakrayal

I'm experiencing a similar issue with the Fashion MNIST data:

import tensorflow as tf

train_images_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz'
train_labels_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz'
test_images_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz'
test_labels_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz'

CACHE_DIR = '/content'
CACHE_SUBDIR = 'fashion_mnist'
for url in [train_images_url, train_labels_url, test_images_url, test_labels_url]:
    tf.keras.utils.get_file(url.split('/')[-1], url, extract=True, cache_dir=CACHE_DIR, cache_subdir=CACHE_SUBDIR, archive_format='zip')

The above code will download the .gz files to /content/fashion_mnist on Google Colab, but not extract them.

jasonbrancazio avatar Aug 31 '22 19:08 jasonbrancazio

I'm pretty sure the plan was to make this open for contributions. Is that right @jbischof ?

mattdangerw avatar Sep 08 '22 04:09 mattdangerw

I'd like to contribute, will open a PR soon. @mattdangerw Is it OK?

Pouyanpi avatar Sep 08 '22 11:09 Pouyanpi

@Pouyanpi sounds good! @jbischof jump in if there are any more details to know before starting on this.

mattdangerw avatar Sep 08 '22 17:09 mattdangerw

I haven't tested if extraction is working for other formats like .zip or .tar.bz. This is probably worth looking into.

jbischof avatar Sep 08 '22 17:09 jbischof

@jasonbrancazio, regarding the Fashion MNIST example above (the .gz files are downloaded to /content/fashion_mnist but not extracted): this seems to be a different issue. The .gz format is neither a tarfile nor a zipfile. The method can extract a tar.gz or tgz but not a plain gz, so this case is not supported, per both the documentation and the design.

Please see ref.

Just to confirm:

>>> from pathlib import Path
>>> import tarfile
>>> import zipfile
>>> from urllib.request import urlretrieve
>>>
>>> url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz'
>>> fpath = 'downloaded.gz'

>>> urlretrieve(url, fpath)
>>> Path(fpath).is_file()
True
>>> tarfile.is_tarfile(fpath)
False
>>> zipfile.is_zipfile(fpath)
False
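
For a plain .gz file like this one, a workaround outside of get_file is to decompress it with the standard library. The following is only a minimal sketch, not part of Keras: the helper names `detect_format` and `gunzip` are hypothetical, and the gzip check simply looks at the two-byte gzip magic number after ruling out tar and zip.

```python
import gzip
import shutil
import tarfile
import zipfile
from pathlib import Path

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def detect_format(path):
    """Return 'tar', 'zip', 'gzip', or None for the file at `path`.

    Hypothetical helper: tar is checked first because a .tar.gz file
    also starts with the gzip magic bytes.
    """
    if tarfile.is_tarfile(path):
        return "tar"
    if zipfile.is_zipfile(path):
        return "zip"
    with open(path, "rb") as f:
        if f.read(2) == GZIP_MAGIC:
            return "gzip"
    return None


def gunzip(src, dst):
    """Decompress a single-file gzip archive `src` into the file `dst`."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return Path(dst)
```

With the file downloaded above, `detect_format(fpath)` would be expected to report a plain gzip file, and `gunzip(fpath, 'train-labels-idx1-ubyte')` would write the decompressed data next to it.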

Pouyanpi avatar Sep 09 '22 09:09 Pouyanpi

According to the documentation

   By default the file at the url `origin` is downloaded to the
   cache_dir `~/.keras`, placed in the cache_subdir `datasets`,
   and given the filename `fname`. The final location of a file
   `example.txt` would therefore be `~/.keras/datasets/example.txt`.

The issue does not seem to be the extraction itself (the _extract_archive method) or the extract and archive_format arguments: the extracted files can be found in datadir, which is cache_dir / cache_subdir.

However, the implementation does not conform to the fname description: If an absolute path `/path/to/file.txt` is specified the file will be saved at that location.

Is this the intended usage?

If so, I can write a fix. This was clearly the case in the Colab notebook, and it is consistent with the code. Am I missing something?

Thanks.

p.s.

there are 2 cases:

  • if fname is a file name, the downloaded file is stored at datadir
  • if fname is an absolute path to a file, the downloaded file is stored at fname

However in both cases, the files are extracted at datadir.
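
The two cases can be modelled with a few lines of path logic. This is a simplified sketch of the behavior described in this comment, not the Keras source; the function name `resolve_paths` and its defaults are assumptions made for illustration.

```python
import os


def resolve_paths(fname, cache_dir="~/.keras", cache_subdir="datasets"):
    """Model where the download lands and where extraction happens.

    Returns a (download_path, extract_dir) tuple. Hypothetical helper
    that mirrors the two cases listed above.
    """
    datadir = os.path.join(os.path.expanduser(cache_dir), cache_subdir)
    if os.path.isabs(fname):
        # Case 2: an absolute fname is used as the download location as-is.
        download_path = fname
    else:
        # Case 1: a bare file name is placed inside datadir.
        download_path = os.path.join(datadir, fname)
    # In both cases the archive contents end up in datadir.
    return download_path, datadir
```

Under this model, calling `resolve_paths("/content/bert_base_uncased")` gives a download path under /content but an extraction directory under the cache, which is the mismatch discussed here.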

Pouyanpi avatar Sep 09 '22 09:09 Pouyanpi

but the implementation does not conform to fname description If an absolute path `/path/to/file.txt` is specified the file will be saved at that location.

My mistake, it actually does: the file is downloaded and stored at this location. However, it would also be possible to return the path to the extracted files, i.e., datadir.

Pouyanpi avatar Sep 09 '22 10:09 Pouyanpi

there are 2 cases:

  • if fname is a file name, the downloaded file is stored at datadir
  • if fname is an absolute path to a file, the downloaded file is stored at fname

However in both cases, the files are extracted at datadir.

@mattdangerw, could you confirm this and comment on how to resolve it? Currently, the file is extracted to a different location than where it is stored.

Pouyanpi avatar Sep 22 '22 07:09 Pouyanpi

I think that ideally the method should return a path to the extracted file, @Pouyanpi. In my Colab we see that gzip_file points to the downloaded file /content/bert_base_uncased.tar.gz.

However, if extraction is not an issue, why does the utility create a copy of the gzip file in the same directory? I do not see any extracted file. I am offering a path for the fname arg.
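
Until the return value changes, one way to get a path to the extracted contents is to extract the downloaded archive explicitly. The sketch below uses only the standard library; `extract_tar_gz` is a hypothetical helper, not a Keras API, and it assumes the archive is a gzip-compressed tarball.

```python
import tarfile
from pathlib import Path


def extract_tar_gz(archive_path, dest_dir):
    """Extract a .tar.gz archive into `dest_dir` and return that directory.

    Hypothetical helper: the caller chooses the destination, so the
    returned path always points at the extracted contents.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions provide extraction filters; "data"
            # rejects unsafe members such as absolute paths.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return dest
```

For example, the path returned by get_file could be passed as `archive_path`, with `dest_dir` set to the folder where files such as vocab.txt are expected.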

jbischof avatar Sep 28 '22 21:09 jbischof

@Pouyanpi are you still interested in working on this bug? Please let us know and thanks for your contributions!

jbischof avatar Oct 14 '22 22:10 jbischof

Duplicate of #17177, which I opened to allow for a new contributor.

jbischof avatar Oct 21 '22 22:10 jbischof
