
`keras.utils.get_file` does not support gzip as advertised

Open jbischof opened this issue 3 years ago • 3 comments

System information.

  • Have I written custom code (as opposed to using a stock example script provided in Keras): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): glinux 5.17.11-1rodete2-amd64
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.9.1
  • Python version: 3.10.0
  • Bazel version (if compiling from source): n/a
  • GPU model and memory: n/a
  • Exact command to reproduce:
from tensorflow import keras

gzip_path = "https://storage.googleapis.com/tf_model_garden/nlp/bert/v3/uncased_L-12_H-768_A-12.tar.gz"
ungzip_file = keras.utils.get_file(
    "/content/bert_base_uncased",
    gzip_path,
    extract=True,
    archive_format="tar",  # bug occurs whether or not this arg is specified
)
# After the call, `/content/bert_base_uncased` is still a `tar.gz` file

Describe the problem.

get_file documentation claims to support gzip in the archive_format argument docstring (see https://www.tensorflow.org/api_docs/python/tf/keras/utils/get_file). However, I have tried several tar.gz files like the example above and they are not extracted.

Describe the current behavior. tar.gz files are downloaded but not extracted.

Describe the expected behavior. tar.gz files are downloaded and extracted. bert_base_uncased should be a folder with the following files:

tmp/temp_dir/raw/
tmp/temp_dir/raw/vocab.txt
tmp/temp_dir/raw/bert_model.ckpt.index
tmp/temp_dir/raw/bert_model.ckpt.data-00000-of-00001
tmp/temp_dir/raw/bert_config.json

Contributing.

  • Do you want to contribute a PR? (yes/no): No

Standalone code to reproduce the issue. Please see https://colab.research.google.com/drive/1OcIuIcii7CFhNudp9rIvNWNqU-VZg9SI?usp=sharing

Source code / logs. n/a see colab

jbischof avatar Oct 21 '22 22:10 jbischof

Note: This is a new version of keras-team/tf-keras#465, as we will need a new contributor.

jbischof avatar Oct 21 '22 22:10 jbischof

The files are extracted correctly but please pay attention to the output dir:

https://github.com/keras-team/keras/blob/8c401c032b3021f89609eac79bd1c881b9bbc84f/keras/utils/data_utils.py#L169-L172

bhack avatar Oct 25 '22 15:10 bhack

Thanks for the response, @bhack and @divyashreepathihalli! I now see that the extracted files are stored in cache_dir even if I specify an absolute path for fname. I think this is pretty confusing:

  • The tarball is placed in the absolute directory, but the extracted files are not.
  • I expect the returned value to point to the extracted files, but it points to the tarball instead.
  • For raw files (extract=False) I get the returned value I want in the absolute directory I specified.
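The split between the two locations can be illustrated with a small sketch. It assumes, based on the lines linked above, that get_file joins the cache directory with fname to get the download path but extracts into the cache directory itself. `resolve_paths` is a hypothetical helper written for this illustration, not part of Keras, and the real function may differ in detail.

```python
import posixpath  # same module as os.path on Linux; used so the result is platform-independent


def resolve_paths(fname, cache_dir="~/.keras", cache_subdir="datasets"):
    """Hypothetical helper: approximate where get_file downloads and extracts.

    This mirrors the behavior described in this thread; it is an assumption,
    not a copy of the Keras implementation.
    """
    datadir = posixpath.join(posixpath.expanduser(cache_dir), cache_subdir)
    # join() discards earlier components when a later one is absolute,
    # so an absolute fname bypasses datadir for the download path only.
    download_path = posixpath.join(datadir, fname)
    # Extraction is assumed to target datadir regardless of fname.
    extract_dir = datadir
    return download_path, extract_dir


download_path, extract_dir = resolve_paths(
    "/content/bert_base_uncased", cache_dir="/home/user/.keras"
)
print(download_path)  # /content/bert_base_uncased
print(extract_dir)    # /home/user/.keras/datasets
```

With a relative fname both paths fall under the same directory, which is why the mismatch only shows up for absolute paths.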

It is possible to use the interface as is by overriding cache_dir to the absolute directory and cobbling together the extracted path myself, but a consistent experience between raw and archived files seems better. However, if this is not the consensus, I understand.
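For readers who want the extracted files in a directory of their choosing today, one option is to extract the archive separately with the standard library. This is a different route from the cache_dir override mentioned above, shown only as a minimal sketch. `extract_archive_to` is a hypothetical helper, and the paths in the usage comment are illustrative assumptions.

```python
import os
import tarfile


def extract_archive_to(archive_path, dest_dir):
    """Extract a tar archive (gzip-compressed or not) into dest_dir; return dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    # Mode "r:*" lets tarfile detect the compression transparently.
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(dest_dir)  # only use this on archives you trust
    return dest_dir


# Hypothetical usage together with get_file (not executed here):
#   archive = keras.utils.get_file("/content/bert_base_uncased.tar.gz", gzip_path)
#   model_dir = extract_archive_to(archive, "/content/bert_base_uncased")
```

The helper returns the extraction directory rather than the tarball path, which is the return value the comment above asks for.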

jbischof avatar Nov 04 '22 22:11 jbischof