`keras.utils.get_file` does not support gzip as advertised
System information.
- Have I written custom code (as opposed to using a stock example script provided in Keras): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): glinux 5.17.11-1rodete2-amd64
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.9.1
- Python version: 3.10.0
- Bazel version (if compiling from source): n/a
- GPU model and memory: n/a
- Exact command to reproduce:
gzip_path = "https://storage.googleapis.com/tf_model_garden/nlp/bert/v3/uncased_L-12_H-768_A-12.tar.gz"
# `/content/bert_base_uncased` is still a `tar.gz` file
ungzip_file = keras.utils.get_file(
"/content/bert_base_uncased",
gzip_path,
extract=True,
archive_format="tar", # bug occurs whether this arg is specified
)
Describe the problem.
get_file documentation claims to support gzip in the archive_format argument docstring (see https://www.tensorflow.org/api_docs/python/tf/keras/utils/get_file). However, I have tried several tar.gz files like the example above and they are not extracted.
Describe the current behavior.
tar.gz files are downloaded but not extracted.
Describe the expected behavior.
tar.gz files are downloaded and extracted. bert_base_uncased should be a folder with the following files:
tmp/temp_dir/raw/
tmp/temp_dir/raw/vocab.txt
tmp/temp_dir/raw/bert_model.ckpt.index
tmp/temp_dir/raw/bert_model.ckpt.data-00000-of-00001
tmp/temp_dir/raw/bert_config.json
- Do you want to contribute a PR? (yes/no): No
Standalone code to reproduce the issue. Please see https://colab.research.google.com/drive/1OcIuIcii7CFhNudp9rIvNWNqU-VZg9SI?usp=sharing
Source code / logs. n/a see colab
Note: This is a new version of keras-team/tf-keras#465, as we will need a new contributor.
The files are extracted correctly but please pay attention to the output dir:
https://github.com/keras-team/keras/blob/8c401c032b3021f89609eac79bd1c881b9bbc84f/keras/utils/data_utils.py#L169-L172
Thanks for the repsonse, @bhack and @divyashreepathihalli! I now see that the extracted file is stored at cache_dir even if I specify an absolute path for fname. I think this is pretty confusing:
- The tarball is placed in the absolute directory, but the extracted files are not.
- I am expecting the returned value to point to the extracted files I want but get one to the tarball instead.
- For raw files (
extract=False) I get the returned value I want in the absolute directory I specified.
It is possible to use the interface as is by overriding cache_dir to the absolute directory and cobbling together the extracted path myself, but it seems like a consistent experience between raw and zipped files would be better. However if this is not the consensus I understand.