`keras.utils.get_file` does not support gzip as advertised
System information.
- Have I written custom code (as opposed to using a stock example script provided in Keras): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): glinux 5.17.11-1rodete2-amd64
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.9.1
- Python version: 3.10.0
- Bazel version (if compiling from source): n/a
- GPU model and memory: n/a
- Exact command to reproduce:
gzip_path = "https://storage.googleapis.com/tf_model_garden/nlp/bert/v3/uncased_L-12_H-768_A-12.tar.gz"
# `/content/bert_base_uncased` is still a `tar.gz` file
ungzip_file = keras.utils.get_file(
    "/content/bert_base_uncased",
    gzip_path,
    extract=True,
    archive_format="tar",  # bug occurs whether or not this arg is specified
)
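For comparison, manually extracting the same kind of archive with Python's standard tarfile module works as expected. A minimal sketch (the helper name extract_targz is mine, not part of Keras):

```python
import tarfile

def extract_targz(archive_path, extract_dir):
    # Mode "r:gz" makes tarfile handle the gzip compression transparently,
    # which is what get_file's extract=True is documented to do for "tar".
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(extract_dir)
```

For example, after downloading the BERT archive above, extract_targz("/content/bert_base_uncased.tar.gz", "/content/bert_base_uncased") should produce the vocab and checkpoint files listed under expected behavior below.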
Describe the problem.
The get_file documentation claims to support gzip in the archive_format argument docstring (see https://www.tensorflow.org/api_docs/python/tf/keras/utils/get_file). However, I have tried several tar.gz files like the example above, and they are not extracted.
Describe the current behavior.
tar.gz files are downloaded but not extracted.
Describe the expected behavior.
tar.gz files are downloaded and extracted. bert_base_uncased should be a folder with the following files:
tmp/temp_dir/raw/
tmp/temp_dir/raw/vocab.txt
tmp/temp_dir/raw/bert_model.ckpt.index
tmp/temp_dir/raw/bert_model.ckpt.data-00000-of-00001
tmp/temp_dir/raw/bert_config.json
- Do you want to contribute a PR? (yes/no): No
Standalone code to reproduce the issue. Please see https://colab.research.google.com/drive/1OcIuIcii7CFhNudp9rIvNWNqU-VZg9SI?usp=sharing
Source code / logs. n/a see colab
@gadagashwini, I was able to reproduce the issue on tensorflow v2.8, v2.9 and nightly. Kindly find the gist of it here.
I'm experiencing a similar issue with the Fashion MNIST data:
train_images_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz'
train_labels_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz'
test_images_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz'
test_labels_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz'
CACHE_DIR = '/content'
CACHE_SUBDIR = 'fashion_mnist'
for url in [train_images_url, train_labels_url, test_images_url, test_labels_url]:
    tf.keras.utils.get_file(url.split('/')[-1], url, extract=True,
                            cache_dir=CACHE_DIR, cache_subdir=CACHE_SUBDIR,
                            archive_format='zip')
The above code will download the .gz files to /content/fashion_mnist on Google Colab, but not extract them.
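A plain .gz file is a single compressed stream rather than an archive, so neither "tar" nor "zip" extraction applies to it. As a workaround, the standard gzip module can decompress such files directly; a sketch (decompress_gz is a hypothetical helper, not a Keras API):

```python
import gzip
import shutil

def decompress_gz(gz_path, out_path):
    # A .gz file wraps exactly one byte stream, so decompression is a
    # straight copy from the gzip reader to a plain output file.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Applied to each downloaded file, e.g. decompress_gz('/content/fashion_mnist/train-labels-idx1-ubyte.gz', '/content/fashion_mnist/train-labels-idx1-ubyte'), this yields the raw idx files.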
I'm pretty sure the plan was to make this open for contributions. Is that right @jbischof ?
I'd like to contribute, will open a PR soon. @mattdangerw Is it OK?
@Pouyanpi sounds good! @jbischof jump in if there are any more details to know before starting on this.
I haven't tested whether extraction works for other formats like .zip or .tar.bz. This is probably worth looking into.
I'm experiencing a similar issue with the Fashion MNIST data:

train_images_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz'
train_labels_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz'
test_images_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz'
test_labels_url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz'
CACHE_DIR = '/content'
CACHE_SUBDIR = 'fashion_mnist'
for url in [train_images_url, train_labels_url, test_images_url, test_labels_url]:
    tf.keras.utils.get_file(url.split('/')[-1], url, extract=True,
                            cache_dir=CACHE_DIR, cache_subdir=CACHE_SUBDIR,
                            archive_format='zip')

The above code will download the .gz files to /content/fashion_mnist on Google Colab, but not extract them.
This seems to be a different issue: the .gz format is neither a tarfile nor a zipfile. The method can extract a tar.gz or tgz archive, but not a plain gz file. Thus it is not supported, per documentation and design.
Please see ref.
Just to confirm:
>>> from pathlib import Path
>>> import tarfile
>>> import zipfile
>>> from urllib.request import urlretrieve
>>>
>>> url = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz'
>>> fpath = 'downloaded.gz'
>>> urlretrieve(url, fpath)
>>> Path(fpath).is_file()
True
>>> tarfile.is_tarfile(fpath)
False
>>> zipfile.is_zipfile(fpath)
False
According to the documentation:
By default the file at the url `origin` is downloaded to the
cache_dir `~/.keras`, placed in the cache_subdir `datasets`,
and given the filename `fname`. The final location of a file
`example.txt` would therefore be `~/.keras/datasets/example.txt`.
The issue does not seem to be in extraction (the _extract_archive method) or related to the extract and archive_format arguments, since the extracted files can be found at datadir, which is cache_dir/cache_subdir.
However, the implementation does not conform to the fname description:

If an absolute path `/path/to/file.txt` is specified the file will be saved at that location.

Is this the intended usage? If so, I can write a fix. This was clearly the case in the colab notebook, also based on the code. Am I missing something?
Thanks.
P.S. there are 2 cases:
- if fname is a file name, the downloaded file is stored at datadir
- if fname is an absolute path to a file, the downloaded file is stored at fname
However, in both cases the files are extracted at datadir.
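The two cases above can be sketched as a small path-resolution helper (an illustration of the documented behavior, not Keras's actual implementation):

```python
import os

def resolve_download_path(fname, cache_dir="~/.keras", cache_subdir="datasets"):
    # An absolute fname wins; otherwise the file lands in
    # datadir = cache_dir/cache_subdir, as the docstring describes.
    if os.path.isabs(fname):
        return fname
    datadir = os.path.join(os.path.expanduser(cache_dir), cache_subdir)
    return os.path.join(datadir, fname)
```

Under this reading, the absolute-path case stores the archive at fname while extraction still targets datadir, which is the mismatch discussed here.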
but the implementation does not conform to the fname description: "If an absolute path `/path/to/file.txt` is specified the file will be saved at that location."

My mistake, it actually does: the file is downloaded and stored at that location. But it is also possible to return the path to the extracted file, i.e., datadir.
there are 2 cases:
- if fname is a file name, the downloaded file is stored at datadir
- if fname is an absolute path to a file, the downloaded file is stored at fname
However, in both cases the files are extracted at datadir.
@mattdangerw would you confirm this and also comment on how to resolve it? Currently the file is extracted to a different location than where it is stored.
I think ideally the method should return a path to the extracted file @Pouyanpi. In my colab we see that gzip_file points to the downloaded file /content/bert_base_uncased.tar.gz.
However, if extraction is not an issue, why does the utility create a copy of the gzip file in the same directory? I do not see any extracted file. I am passing a path for the fname arg.
@Pouyanpi are you still interested in working on this bug? Please let us know and thanks for your contributions!
Duplicate of #17177, which I opened to allow for a new contributor.