cached_path
Caching a single file in a tar/zip should not extract all files
Is your feature request related to a problem? Please describe.
When a single file is requested from a tar/zip archive, every file in the archive is currently extracted with tarfile.extractall, even though only one file was asked for. This bloats the cache directory with unnecessary files. In the example below I request only dummy.txt, but folder is extracted as well:
>>> import cached_path
>>> path = "https://github.com/allenai/cached_path/raw/refs/heads/main/test_fixtures/utf-8_sample/archives/utf-8.tar.gz!dummy.txt"
>>> f = cached_path.cached_path(path, extract_archive=True, quiet=True)
>>> [x.name for x in f.parent.iterdir()]
['dummy.txt', 'folder']
Describe the solution you'd like
Extract only the file requested using the tarfile.extract function rather than the tarfile.extractall function.
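To illustrate the idea, here is a minimal sketch of single-member extraction using only the standard library. The helper name extract_single_member and its parameters are hypothetical and are not part of cached_path; how this would be wired into the library's cache layout is left open. The filter argument is passed only when the running Python supports extraction filters.

```python
import tarfile
from pathlib import Path


def extract_single_member(archive_path, member_name, dest_dir):
    """Hypothetical helper: extract only `member_name` from a tar archive."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Extraction filters exist only on newer Python versions; the attribute
    # check below is the documented way to detect support for them.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path) as tar:
        # getmember raises KeyError if the name is not in the archive.
        member = tar.getmember(member_name)
        # Unlike extractall, extract writes just this one member.
        tar.extract(member, path=dest, **kwargs)
    return dest / member.name
```

With a local copy of the archive from the example, extract_single_member("utf-8.tar.gz", "dummy.txt", "out") should leave only dummy.txt in out. For zip archives, zipfile.ZipFile.extract plays the same role.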
Describe alternatives you've considered
None
Additional context
None