
Caching a single file in a tar/zip should not extract all files

Open baumgold opened this issue 6 days ago • 0 comments

Is your feature request related to a problem? Please describe. When a single file is requested from a tar/zip archive, every file in the archive is currently extracted with the TarFile.extractall method, even though only one file was requested. This bloats the cache directory with unnecessary files. In the example below, I request only dummy.txt, but folder is extracted as well:

>>> import cached_path
>>> path = "https://github.com/allenai/cached_path/raw/refs/heads/main/test_fixtures/utf-8_sample/archives/utf-8.tar.gz!dummy.txt"
>>> f = cached_path.cached_path(path, extract_archive=True, quiet=True)
>>> [x.name for x in f.parent.iterdir()]
['dummy.txt', 'folder']

Describe the solution you'd like Extract only the requested file, using the TarFile.extract method rather than TarFile.extractall.
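
A rough sketch of what this could look like, using only the standard library. The helper name `extract_single_member` and its parameters are hypothetical and are not part of cached_path; how this would plug into the library's cache layout is left open. It assumes the member name matches the archive entry exactly.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_single_member(archive_path, member_name, dest_dir):
    """Hypothetical helper: extract only `member_name` into `dest_dir`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            # ZipFile.extract writes a single member and returns its path.
            return Path(zf.extract(member_name, path=dest))

    with tarfile.open(archive_path) as tar:
        # getmember raises KeyError if the name is not in the archive.
        member = tar.getmember(member_name)
        # Use the "data" extraction filter where this Python version supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extract(member, path=dest, **kwargs)
    return dest / member.name
```

With something like this, the destination directory for the example above would contain only dummy.txt and not folder.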

Describe alternatives you've considered None

Additional context None

baumgold, Jan 12 '26 03:01