Check uncompressed size before extracting archive entries
Some archives can contain very large files, e.g. https://github.com/gcc-mirror/gcc/releases/tag/releases%2Fgcc-9.4.0 (with testdata), which contains tars, two of which are 60 GB each. extractcode extracts them by default.
Would it be possible to add a size limit for these kinds of files, like an ignore option? Or maybe a way to set a limit on the maximum uncompressed size of the whole archive?
In libarchive2.py you could adapt the write method of the Entry class with something like this:

```python
if self.size > MAX_ENTRY_SIZE:
    return
```

where MAX_ENTRY_SIZE is set to e.g. 524288000 (500 MiB) to skip all these big files.
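Expanded into a self-contained sketch for illustration (the class shape, read_chunks, and MAX_ENTRY_SIZE here are assumptions, not extractcode's actual internals):

```python
# Illustrative sketch only: extractcode's real Entry class in
# libarchive2.py has a different signature and internals.
MAX_ENTRY_SIZE = 524288000  # 500 MiB

class Entry:
    def __init__(self, path, size, read_chunks):
        self.path = path
        # declared uncompressed size from the archive headers
        self.size = size
        # callable returning an iterator of decompressed byte chunks
        self.read_chunks = read_chunks

    def write(self, target_path):
        # Skip entries whose declared uncompressed size exceeds the limit.
        if self.size and self.size > MAX_ENTRY_SIZE:
            return None
        with open(target_path, "wb") as out:
            for chunk in self.read_chunks():
                out.write(chunk)
        return target_path
```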
> Would it be possible to add a size limit for these kinds of files, like an ignore option? Or maybe a way to set a limit on the maximum uncompressed size of the whole archive?
Sure thing. I would like to make it available everywhere as an argument, though with the caveat that in some cases we cannot know the uncompressed size before actually decompressing.
What API and behaviour do you think this should have?
From what I've gathered, the uncompressed size is only available in libarchive but not in e.g. 7z? It would then be nice to be able to skip those completely.
So there would be three modes (see the enum sketch after this list):
- Extracting all archives normally (the current default)
- Skipping too-large archives where the size information is available
- Skipping too-large archives where the size information is available, and not extracting at all where it is not
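Purely as illustration, the three modes might be named like this (hypothetical names, not an existing extractcode API):

```python
from enum import Enum

class SizeLimitMode(Enum):
    # Hypothetical names for the three behaviours listed above.
    EXTRACT_ALL = "extract-all"      # current default: no size checks
    SKIP_KNOWN_LARGE = "skip-known"  # skip entries whose declared size exceeds the limit
    SKIP_UNKNOWN = "skip-unknown"    # also skip archives that report no size metadata
```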
I would say the best approach is to write all entries by default.
But if you set a limit via the CLI, e.g. extractcode --max-archive-size 512 (for 512 MB), the value would be passed down as an argument everywhere, checked before writing, and entries skipped if too large.
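A minimal sketch of how that option might be declared with click (the option name follows the suggestion above; the wiring is a hypothetical placeholder, not extractcode's actual CLI):

```python
import click

@click.command()
@click.argument("location", type=click.Path(exists=True))
@click.option(
    "--max-archive-size",
    type=int,
    default=None,
    help="Skip writing entries larger than this size in MB. "
         "Extract everything when not set (the default).",
)
def extractcode(location, max_archive_size):
    # Hypothetical wiring: convert MB to bytes and pass the limit down
    # to every extraction backend, which checks it before each write.
    max_entry_size = max_archive_size * 1024 * 1024 if max_archive_size else None
    click.echo(f"extracting {location!r} with size limit {max_entry_size}")

if __name__ == "__main__":
    extractcode()
```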
> From what I've gathered, the uncompressed size is only available in libarchive but not in e.g. 7z? It would then be nice to be able to skip those completely.
FWIW, we may be able to get that also from 7-zip-supported archives since we can parse a directory listing: https://github.com/nexB/extractcode/blob/533ac8a7cf9d83c9fb43600b6b952a62da9acc12/src/extractcode/sevenzip.py#L697
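For example, 7-Zip's technical listing (`7z l -slt`) prints a `Path = ...` / `Size = ...` block per entry, which could be parsed roughly like this (a sketch assuming a `7z` binary on the PATH; extractcode's own parser in sevenzip.py is more complete):

```python
import subprocess

def list_entry_sizes(archive_path):
    # Run 7-Zip's technical listing; per-entry blocks with
    # "Path = ..." and "Size = ..." lines follow a "----------" separator.
    out = subprocess.run(
        ["7z", "l", "-slt", archive_path],
        capture_output=True, text=True, check=True,
    ).stdout
    sizes = {}
    path = None
    for line in out.split("----------", 1)[-1].splitlines():
        if line.startswith("Path = "):
            path = line[len("Path = "):]
        elif line.startswith("Size = ") and path is not None:
            value = line[len("Size = "):].strip()
            # Size can be empty, e.g. for directory entries.
            sizes[path] = int(value) if value else None
            path = None
    return sizes
```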
But another approach may be to start writing in chunks until a max size is reached, then abort/rollback in these cases AND return a warning/error with the "extract event" stating that the file was not extracted because of a threshold limit.
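A sketch of that chunked approach with hypothetical names:

```python
import os

class MaxSizeExceeded(Exception):
    pass

def write_with_limit(chunks, target_path, max_size):
    # Write decompressed chunks while tracking the running total; once
    # the total exceeds max_size, remove the partial file (rollback)
    # and raise so the caller can record a warning "extract event".
    written = 0
    try:
        with open(target_path, "wb") as out:
            for chunk in chunks:
                written += len(chunk)
                if written > max_size:
                    raise MaxSizeExceeded(
                        f"{target_path}: exceeds {max_size} bytes, not extracted"
                    )
                out.write(chunk)
    except MaxSizeExceeded:
        os.remove(target_path)
        raise
```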
There is a related issue with a 60GB sparse file reported in https://github.com/nexB/extractcode/issues/32 by @goekDil. For all I know, I would not be surprised if this is the exact same file :)