
Check uncompressed size before extracting archive entries

Open Smascer opened this issue 4 years ago • 6 comments

Some archives can contain very large files, e.g. https://github.com/gcc-mirror/gcc/releases/tag/releases%2Fgcc-9.4.0 (with testdata), which contains tars, two of which are 60 GB. Extractcode extracts them by default.

Would it be possible to add a size limit for these kinds of files, like an ignore option? Or maybe to set a limit on the maximum uncompressed size of the whole archive.

Smascer avatar Jul 28 '21 13:07 Smascer

In libarchive2.py you can adapt the write method of the Entry class with something like this:

if self.size > MAX_ENTRY_SIZE: return

where MAX_ENTRY_SIZE is set, e.g., to 524288000 (500 MB), to skip all these big files.
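A minimal sketch of that guard, assuming a simplified Entry class; MAX_ENTRY_SIZE, the `skipped` flag, and the class shape here are illustrative assumptions, not extractcode's actual API:

```python
# Hypothetical sketch of the suggested size guard in an Entry.write-style
# method. Names are assumptions for illustration only.
MAX_ENTRY_SIZE = 524288000  # 500 MB

class Entry:
    def __init__(self, name, size):
        self.name = name
        self.size = size  # declared uncompressed size from the archive header
        self.skipped = False

    def write(self, target_dir):
        # Skip entries whose declared uncompressed size exceeds the limit.
        if self.size > MAX_ENTRY_SIZE:
            self.skipped = True
            return
        # ... normal extraction to target_dir would happen here ...
```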

Smascer avatar Aug 09 '21 13:08 Smascer

Would it be possible to add a size limit for these kinds of files, like an ignore option? Or maybe to set a limit on the maximum uncompressed size of the whole archive.

Sure thing. I would like to make it available everywhere as an argument, though with the caveat that in some cases we cannot know the uncompressed size before effectively decompressing.

What API and behaviour do you think this should have?

pombredanne avatar Aug 09 '21 13:08 pombredanne

From what I've gathered, the uncompressed size is only available in libarchive but not in, e.g., 7z? It might then be nice to be able to skip those completely.

So there would be three modes:

  1. Extract all archives normally
  2. Skip too-large archives where the size information is available
  3. Skip too-large archives where the size information is available, and do not extract at all when it is not
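The three modes above could be captured in a small decision function; the mode names and the convention that an unknown size is `None` are assumptions based on this discussion, not an existing extractcode API:

```python
# Sketch of the three proposed extraction modes as a pure decision function.
# All names here are hypothetical.
ALWAYS, SKIP_KNOWN_LARGE, SKIP_LARGE_OR_UNKNOWN = range(3)

def should_extract(uncompressed_size, max_size, mode):
    """Decide whether to extract. uncompressed_size is None when the
    backend (e.g. 7z) cannot report it up front."""
    if mode == ALWAYS:
        return True
    if uncompressed_size is None:
        # Mode 2 extracts when the size is unknown; mode 3 refuses.
        return mode == SKIP_KNOWN_LARGE
    return uncompressed_size <= max_size
```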

Ben-Thelen avatar Aug 10 '21 06:08 Ben-Thelen

I would say the best way is to write all entries by default. But if you set a value via the CLI, e.g. extractcode --max-archive-size 512 (meaning 512 MB), it would be passed as an argument everywhere and checked before writing, so the entry may be skipped.
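A minimal sketch of how such a flag could be parsed, using argparse for illustration; extractcode's real CLI is built differently, and the flag name is simply the one proposed above:

```python
# Hypothetical sketch of the proposed --max-archive-size flag (value in MB).
import argparse

def parse_max_size(argv):
    """Return the size limit in bytes, or None when no limit is set."""
    parser = argparse.ArgumentParser(prog="extractcode")
    parser.add_argument(
        "--max-archive-size", type=int, default=None, metavar="MB",
        help="Skip entries larger than this many megabytes (default: no limit).")
    args = parser.parse_args(argv)
    if args.max_archive_size is None:
        return None
    return args.max_archive_size * 1024 * 1024  # convert MB to bytes
```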

Smascer avatar Aug 10 '21 06:08 Smascer

From what I've gathered, the uncompressed size is only available in libarchive but not in, e.g., 7z? It might then be nice to be able to skip those completely.

FWIW, we may be able to get that also from 7-zip-supported archives, since we can parse a directory listing: https://github.com/nexB/extractcode/blob/533ac8a7cf9d83c9fb43600b6b952a62da9acc12/src/extractcode/sevenzip.py#L697

But the other approach may be to start writing in chunks until a max size is reached, then abort/rollback in these cases AND return a warning/error with the "extract event" stating that this file was not extracted because of a threshold limit.
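That chunked write-then-rollback idea could look roughly like this; the warning tuple stands in for extractcode's "extract event", and all names here are assumptions:

```python
# Sketch: write an entry's stream in chunks, abort and roll back once a
# size threshold is exceeded. Useful when the uncompressed size is not
# known up front. All names are hypothetical.
import os

CHUNK_SIZE = 1024 * 1024  # 1 MB

def write_limited(stream, target_path, max_bytes):
    """Copy stream to target_path. If more than max_bytes would be
    written, delete the partial file and return a warning tuple;
    return None on a complete extraction."""
    written = 0
    with open(target_path, "wb") as out:
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:
                return None  # extracted fully, no warning
            written += len(chunk)
            if written > max_bytes:
                break
            out.write(chunk)
    os.remove(target_path)  # roll back the partial extraction
    return (target_path, "not extracted: size exceeds threshold")
```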

pombredanne avatar Aug 10 '21 13:08 pombredanne

There is a related issue with a 60GB sparse file, reported in https://github.com/nexB/extractcode/issues/32 by @goekDil. For all I know, I would not be surprised if this is the exact same file :)

pombredanne avatar Oct 08 '21 16:10 pombredanne