
Check uncompressed size before extracting archive entries

Open Smascer opened this issue 4 years ago • 6 comments

Some archives can contain very large files, e.g. https://github.com/gcc-mirror/gcc/releases/tag/releases%2Fgcc-9.4.0 (with testdata), which contains tars, two of which are 60 GB. Extractcode extracts them by default.

Would it be possible to add a size limit for these kinds of files, like an ignore option? Or maybe to set a limit on the maximum uncompressed size of the whole archive.

Smascer avatar Jul 28 '21 13:07 Smascer

In libarchive2.py you can adapt the write method of the Entry class with something like this:

if self.size > MAX_ENTRY_SIZE: return

where MAX_ENTRY_SIZE is set, e.g., to 524288000 (500 MB), to skip all these big files.
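A minimal sketch of that guard, assuming a simplified Entry class; MAX_ENTRY_SIZE, the `skipped` flag, and the class shape here are illustrative assumptions, not extractcode's actual API:

```python
# Hypothetical sketch of the suggested size guard in an Entry.write-style
# method. Names are assumptions for illustration only.
MAX_ENTRY_SIZE = 524288000  # 500 MB

class Entry:
    def __init__(self, name, size):
        self.name = name
        self.size = size  # declared uncompressed size from the archive header
        self.skipped = False

    def write(self, target_dir):
        # Skip entries whose declared uncompressed size exceeds the limit.
        if self.size > MAX_ENTRY_SIZE:
            self.skipped = True
            return
        # ... normal extraction to target_dir would happen here ...
```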

Smascer avatar Aug 09 '21 13:08 Smascer

Would it be possible to add a size limit for these kinds of files, like an ignore option? Or maybe to set a limit on the maximum uncompressed size of the whole archive.

Sure thing. I would like to make it available everywhere as an argument, though with the caveat that in some cases we cannot know the uncompressed size before effectively decompressing.

What API and behaviour do you think this should have?

pombredanne avatar Aug 09 '21 13:08 pombredanne

From what I've gathered, the uncompressed size is only available in libarchive but not in, e.g., 7z? It might then be nice to be able to skip those completely.

So there would be three modes:

  1. Extract all archives normally
  2. Skip too-large archives where the size information is available
  3. Skip too-large archives where the size information is available, and do not extract at all when it is not
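The three modes above could be captured in a small decision function; the mode names and the convention that an unknown size is `None` are assumptions based on this discussion, not an existing extractcode API:

```python
# Sketch of the three proposed extraction modes as a pure decision function.
# All names here are hypothetical.
ALWAYS, SKIP_KNOWN_LARGE, SKIP_LARGE_OR_UNKNOWN = range(3)

def should_extract(uncompressed_size, max_size, mode):
    """Decide whether to extract. uncompressed_size is None when the
    backend (e.g. 7z) cannot report it up front."""
    if mode == ALWAYS:
        return True
    if uncompressed_size is None:
        # Mode 2 extracts when the size is unknown; mode 3 refuses.
        return mode == SKIP_KNOWN_LARGE
    return uncompressed_size <= max_size
```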

Ben-Thelen avatar Aug 10 '21 06:08 Ben-Thelen

I would say the best way is to write all entries by default. But if you set a value via the CLI, e.g. extractcode --max-archive-size 512 (meaning 512 MB), it would be passed as an argument everywhere and checked before writing, so the entry may be skipped.
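A minimal sketch of how such a flag could be parsed, using argparse for illustration; extractcode's real CLI is built differently, and the flag name is simply the one proposed above:

```python
# Hypothetical sketch of the proposed --max-archive-size flag (value in MB).
import argparse

def parse_max_size(argv):
    """Return the size limit in bytes, or None when no limit is set."""
    parser = argparse.ArgumentParser(prog="extractcode")
    parser.add_argument(
        "--max-archive-size", type=int, default=None, metavar="MB",
        help="Skip entries larger than this many megabytes (default: no limit).")
    args = parser.parse_args(argv)
    if args.max_archive_size is None:
        return None
    return args.max_archive_size * 1024 * 1024  # convert MB to bytes
```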

Smascer avatar Aug 10 '21 06:08 Smascer

From what I've gathered, the uncompressed size is only available in libarchive but not in, e.g., 7z? It might then be nice to be able to skip those completely.

FWIW, we may be able to get that also from 7-zip-supported archives, since we can parse a directory listing: https://github.com/nexB/extractcode/blob/533ac8a7cf9d83c9fb43600b6b952a62da9acc12/src/extractcode/sevenzip.py#L697

But the other approach may be to start writing in chunks until a max size is reached, then abort/rollback in these cases AND return a warning/error with the "extract event" stating that this file was not extracted because of a threshold limit.
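That chunked write-then-rollback idea could look roughly like this; the warning tuple stands in for extractcode's "extract event", and all names here are assumptions:

```python
# Sketch: write an entry's stream in chunks, abort and roll back once a
# size threshold is exceeded. Useful when the uncompressed size is not
# known up front. All names are hypothetical.
import os

CHUNK_SIZE = 1024 * 1024  # 1 MB

def write_limited(stream, target_path, max_bytes):
    """Copy stream to target_path. If more than max_bytes would be
    written, delete the partial file and return a warning tuple;
    return None on a complete extraction."""
    written = 0
    with open(target_path, "wb") as out:
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:
                return None  # extracted fully, no warning
            written += len(chunk)
            if written > max_bytes:
                break
            out.write(chunk)
    os.remove(target_path)  # roll back the partial extraction
    return (target_path, "not extracted: size exceeds threshold")
```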

pombredanne avatar Aug 10 '21 13:08 pombredanne

There is a related issue with a 60GB sparse file, reported in https://github.com/nexB/extractcode/issues/32 by @goekDil. For all I know, I would not be surprised if this is the exact same file :)

pombredanne avatar Oct 08 '21 16:10 pombredanne