extractcode's behaviour and error message output on damaged archives
When running extractcode on gcc-4.9 (download here: https://packages.debian.org/jessie/all/gcc-4.9-source/download), extractcode fails with these error messages:
nakami@debian:~/Downloads/scancode-toolkit-developNEW$ ./extractcode samples/gcc-4.9-source_4.9.2-10_all.deb
Extracting archives...
[####################################]
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
Extracting done.
With the --verbose flag the error messages look like this:
nakami@debian:~/Downloads/scancode-toolkit-developNEW$ ./extractcode --verbose samples/gcc-4.9-source_4.9.2-10_all.deb
Extracting archives...
Extracting: gcc-4.9-source_4.9.2-10_all.deb
[...]
Extracting: changelog.Debian.gz
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
Extracting done.
Are you aware of this problem? It seems that either the nesting depth of further archives within the initial archive, the depth those archives have themselves, or both depths summed up is the problem here.
Thanks! I was not aware of this problem. There are interesting cases there, as some of these archives may be damaged archives used by GCC for testing, or these could be issues in extractcode.
Interesting! Yes, this makes sense. How might someone approach this? Obviously, if such files are used for testing within the component and are faulty on purpose, we should avoid extracting those.
does extractcode continue the job after failing on such corrupt archive files?
> does extractcode continue the job after failing on such corrupt archive files?
Yes, the extraction is never interrupted at large; it only chokes on faulty archives (and usually tries a best effort in these cases, but obviously it is not trying hard enough here) and then keeps on trucking on the rest.
> how might someone approach this?
This is not entirely trivial... I guess there are a couple of ways. One idea would be to combine extraction with file classification: e.g. if a failed-to-extract archive is part of a directory classified as "test" files, then the error could be re-qualified as a warning or silenced.
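A rough sketch of that idea (the path pattern, the function name, and the message format are all made up for illustration; this is not ScanCode's actual API):

```python
import re

# Hypothetical test-data detector: any path segment named like a common
# test directory downgrades an extraction error to a warning.
TEST_DIR_RE = re.compile(r'(^|/)(test|tests|testdata|testsuite)(/|$)', re.IGNORECASE)


def qualify_extraction_error(path, error):
    """Return (severity, message) for a failed extraction.

    Severity is 'warning' when the archive sits under a test-looking
    directory, 'error' otherwise.
    """
    severity = 'warning' if TEST_DIR_RE.search(path) else 'error'
    return severity, f'{severity.upper()} extracting: {path}: {error!r}'
```

For example, the gcc testdata archives above would be reported as warnings, while the same failure on a path like `src/main/archive.tar` would stay an error.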
Another idea is to brute-force the problem and maintain a list of test archives known to fail extracting, for the (few?) extraction (or other) tools that would keep such test files, e.g. gcc, tar, infozip, gunzip, libarchive...
Yet another way would be to bypass the problem entirely: if GCC 4.9 had been pre-scanned and that scan peer-reviewed by the community, then the fact that it has some test files that do not extract correctly becomes moot, does it not?
So here is the deal:
- [...]gcc-4.9.2/libgo/go/archive/zip/testdata/test-trailing-junk.zip extracts alright with unzip and roller. So this is a bug and it should extract correctly. I will use that as a new test file.
- [...]/usr/src/gcc-4.9/gcc-4.9.2-dfsg.tar.xz-extract/gcc-4.9.2/libgo/go/archive/tar/testdata/writer-big.tar fails to extract with tar:
$ tar -xf writer-big.tar
tar: Unexpected EOF in archive
tar: rmtlseek not stopped at a record boundary
tar: Error is not recoverable: exiting now
This looks like a damaged-on-purpose test file. It should however extract correctly, or at least partially, with extractcode. It may contain a fake file pretending to be a super big 16GB file. This is another test case.
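For what it's worth, the best-effort behaviour on a truncated tar can be sketched with Python's tarfile module (an illustration of the idea, not extractcode's actual code): extract members one by one until the truncation is hit, and record the failure instead of aborting the whole archive.

```python
import tarfile

def extract_best_effort(archive_path, dest):
    """Extract what can be read from a possibly-truncated tar.

    Returns (list of extracted member names, error message or None).
    """
    extracted, error = [], None
    try:
        with tarfile.open(archive_path) as tf:
            for member in tf:
                tf.extract(member, path=dest)
                extracted.append(member.name)
    except (tarfile.ReadError, EOFError, OSError) as e:
        # e.g. tarfile.ReadError('unexpected end of data')
        error = str(e)
    return extracted, error
```

Members located before the cut still come out intact; only the damaged tail is lost, which is arguably the right outcome for a file like writer-big.tar.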
- [...]usr/src/gcc-4.9/gcc-4.9.2-dfsg.tar.xz-extract/gcc-4.9.2/libgo/go/compress/gzip/testdata/issue6550.gz reports as an encrypted gzip file:
$ file *
issue6550.gz: gzip compressed data, extra field, encrypted
and gunzip fails to decompress it:
$ gunzip -k issue6550.gz
gzip: issue6550.gz is encrypted -- not supported
Here extractcode would never be able to process it alright because of the encryption. Yet we should likely report a warning and a more explicit message instead.
Note that all these are test files used in GCC, so overall these are annoying yet probably not critical issues.
What's your take?
Thank you for your comprehensive response.
> yes, the extraction is never interrupted at large
This was my main concern. I'm glad extractcode behaves this way.
Also, I already do get some information about the errors. I don't think that we're on the wrong track here - I rather misinterpreted the error messages as critical. Honestly, I'd leave extractcode's behaviour as it is. However, further information in the error messages might be handy, like you said. I'd really like to see a summary of what went on at the end, e.g. "extracted n files successfully, failed on m files".
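The suggested summary line is cheap to produce once per-file results are collected; a minimal sketch (the result-tuple shape is invented for illustration):

```python
def summarize(results):
    """results: iterable of (path, error_or_None) pairs from extraction."""
    ok = sum(1 for _, err in results if err is None)
    failed = sum(1 for _, err in results if err is not None)
    return f'extracted {ok} files successfully, failed on {failed} files'
```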
For the detection after a failed extraction attempt you might, as you mentioned:
- check the path or archive name for whether it looks suspicious (a regex on the directory or archive name?) (probably the easiest)
- compute a checksum of the archive and compare it against a list of known test files (or problematic files overall?) (I don't even know whether it is possible to create a valid checksum if the file is damaged in some way) (this might create a large database and therefore doesn't sound worthwhile for this minor suggestion)
- maintain a catalog of (big) components and which files should be ignored (I think this is huge overkill, too)
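On the second point: hashing a damaged archive is always possible, since a checksum is computed over the raw bytes regardless of whether they form a valid archive. A sketch of the lookup (the known-checksum table is a made-up placeholder, not a real catalog):

```python
import hashlib

# Placeholder catalog mapping SHA-256 hex digests of known damaged test
# archives to a short description; a real list would be curated per tool.
KNOWN_TEST_SHA256 = {}


def is_known_test_archive(path):
    """Hash the file in chunks and look it up in the known-test catalog."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            h.update(chunk)
    return h.hexdigest() in KNOWN_TEST_SHA256
```

The catalog size concern stands, though: it would need an entry per damaged file per version of each upstream component.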
I will update the title of the issue as it is misleading with the current state of knowledge. Furthermore I don't have any more input. Thank you kindly.
In terms of resolution, I think this would work best:
1. classify code to detect whether it is likely part of some test data set (this has many other benefits and applications besides archive extraction: for instance, when a license is detected in test files and data vs. the main code, its relative impact and importance is lesser, as the test code would not typically be part of a redistributed production build)
2. better report the importance of extraction failures (including updating the documentation), taking into account the classification in 1.
3. update the extractcode doc to better explain the behaviour in case of failures
4. fix the bugs that exist anyway for some of the test archives at issue here, as some should still be extracted without error.