trufflehog
Archive handler error: `matching $format: read /tmp/tmp_archive123456: is a directory`
TruffleHog Version
3.67.5
Trace Output
2024-02-11T20:04:34-05:00 error trufflehog error unarchiving chunk. {"source_manager_worker_id": "Ap0tp", "repo": "https://github.com/Mzack9999/subnet.git", "commit": "4bcb643", "path": "src/golang.org/x/tools/go/gcimporter15/testdata/versions/test_go1.7_1.a", "timeout": 30, "error": "matching bz2: read /tmp/tmp_archive3933205369: is a directory"}
Expected Behavior
The archive handler should not call `archive.Identify` on directories.
Actual Behavior
Frequent errors like `matching bz2: read /tmp/tmp_archive3933205369: is a directory` are printed to the console.
Steps to Reproduce
Scan https://github.com/Mzack9999/subnet.git.
Environment
N/A
Additional Context
The error comes from the underlying mholt/archiver library, specifically the `Identify` function.
So far I have only encountered this with `.a` files. I haven't seen it with `.rpm` files, but I also haven't explicitly tested it with those either. This may be a slightly different problem from #2071, or the fix was incomplete. I am far too sick to properly investigate at the moment (sorry).
Edit: I've also encountered it with `.rlib` and `.rpm` files.
2024-02-11T22:40:42-05:00 error trufflehog error unarchiving chunk. {"source_manager_worker_id": "MugHm", "repo": "https://github.com/bevyengine/bevy.git", "commit": "a5a7edf", "path": "crates/glsl-to-spirv/target/debug/deps/libtypenum-fb88411ee80f9742.rlib", "timeout": 30, "error": "matching sz: read /tmp/tmp_archive3144325389: is a directory"}
...
2024-02-12T10:17:06-05:00 error trufflehog error unarchiving chunk. {"source_manager_worker_id": "rMZzm", "repo": "https://github.com/elastic/beats.git", "commit": "815ef78", "path": "dev-tools/vendor/github.com/cavaliercoder/go-rpm/testdata/epel-release-7-5.noarch.rpm", "timeout": 30, "error": "matching lz4: read /tmp/tmp_archive1073985454: is a directory"}
Edit2: stack trace from inserting a manual panic
panic: matching xz: read /tmp/tmp_archive3262228446: is a directory
goroutine 1018 [running]:
github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).openArchive(0xc0214637d0, {0x3ee7680?, 0xc0212bc630}, 0x0, {0x3eac220, 0xc01f4cc178}, 0xc01f497bc0)
/tmp/trufflehog/pkg/handlers/archive.go:114 +0x59b
github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).FromFile.func1()
/tmp/trufflehog/pkg/handlers/archive.go:86 +0x1da
created by github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).FromFile in goroutine 574
/tmp/trufflehog/pkg/handlers/archive.go:81 +0xf6
References
- #2071
@ahrav @bugbaba The "is a directory" error is surprisingly convoluted, but I think I've narrowed down the cause. #2178 didn't fix the root cause, only reduced the likelihood of it happening.
- `handlers.processHandler` calls `Archive.HandleSpecialized` https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/handlers.go#L72-L74
- `HandleSpecialized` calls either `extractDebContent` or `extractRpmContent` https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L328
- `extractDebContent`/`extractRpmContent` creates a directory in /tmp/ and extracts the contents to it https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L354-L361
- `extractDebContent`/`extractRpmContent` then calls `handleExtractedFiles` https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L374
- `handleExtractedFiles` iterates through each file in the temporary directory and tries to assign a value to `dataArchiveName` https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L481-L500
- Each iteration calls the `handleFile` func (directories are skipped) https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L367-L372
- The `handleFile` func calls `handleNestedFileMIME`
- `handleNestedFileMIME` returns an empty string due to 4 possible cases https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L422-L425 https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L428-L431 https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L433-L437 https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L440-L442
- No value is assigned to `dataArchiveName`, so `handleExtractedFiles` returns `""` https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L503
- `extractDebContent`/`extractRpmContent` returns `openDataArchive`, which opens the path `/tmp/generatedDirName` + `dataArchiveName` (`""`) and returns an `io.ReadCloser`. https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L379
- `extractDebContent`/`extractRpmContent` returns that `io.ReadCloser` to `HandleSpecialized`
- `HandleSpecialized` returns that `io.ReadCloser` to `handlers.processHandler`.
- `processHandler` calls `archive.FromFile` https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/handlers.go#L74-L77
- `FromFile` calls `openArchive` https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L75-L86
- `openArchive` calls `mholt/archiver.Identify` https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L98-L107
- `Identify` goes "wtf, this is a directory and not a file" https://github.com/mholt/archiver/blob/81f9e06b11ad6ba424f8311c0bc18ceb01f2b67a/formats.go#L53
What does all this mean?
I don't know. I understand how this happens, but not why, nor what the proper fix is. The obvious solution is to throw in another `IsDir()` check somewhere, but at this point I'm not sure if the contents of these specialized archives are actually being scanned. (e.g., I don't understand why `handleExtractedFiles` iterates through each file but only returns a single value for `dataArchiveName`.)
It seems to me that the core design issue is that `archive.FromFile` assumes it's being given a file, but `archive.HandleSpecialized` decompresses archives into a directory first.
https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/handlers.go#L74-L77
I think we are planning on re-working the archive handler in general. There is too much recursion for my liking 😅 Also, as you alluded to, certain archive types that aren't .deb get un-archived correctly, but then our logic to try and pass the unarchived directory to the regular archive handler fails, since it only works with files.
> There is too much recursion for my liking 😅
Tbh, I'm not sure how you would handle directories without recursion. The logic could definitely be cleaner.
> Also, as you alluded to, certain archive types that aren't .deb get un-archived correctly, but then our logic to try and pass the unarchived directory to the regular archive handler fails, since it only works with files.
Perhaps this is a naive suggestion, but would modifying `FromFile` to accept both files and directories (i.e., walk them) fix the issue? It would avoid a large rewrite of the current logic/API.