trufflehog icon indicating copy to clipboard operation
trufflehog copied to clipboard

Archive handler error: `matching $format: read /tmp/tmp_archive123456: is a directory`

Open rgmz opened this issue 5 months ago • 3 comments

Please review the Community Note before submitting

TruffleHog Version

3.67.5

Trace Output

2024-02-11T20:04:34-05:00       error   trufflehog      error unarchiving chunk.        {"source_manager_worker_id": "Ap0tp", "repo": "https://github.com/Mzack9999/subnet.git", "commit": "4bcb643", "path": "src/golang.org/x/tools/go/gcimporter15/testdata/versions/test_go1.7_1.a", "timeout": 30, "error": "matching bz2: read /tmp/tmp_archive3933205369: is a directory"}

Expected Behavior

The archive handler should not call archive.Identify on directories.

Actual Behavior

Frequent errors like matching bz2: read /tmp/tmp_archive3933205369: is a directory are printed to the console.

Steps to Reproduce

Scan https://github.com/Mzack9999/subnet.git.

Environment

N/A

Additional Context

The error comes from the underlying mbolt/archiver library, specifically the Indentify function.

So far I have only encountered this with .a files. I haven't seen it with .rpm files, but I also haven't explicitly tested it with those either. This may be a slightly different problem from #2071 or the fix was incomplete. I am far too sick to properly investigate at the moment (sorry).

Edit: I've also encountered it with .rlib and .rpm files.

2024-02-11T22:40:42-05:00       error   trufflehog      error unarchiving chunk.        {"source_manager_worker_id": "MugHm", "repo": "https://github.com/bevyengine/bevy.git", "commit": "a5a7edf", "path": "crates/glsl-to-spirv/target/debug/deps/libtypenum-fb88411ee80f9742.rlib", "timeout": 30, "error": "matching sz: read /tmp/tmp_archive3144325389: is a directory"}
...
2024-02-12T10:17:06-05:00       error   trufflehog      error unarchiving chunk.        {"source_manager_worker_id": "rMZzm", "repo": "https://github.com/elastic/beats.git", "commit": "815ef78", "path": "dev-tools/vendor/github.com/cavaliercoder/go-rpm/testdata/epel-release-7-5.noarch.rpm", "timeout": 30, "error": "matching lz4: read /tmp/tmp_archive1073985454: is a directory"}

Edit2: stack trace from inserting a manual panic

panic: matching xz: read /tmp/tmp_archive3262228446: is a directory

goroutine 1018 [running]:
github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).openArchive(0xc0214637d0, {0x3ee7680?, 0xc0212bc630}, 0x0, {0x3eac220, 0xc01f4cc178}, 0xc01f497bc0)
        /tmp/trufflehog/pkg/handlers/archive.go:114 +0x59b
github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).FromFile.func1()
        /tmp/trufflehog/pkg/handlers/archive.go:86 +0x1da
created by github.com/trufflesecurity/trufflehog/v3/pkg/handlers.(*Archive).FromFile in goroutine 574
        /tmp/trufflehog/pkg/handlers/archive.go:81 +0xf6

References

  • #2071

rgmz avatar Feb 12 '24 01:02 rgmz

@ahrav @bugbaba The "is a directory" error is surprisingly convoluted, but I think I've narrowed down the cause. #2178 didn't fix the root cause, only reduced the likelihood of it happening.

  1. handlers.processHandler calls Archive.HandleSpecialized https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/handlers.go#L72-L74
  2. HandleSpecialized calls either extractDebContent or extractRpmContent https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L328
  3. extractDebContent/extractRpmContent creates a directory in /tmp/ and extracts the contents to it https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L354-L361
  4. extractDebContent/extractRpmContent then calls handleExtractedFiles https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L374
  5. handleExtractedFiles iterates through each file in the temporary directory and tries to assign a value to dataArchiveName https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L481-L500
  6. Each iteration calls the handleFile func (directories are skipped) https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L367-L372
  7. The handleFile func calls handleNestedFileMIME
  8. handleNestedFileMIME returns an empty string due to 4 possible cases https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L422-L425 https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L428-L431 https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L433-L437 https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L440-L442
  9. No value is assigned to dataArchiveName, so handleExtractedFiles returns "" https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L503
  10. extractDebContent/extractRpmContent returns openDataArchive, which opens the path /tmp/generatedDirName + dataArchiveName ("") and returns a io.ReadCloser. https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L379
  11. extractDebContent/extractRpmContent returns that io.ReadCloser to HandleSpecialized
  12. HandleSpecialized returns that io.ReadCloser to handlers.processHandler.
  13. processHandler calls archive.FromFile https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/handlers.go#L74-L77
  14. FromFile calls openArchive https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L75-L86
  15. openArchive calls mbolt/archiver.Identify https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/archive.go#L98-L107
  16. Identify goes "wtf, this is a directory and not a file" https://github.com/mholt/archiver/blob/81f9e06b11ad6ba424f8311c0bc18ceb01f2b67a/formats.go#L53

What does all this mean?

I don't know. I understand how this happens, but not why nor what the proper fix is. The obvious solution is to throw in another IsDir() check somewhere, but at this point I'm not sure if the contents of these specialized archives are actually being scanned. (e.g., I don't understand why handleExtractedFiles iterates through each file but only returns a single value for dataArchiveName.)

It seems to me that the core design issue is that archive.FromFile assumes it's being given a file, but archive.HandleSpecialized decompresses archives into a directory first.

https://github.com/trufflesecurity/trufflehog/blob/b69e2c6cc1a5af76b34c594ec8963845bf00f37e/pkg/handlers/handlers.go#L74-L77

rgmz avatar Feb 13 '24 00:02 rgmz

I think we are planning on re-working the archive handler in general. There is too much recursion for my liking 😅 Also as you alluded to certain archive types that aren't .deb get un-archived correctly but then our logic to try and pass the unarchived directory to the regular archive handler fails, since it only works with files.

ahrav avatar Mar 05 '24 18:03 ahrav

There is too much recursion for my liking 😅

Tbh, I'm not sure how you would handle directories without recursion. The logic could definitely be cleaner.

Also as you alluded to certain archive types that aren't .deb get un-archived correctly but then our logic to try and pass the unarchived directory to the regular archive handler fails, since it only works with files.

Perhaps this is a naive suggestion, but would modifying FromFile to accept both files and directories (i.e., walk them) fix the issue? It would avoid a large rewrite of the current logic/API.

rgmz avatar Apr 07 '24 16:04 rgmz