trivy fix(misconf): improve file analysis in defsec

Today, files are classified mostly by file extension rather than by their actual content. This sometimes leads to misidentification: a file whose extension matches is picked up even though it has no relevant content.

The relevant logic is here: https://github.com/aquasecurity/trivy-iac/blob/main/pkg/detection/detect.go

We need to think about how to do this with performance in mind, as there will be many files to scan.

May 05 '23 15:05 simar7

Another Dockerfile-specific improvement to file analysis: https://github.com/aquasecurity/trivy/discussions/5526

Nov 07 '23 01:11 simar7

We can detect Dockerfile by a syntax directive (covers a small percentage of all files):

https://github.com/search?q=%22%23+syntax%3Ddocker%2Fdockerfile%3A%22&type=code
https://github.com/search?q=%22%23+syntax%3Ddocker.io%2Fdocker%2Fdockerfile%22&type=code
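
A minimal sketch of the directive check, for illustration only. The function names `hasDockerfileSyntaxDirective` and `isDockerfileFrontend` are hypothetical and not part of trivy-iac. It assumes that only the `docker/dockerfile` frontend images (with or without the `docker.io/` prefix) count as a match, and that the caller passes in just the first bytes of the file so the check stays cheap.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"regexp"
	"strings"
)

// directiveRe matches a parser directive line such as
// "# syntax=docker/dockerfile:1" and captures its name and value.
var directiveRe = regexp.MustCompile(`^#\s*([a-zA-Z][a-zA-Z0-9]*)\s*=\s*(.+?)\s*$`)

// hasDockerfileSyntaxDirective reports whether the head of a file starts with
// a "# syntax=" parser directive pointing at a Dockerfile frontend.
// Parser directives are only valid at the very top of the file, so scanning
// stops at the first line that is not a known directive.
func hasDockerfileSyntaxDirective(head []byte) bool {
	sc := bufio.NewScanner(bytes.NewReader(head))
	for sc.Scan() {
		m := directiveRe.FindStringSubmatch(sc.Text())
		if m == nil {
			return false // comment, blank line, or instruction: directives are over
		}
		switch strings.ToLower(m[1]) {
		case "syntax":
			return isDockerfileFrontend(m[2])
		case "escape", "check":
			continue // other known directives may precede "syntax"
		default:
			return false // unknown directives are treated as plain comments
		}
	}
	return false
}

// isDockerfileFrontend is an assumption of this sketch: it accepts only
// images whose name starts with docker/dockerfile.
func isDockerfileFrontend(image string) bool {
	image = strings.TrimPrefix(strings.ToLower(image), "docker.io/")
	return strings.HasPrefix(image, "docker/dockerfile")
}

func main() {
	fmt.Println(hasDockerfileSyntaxDirective([]byte("# syntax=docker/dockerfile:1\nFROM alpine\n")))
	fmt.Println(hasDockerfileSyntaxDirective([]byte("FROM alpine\n")))
}
```

Because the check stops at the first non-directive line, it typically reads a single line per file, which keeps it cheap enough to run before any other detection step.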

As an option, we can use the Naive Bayes algorithm for language classification (dockerfile, k8s...). https://github.com/go-enry/go-enry

Nov 09 '23 09:11 nikpivkin

We can detect Dockerfile by a syntax directive (covers a small percentage of all files):

https://github.com/search?q=%22%23+syntax%3Ddocker%2Fdockerfile%3A%22&type=code

https://github.com/search?q=%22%23+syntax%3Ddocker.io%2Fdocker%2Fdockerfile%22&type=code

This is nice but I don't think we can fully rely on this. Most users will not bother to set this as it's metadata that isn't essential to them but to us. We can certainly add support for this, in case users have it set.

As an option, we can use the Naive Bayes algorithm for language classification (dockerfile, k8s...). https://github.com/go-enry/go-enry

This is interesting, but we should probably benchmark it before we implement it. The file analysis code needs to be simple and performant as it is in the critical path of code scanning.

Nov 10 '23 04:11 simar7

We can detect Dockerfile by a syntax directive (covers a small percentage of all files):

https://github.com/search?q=%22%23+syntax%3Ddocker%2Fdockerfile%3A%22&type=code

https://github.com/search?q=%22%23+syntax%3Ddocker.io%2Fdocker%2Fdockerfile%22&type=code

This is nice but I don't think we can fully rely on this. Most users will not bother to set this as it's metadata that isn't essential to them but to us. We can certainly add support for this, in case users have it set.

We can combine different detection methods (including heuristics). If users use the syntax directive, we can say with certainty that it is a dockerfile and not use the other methods.
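
Roughly, the combination could look like the sketch below. All names here (`Confidence`, `detector`, `bySyntaxDirective`, `byInstructions`, `detectDockerfile`) are hypothetical, and the instruction-counting heuristic and its threshold are assumptions chosen only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Confidence is a hypothetical three-level result for a detection method.
type Confidence int

const (
	Unknown Confidence = iota
	Likely
	Certain
)

// detector inspects file content and says how sure it is that it is a Dockerfile.
type detector func(content string) Confidence

// bySyntaxDirective returns Certain when the first line is a Dockerfile
// syntax directive (simplified: spaces are ignored, tabs are not handled).
func bySyntaxDirective(content string) Confidence {
	first, _, _ := strings.Cut(content, "\n")
	first = strings.ToLower(strings.ReplaceAll(first, " ", ""))
	if strings.HasPrefix(first, "#syntax=docker/dockerfile") ||
		strings.HasPrefix(first, "#syntax=docker.io/docker/dockerfile") {
		return Certain
	}
	return Unknown
}

// byInstructions is a heuristic: a FROM line plus at least one more
// well-known instruction makes the file Likely a Dockerfile.
func byInstructions(content string) Confidence {
	sawFrom, hits := false, 0
	for _, line := range strings.Split(content, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		switch strings.ToUpper(fields[0]) {
		case "FROM":
			sawFrom = true
			hits++
		case "RUN", "COPY", "ADD", "CMD", "ENTRYPOINT", "WORKDIR", "ENV", "EXPOSE", "ARG":
			hits++
		}
	}
	if sawFrom && hits >= 2 {
		return Likely
	}
	return Unknown
}

// detectDockerfile runs the detectors in order and stops as soon as one is Certain.
func detectDockerfile(content string) Confidence {
	best := Unknown
	for _, d := range []detector{bySyntaxDirective, byInstructions} {
		c := d(content)
		if c == Certain {
			return Certain
		}
		if c > best {
			best = c
		}
	}
	return best
}

func main() {
	fmt.Println(detectDockerfile("# syntax=docker/dockerfile:1\nFROM alpine\n") == Certain)
	fmt.Println(detectDockerfile("FROM alpine\nRUN echo hi\n") == Likely)
}
```

Putting the directive check first means files that carry it never pay for the heuristics.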

As an option, we can use the Naive Bayes algorithm for language classification (dockerfile, k8s...). https://github.com/go-enry/go-enry

This is interesting, but we should probably benchmark it before we implement it. The file analysis code needs to be simple and performant as it is in the critical path of code scanning.

go-enry contains a tokeniser that is written in C and runs quite fast.

Nov 10 '23 06:11 nikpivkin

We have done our utmost to avoid relying on CGO. One of the reasons people like Trivy is its easy installation, and we cannot give that up for performance, although performance is also important. We don't need to support that many file types, so we may want to write a simplified version of go-enry.
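
To give a rough idea of how small a CGO-free classifier for a handful of file types could be, here is a toy multinomial Naive Bayes in pure Go. It is not go-enry's implementation: the `classifier` type, the whitespace tokenizer, and the two training samples are all assumptions made for this sketch, and a real version would need a proper training corpus and benchmarks.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// classifier is a toy multinomial Naive Bayes model with add-one smoothing.
type classifier struct {
	tokenCounts map[string]map[string]int // class -> token -> count
	totalTokens map[string]int            // class -> number of tokens seen
	docs        map[string]int            // class -> number of training documents
	vocab       map[string]struct{}       // all tokens seen in training
	totalDocs   int
}

func newClassifier() *classifier {
	return &classifier{
		tokenCounts: map[string]map[string]int{},
		totalTokens: map[string]int{},
		docs:        map[string]int{},
		vocab:       map[string]struct{}{},
	}
}

// tokenize splits on whitespace; a deliberately naive stand-in for a real tokenizer.
func tokenize(s string) []string { return strings.Fields(strings.ToLower(s)) }

func (c *classifier) train(class, content string) {
	if c.tokenCounts[class] == nil {
		c.tokenCounts[class] = map[string]int{}
	}
	c.docs[class]++
	c.totalDocs++
	for _, t := range tokenize(content) {
		c.tokenCounts[class][t]++
		c.totalTokens[class]++
		c.vocab[t] = struct{}{}
	}
}

// classify returns the class with the highest log-probability for content.
func (c *classifier) classify(content string) string {
	tokens := tokenize(content)
	best, bestScore := "", math.Inf(-1)
	for class, n := range c.docs {
		score := math.Log(float64(n) / float64(c.totalDocs)) // class prior
		denom := float64(c.totalTokens[class] + len(c.vocab))
		for _, t := range tokens {
			score += math.Log(float64(c.tokenCounts[class][t]+1) / denom)
		}
		if score > bestScore {
			best, bestScore = class, score
		}
	}
	return best
}

// newDemoClassifier trains on two tiny, made-up samples.
func newDemoClassifier() *classifier {
	c := newClassifier()
	c.train("dockerfile", "FROM alpine:3.18\nRUN apk add curl\nCOPY . /app\nCMD [\"/app/run\"]")
	c.train("kubernetes", "apiVersion: v1\nkind: Pod\nmetadata:\n  name: web\nspec:\n  containers:\n  - name: web\n    image: nginx")
	return c
}

func main() {
	c := newDemoClassifier()
	fmt.Println(c.classify("FROM golang:1.21\nRUN go build ./...\nCOPY . /src"))
	fmt.Println(c.classify("apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: api"))
}
```

Classification costs one map lookup per token per class, so with only a few supported file types the overhead should stay small, though that would still need to be confirmed by benchmarking.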

Nov 13 '23 03:11 knqyf263