fix(misconf): improve file analysis in defsec
Today, files are identified mostly by file extension rather than by their actual content. This sometimes leads to misidentification: a file with a matching extension is picked up even though it has no relevant content.
The relevant logic is here: https://github.com/aquasecurity/trivy-iac/blob/main/pkg/detection/detect.go
We need to think about how to do this with performance in mind, as there will be many files that need to be scanned.
Another Dockerfile-specific improvement as part of file analysis: https://github.com/aquasecurity/trivy/discussions/5526
We can detect Dockerfile by a syntax directive (covers a small percentage of all files):
- https://github.com/search?q=%22%23+syntax%3Ddocker%2Fdockerfile%3A%22&type=code
- https://github.com/search?q=%22%23+syntax%3Ddocker.io%2Fdocker%2Fdockerfile%22&type=code
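A minimal sketch of what directive-based detection could look like. The function name `hasDockerfileSyntaxDirective` and the "value contains `dockerfile`" check are assumptions made for illustration, not existing trivy-iac code. It relies on the documented rule that parser directives must appear at the very top of the file, so only the leading lines are inspected.

```go
package main

import (
	"bufio"
	"bytes"
	"regexp"
	"strings"
)

// directiveRe matches a parser directive line such as
// "# syntax=docker/dockerfile:1". Whitespace around "#" and "=" is allowed.
var directiveRe = regexp.MustCompile(`^#\s*([a-zA-Z][a-zA-Z0-9]*)\s*=\s*(.+?)\s*$`)

// hasDockerfileSyntaxDirective (hypothetical helper) reports whether content
// starts with a "syntax" parser directive whose value mentions "dockerfile".
// Scanning stops at the first line that is not a directive, because
// directives are only recognized at the top of the file.
func hasDockerfileSyntaxDirective(content []byte) bool {
	sc := bufio.NewScanner(bytes.NewReader(content))
	for sc.Scan() {
		m := directiveRe.FindStringSubmatch(sc.Text())
		if m == nil {
			return false // comment, blank line, or instruction: stop looking
		}
		if strings.EqualFold(m[1], "syntax") {
			// Assumption: any frontend image whose name contains
			// "dockerfile" is treated as a Dockerfile frontend.
			return strings.Contains(strings.ToLower(m[2]), "dockerfile")
		}
	}
	return false
}
```

Since only the first few lines matter, a caller could pass just the head of the file (for example the first few hundred bytes) to keep the check cheap.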
As an option, we can use the Naive Bayes algorithm for language classification (dockerfile, k8s...). https://github.com/go-enry/go-enry
> We can detect Dockerfile by a syntax directive (covers a small percentage of all files):
> - https://github.com/search?q=%22%23+syntax%3Ddocker%2Fdockerfile%3A%22&type=code
> - https://github.com/search?q=%22%23+syntax%3Ddocker.io%2Fdocker%2Fdockerfile%22&type=code
This is nice but I don't think we can fully rely on this. Most users will not bother to set this as it's metadata that isn't essential to them but to us. We can certainly add support for this, in case users have it set.
> As an option, we can use the Naive Bayes algorithm for language classification (dockerfile, k8s...). https://github.com/go-enry/go-enry
This is interesting, but we should probably benchmark it before we implement it. The file analysis code needs to be simple and performant as it is in the critical path of code scanning.
> > We can detect Dockerfile by a syntax directive (covers a small percentage of all files):
> > - https://github.com/search?q=%22%23+syntax%3Ddocker%2Fdockerfile%3A%22&type=code
> > - https://github.com/search?q=%22%23+syntax%3Ddocker.io%2Fdocker%2Fdockerfile%22&type=code
>
> This is nice but I don't think we can fully rely on this. Most users will not bother to set this as it's metadata that isn't essential to them but to us. We can certainly add support for this, in case users have it set.
We can combine different detection methods (including heuristics). If users set the syntax directive, we can say with certainty that the file is a Dockerfile and skip the other methods.
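The combination could be sketched roughly as follows: the directive check runs first and short-circuits with certainty, and a cheap heuristic is consulted only when the directive is absent. The type `confidence`, the function `detectDockerfile`, and the particular heuristic (a `FROM` line plus at least one other common instruction) are all assumptions for illustration, not the existing detection logic.

```go
package main

import "regexp"

// confidence is a hypothetical result type for this sketch.
type confidence int

const (
	confNone    confidence = iota // nothing suggests a Dockerfile
	confLikely                    // heuristics matched
	confCertain                   // syntax directive present
)

var (
	// Directive on the first line only; a simplification for this sketch.
	syntaxRe = regexp.MustCompile(`(?i)^#\s*syntax\s*=\s*\S*dockerfile`)
	// Heuristics: Dockerfile instructions are case-insensitive and start a line.
	fromRe  = regexp.MustCompile(`(?im)^\s*FROM\s+\S+`)
	instrRe = regexp.MustCompile(`(?im)^\s*(RUN|COPY|ADD|CMD|ENTRYPOINT|ENV|WORKDIR|EXPOSE)\s+\S`)
)

// detectDockerfile tries the most certain method first and falls back to
// heuristics only when the directive is missing.
func detectDockerfile(content []byte) confidence {
	if syntaxRe.Match(content) {
		return confCertain // no need to run the other methods
	}
	if fromRe.Match(content) && instrRe.Match(content) {
		return confLikely
	}
	return confNone
}
```

Ordering the checks from most to least certain keeps the common path short, which matters because this code sits in the critical path of scanning.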
> > As an option, we can use the Naive Bayes algorithm for language classification (dockerfile, k8s...). https://github.com/go-enry/go-enry
>
> This is interesting, but we should probably benchmark it before we implement it. The file analysis code needs to be simple and performant as it is in the critical path of code scanning.
go-enry contains a tokeniser that is written in C and runs quite fast.
We have done our utmost to avoid relying on CGO. One of the reasons people like Trivy is its easy installation, and we cannot give that up for performance, although performance is also important. We don't need to support that many file types, so we may want to write a simplified version of go-enry.
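To give a sense of scale, a CGO-free Naive Bayes classifier over a handful of file types fits in a few dozen lines of pure Go. The sketch below is a generic multinomial Naive Bayes with add-one smoothing, not go-enry's implementation; the names `classifier`, `train`, and `classify`, the tokenizer, and the training samples in the usage note are all hypothetical.

```go
package main

import (
	"math"
	"strings"
)

// classifier is a tiny multinomial Naive Bayes model (hypothetical type).
type classifier struct {
	tokenCounts map[string]map[string]int // class -> token -> count
	totalTokens map[string]int            // class -> number of tokens seen
	docs        map[string]int            // class -> number of samples
	totalDocs   int
	vocab       map[string]struct{}
}

func newClassifier() *classifier {
	return &classifier{
		tokenCounts: map[string]map[string]int{},
		totalTokens: map[string]int{},
		docs:        map[string]int{},
		vocab:       map[string]struct{}{},
	}
}

// tokenize lowercases s and splits it on anything that is not a-z or 0-9.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !(r >= 'a' && r <= 'z') && !(r >= '0' && r <= '9')
	})
}

// train adds one labelled sample to the model.
func (c *classifier) train(class, sample string) {
	if c.tokenCounts[class] == nil {
		c.tokenCounts[class] = map[string]int{}
	}
	c.docs[class]++
	c.totalDocs++
	for _, t := range tokenize(sample) {
		c.tokenCounts[class][t]++
		c.totalTokens[class]++
		c.vocab[t] = struct{}{}
	}
}

// classify returns the class with the highest log-posterior, using add-one
// (Laplace) smoothing. It returns "" if the model has not been trained.
func (c *classifier) classify(content string) string {
	best, bestScore := "", math.Inf(-1)
	tokens := tokenize(content)
	v := float64(len(c.vocab))
	for class, counts := range c.tokenCounts {
		score := math.Log(float64(c.docs[class]) / float64(c.totalDocs))
		for _, t := range tokens {
			score += math.Log((float64(counts[t]) + 1) / (float64(c.totalTokens[class]) + v))
		}
		if score > bestScore {
			best, bestScore = class, score
		}
	}
	return best
}
```

A possible usage: call `train("dockerfile", ...)` and `train("kubernetes", ...)` with a few representative samples, then call `classify` on the head of each candidate file. Whether this is accurate and fast enough would still need the benchmarking mentioned above.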