trufflehog
[feat] - Optimize detector performance by reducing data passed to regex
Description:
This PR introduces optimizations to improve the performance of detectors by reducing the amount of data passed to the regex within the `FromData` method. The changes leverage the knowledge of keyword positions to extract the relevant portions of the chunk data, where the secret is likely to reside.
Key changes:
- Introduced a `DetectorMatch` struct to represent a detected pattern's metadata, including the detector key, detector instance, and a slice of `matchSpan` structs representing the start and end offsets of matched keywords within the chunk.
- Added a `matchSpan` struct to represent a single occurrence of a matched keyword, containing the start and end byte offsets within the chunk.
- Implemented a `Matches` method on `DetectorMatch` to extract the relevant portions of the chunk data based on the start and end positions of each match. The end position is determined by taking the minimum of the keyword position + `maxMatchLength` (set to 300) and the length of the chunk data.
- Introduced the `FindDetectorMatches` function (previously `PopulateMatchingDetectors`) to return a slice of `DetectorMatch` instances, each containing the detector key, detector, and a slice of matches. Adjacent or overlapping matches are merged using the `mergeMatches` function to avoid duplicating or overlapping the matched portions of the chunk data.
- Updated the detection logic to use the `Matches` method of `DetectorMatch` to extract the relevant portions of the chunk data before passing them to the detector's `FromData` method.
- Introduced a `MaxSecretSizeProvider` interface that detectors can optionally implement to provide a custom maximum size for the secrets they detect. The interface includes a single method, `ProvideMaxSecretSize() int64`, that returns the maximum size of the secret the detector expects to find.
- As part of the `FindDetectorMatches` function, it checks whether a detector implements the `MaxSecretSizeProvider` interface. If implemented, the `ProvideMaxSecretSize` method is called to obtain the detector-specific maximum secret size, which is used to determine the end position of the match span. If the interface is not implemented, the default `maxMatchLength` constant is used.
- Implemented the `MaxSecretSizeProvider` interface in the relevant detectors (`PrivateKeyDetector`, `GCPDetector`, and `GCPApplicationDefaultCredentialsDetector`) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface... I just can't think of them right now 😞
The optimization is based on the assumption that most secrets shouldn't exceed a certain length from the keyword's position. By default, the `maxMatchLength` constant is set to 300 characters. However, detectors that require a larger or smaller max size can implement the `MaxSecretSizeProvider` interface and provide their own value through the `ProvideMaxSecretSize` method.
These changes significantly reduce the amount of data the regex within `FromData` has to process, leading to improved detector performance while still ensuring accurate secret detection. The introduction of the `MaxSecretSizeProvider` interface allows for flexibility in handling different secret sizes based on the specific requirements of each detector.
Sequence Diagram
Benchmarks
Benchmark assessing the performance of `FromData` with verification disabled across various chunk sizes.
- orange: old chunk size (10kB)
- green: new chunk size (512B, an overestimate for most detectors)
Checklist:
- [ ] Tests passing (`make test-community`)?
- [ ] Lint passing (`make lint`)? (this requires golangci-lint)
> The end position is determined by taking the minimum of the keyword position + `maxMatchLength` (set to 300) and the length of the chunk data.
>
> ... 8. Implemented the `MaxSecretSizeProvider` interface in the relevant detectors (`PrivateKeyDetector`, `GCPDetector`, and `GCPApplicationDefaultCredentialsDetector`) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface... i just can't think of them right now 😞
While I think this is a good idea, it would be a mistake to set a default max length. I can think of several detectors that can easily exceed this — JWT, private key, GCP, and Docker (#2677) to name a few. Because there are hundreds of detectors, the safer approach would be to make this opt-in.
This would also interfere with detectors that require multiple parts (e.g., client ID & secret, username & password, secret & URL).
Edit: Incidentally, this would likely solve #2739.
I think we can increase the default to 1024 bytes for the multi-part credential case. Should be adequate in most cases. I'd like to see this optimization on by default, but we can provide an opt-out flag.
1024 bytes would be safer, but could definitely still miss valid secrets.
I think there'd be tremendous value in at least having a (hidden?) flag that runs both and checks for missed results (kind of like https://github.com/github/scientist), rather than binary on/off. Otherwise it can be tricky to identify affected detectors, which has been an issue with the verification overlap change.
@rgmz, this idea is implemented in https://github.com/trufflesecurity/trufflehog/pull/2918. Great suggestion!