
[feat] - Optimize detector performance by reducing data passed to regex

Open ahrav opened this issue 9 months ago • 3 comments

Description:

This PR introduces optimizations to improve the performance of detectors by reducing the amount of data passed to the regex within the FromData method. The changes leverage the knowledge of keyword positions to extract relevant portions of the chunk data, where the secret is likely to reside.

Key changes:

  1. Introduced DetectorMatch struct to represent a detected pattern’s metadata, including the detector key, detector instance, and a slice of matchSpan structs representing the start and end offsets of matched keywords within the chunk.

  2. Added matchSpan struct to represent a single occurrence of a matched keyword, containing the start and end byte offsets within the chunk.

  3. Implemented Matches method for DetectorMatch to extract the relevant portions of the chunk data based on the start and end positions of each match. The end position is determined by taking the minimum of the keyword position + maxMatchLength (set to 300) and the length of the chunk data.

  4. Introduced FindDetectorMatches function (previously PopulateMatchingDetectors) to return a slice of DetectorMatch instances, each containing the detector key, detector, and a slice of matches. Adjacent or overlapping matches are merged using the mergeMatches function to avoid duplicating or overlapping the matched portions of the chunk data.

  5. Updated the detection logic to use the Matches method of DetectorMatch to extract the relevant portions of the chunk data before passing them to the FromData method of the detector.

  6. Introduced MaxSecretSizeProvider interface that detectors can optionally implement to provide a custom maximum size for the secrets they detect. The interface includes a single method ProvideMaxSecretSize() int64 that returns the maximum size of the secret the detector expects to find.

  7. As part of the FindDetectorMatches function, it checks if a detector implements the MaxSecretSizeProvider interface. If implemented, the ProvideMaxSecretSize method is called to obtain the detector-specific maximum secret size, which is used to determine the end position of the match span. If the interface is not implemented, the default maxMatchLength constant is used.

  8. Implemented the MaxSecretSizeProvider interface in the relevant detectors (PrivateKeyDetector, GCPDetector, and GCPApplicationDefaultCredentialsDetector) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface; I just can't think of them right now 😞
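To make the mechanics of items 1–4 concrete, here is a minimal sketch of what the span-extraction types might look like. The names (`matchSpan`, `DetectorMatch`, `Matches`, `mergeMatches`, `maxMatchLength`) come from the PR description, but the implementations below are illustrative, not the actual TruffleHog code; the detector key and instance fields are elided.

```go
package main

import "fmt"

const maxMatchLength = 300 // default window extending past a keyword hit

// matchSpan marks one keyword occurrence's byte range within a chunk.
type matchSpan struct {
	startOffset int64
	endOffset   int64
}

// DetectorMatch pairs a detector with the keyword spans found for it.
type DetectorMatch struct {
	spans []matchSpan
}

// mergeMatches collapses adjacent or overlapping spans (assumed sorted by
// startOffset) so the same bytes are never extracted twice.
func mergeMatches(spans []matchSpan) []matchSpan {
	if len(spans) <= 1 {
		return spans
	}
	merged := []matchSpan{spans[0]}
	for _, s := range spans[1:] {
		last := &merged[len(merged)-1]
		if s.startOffset <= last.endOffset {
			// Overlapping or adjacent: extend the previous span.
			if s.endOffset > last.endOffset {
				last.endOffset = s.endOffset
			}
			continue
		}
		merged = append(merged, s)
	}
	return merged
}

// Matches extracts the chunk slices covered by each merged span, clamping
// the end position to the chunk length.
func (d *DetectorMatch) Matches(chunkData []byte) [][]byte {
	out := make([][]byte, 0, len(d.spans))
	for _, s := range d.spans {
		end := s.endOffset
		if end > int64(len(chunkData)) {
			end = int64(len(chunkData))
		}
		out = append(out, chunkData[s.startOffset:end])
	}
	return out
}

func main() {
	// Keyword hits at offsets 0, 250, and 600; each span extends maxMatchLength bytes.
	var raw []matchSpan
	for _, k := range []int64{0, 250, 600} {
		raw = append(raw, matchSpan{k, k + maxMatchLength})
	}
	spans := mergeMatches(raw)
	fmt.Println(len(spans)) // first two spans overlap, so they merge: prints 2
	dm := &DetectorMatch{spans: spans}
	for _, m := range dm.Matches(make([]byte, 700)) {
		fmt.Println(len(m)) // 550, then 100 (second span clamped to chunk length)
	}
}
```

Only the merged slices, rather than the whole chunk, are then handed to each detector's FromData.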

The optimization is based on the assumption that most secrets shouldn’t exceed a certain length from the keyword’s position. By default, the maxMatchLength constant is set to 300 characters. However, detectors that require a larger or smaller max size can implement the MaxSecretSizeProvider interface and provide their own value through the ProvideMaxSecretSize method.
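The interface lookup described above can be sketched as follows. The interface and method names come from the PR description; the example detector and its 4096-byte value are hypothetical stand-ins, not the values used in the actual PR.

```go
package main

import "fmt"

const maxMatchLength = 300 // default when a detector provides no override

// MaxSecretSizeProvider is optionally implemented by detectors whose
// secrets are larger or smaller than the default window.
type MaxSecretSizeProvider interface {
	ProvideMaxSecretSize() int64
}

// privateKeyDetector is a hypothetical stand-in for a detector whose
// secrets (e.g. PEM blocks) routinely exceed 300 bytes.
type privateKeyDetector struct{}

func (privateKeyDetector) ProvideMaxSecretSize() int64 { return 4096 } // illustrative value

// maxSecretSizeFor returns the span length to use for a detector,
// falling back to the default when the interface is not implemented.
func maxSecretSizeFor(detector any) int64 {
	if p, ok := detector.(MaxSecretSizeProvider); ok {
		return p.ProvideMaxSecretSize()
	}
	return maxMatchLength
}

func main() {
	fmt.Println(maxSecretSizeFor(privateKeyDetector{})) // 4096
	fmt.Println(maxSecretSizeFor(struct{}{}))           // 300
}
```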

These changes significantly reduce the amount of data the regex within FromData has to process, leading to improved detector performance while still ensuring accurate secret detection. The introduction of the MaxSecretSizeProvider interface allows for flexibility in handling different secret sizes based on the specific requirements of each detector.

Sequence Diagram

[Sequence diagram image: optimized engine regex, 2024-05-09]

Benchmarks

Benchmark assessing the performance of FromData with verification disabled across various chunk sizes.

Orange: old chunk size (10kB). Green: new chunk size (512B, an overestimate for most detectors). [Benchmark screenshot, 2024-05-08]

Checklist:

  • [ ] Tests passing (make test-community)?
  • [ ] Lint passing (make lint; requires golangci-lint)?

ahrav avatar May 09 '24 01:05 ahrav

> The end position is determined by taking the minimum of the keyword position + maxMatchLength (set to 300) and the length of the chunk data.
>
> ... 8. Implemented the MaxSecretSizeProvider interface in the relevant detectors (PrivateKeyDetector, GCPDetector, and GCPApplicationDefaultCredentialsDetector) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface... i just can't think of them right now 😞

While I think this is a good idea, it would be a mistake to set a default max length. I can think of several detectors that can easily exceed this — JWT, private key, GCP, and Docker (#2677) to name a few. Because there are hundreds of detectors, the safer approach would be to make this opt-in.

This would also interfere with detectors that require multiple parts (e.g., client ID & secret, username & password, secret & URL).


Edit: Incidentally, this would likely solve #2739.

rgmz avatar May 09 '24 01:05 rgmz

> The end position is determined by taking the minimum of the keyword position + maxMatchLength (set to 300) and the length of the chunk data.
>
> ... 8. Implemented the MaxSecretSizeProvider interface in the relevant detectors (PrivateKeyDetector, GCPDetector, and GCPApplicationDefaultCredentialsDetector) and provided appropriate values for the maximum secret size based on the expected size of the secrets they detect. I might be missing some detectors that should implement this interface... i just can't think of them right now 😞
>
> While I think this is a good idea, it would be a mistake to set a default max length. I can think of several detectors that can easily exceed this — JWT, private key, GCP, and Docker (#2677) to name a few. Because there are hundreds of detectors, the safer approach would be to make this opt-in.
>
> This would also interfere with detectors that require multiple parts (e.g., client ID & secret, username & password, secret & URL).
>
> Edit: Incidentally, this would likely solve #2739.

I think we can increase the default to 1024 bytes for the multi-part credential case. Should be adequate in most cases. I'd like to see this optimization on by default, but we can provide an opt-out flag.

dustin-decker avatar May 16 '24 21:05 dustin-decker

> I think we can increase the default to 1024 bytes for the multi-part credential case. Should be adequate in most cases. I'd like to see this optimization on by default, but we can provide an opt-out flag.

1024 bytes would be safer, but could definitely still miss valid secrets.

I think there'd be tremendous value in at least having a (hidden?) flag that runs both and checks for missed results (kind of like https://github.com/github/scientist), rather than binary on/off. Otherwise it can be tricky to identify affected detectors, which has been an issue with the verification overlap change.
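The dual-run experiment suggested here could be sketched roughly as below. This is entirely hypothetical code (names like `compareScans` are invented for illustration, and the scan functions are stubs); the idea, as in github/scientist, is to run the full-chunk scan as the control and the trimmed-span scan as the candidate, then report anything the candidate missed.

```go
package main

import "fmt"

// scanFn stands in for running a detector's FromData over some input.
type scanFn func(data []byte) []string

// compareScans runs the control (full chunk) and candidate (trimmed spans)
// side by side and returns any secrets found only by the control, which
// would identify detectors affected by the span-trimming optimization.
func compareScans(control, candidate scanFn, chunk []byte, spans [][]byte) (missed []string) {
	found := map[string]bool{}
	for _, span := range spans {
		for _, s := range candidate(span) {
			found[s] = true
		}
	}
	for _, s := range control(chunk) {
		if !found[s] {
			missed = append(missed, s)
		}
	}
	return missed
}

func main() {
	// Stub scanners: the candidate misses one secret the control finds.
	control := func(data []byte) []string { return []string{"secret-a", "secret-b"} }
	candidate := func(data []byte) []string { return []string{"secret-a"} }
	missed := compareScans(control, candidate, nil, [][]byte{nil})
	fmt.Println(missed) // [secret-b]
}
```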

rgmz avatar May 16 '24 22:05 rgmz

> I think we can increase the default to 1024 bytes for the multi-part credential case. Should be adequate in most cases. I'd like to see this optimization on by default, but we can provide an opt-out flag.
>
> 1024 bytes would be safer, but could definitely still miss valid secrets.
>
> I think there'd be tremendous value in at least having a (hidden?) flag that runs both and checks for missed results (kind of like https://github.com/github/scientist), rather than binary on/off. Otherwise it can be tricky to identify affected detectors, which has been an issue with the verification overlap change.

@rgmz, this idea is implemented in https://github.com/trufflesecurity/trufflehog/pull/2918 Great suggestion!

dustin-decker avatar Jun 05 '24 11:06 dustin-decker