deduplication is not working

Open phuget opened this issue 7 months ago • 3 comments

Hey,

I have a problem with deduplication. I use the trivy-dojo-report-operator to import my reports into DefectDojo, but I keep getting clones of vulnerabilities that differ only in creation time and description.

I enabled deduplication in DefectDojo and set the max number of duplicates to 0. I think the issue could be the description field. It contains our resource name, which ends with a hash that changes every time we deploy. I already tried changing the deduplication algorithm, but nothing has worked so far. Is there a workaround?

I looked into the logs of the deployed DefectDojo pods but didn't see any errors.

Here are the values of one of the findings that have not been recognized as duplicates:

Title: CVE-2024-7254 com.google.protobuf:protobuf-java 3.25.4 (same for both)
Product name: Testrun (same for both)
Service name: Testrun (same for both)
Component Version: 3.25.4 (same for both)
Component Name: com.google.protobuf:protobuf-java (same for both)
Vulnerability Ids: CVE-2024-7254 (same for both)
Severity: high (same for both)
Description:
      protobuf: StackOverflow vulnerability in Protocol Buffers (same for both)
      Fixed version: 3.25.5, 4.27.5, 4.28.2 (same for both)
      container.name: Testrun (same for both)
      resource.kind: ReplicaSet (same for both)
      resource.name: Testrun-5b66c55585 (the hash differs between the two)
      resource.namespace: dev (same for both)

Defect-Dojo-Django Version Docker: 2.42.0-alpine Helm Version: 1.6.183

phuget avatar Apr 28 '25 08:04 phuget

The default dedupe config for the Trivy Operator parser is:

"Trivy Operator Scan": ["title", "severity", "vulnerability_ids", "description"],

And you can recalculate the hash_codes via:

docker compose exec uwsgi /bin/bash -c "python manage.py dedupe --parser 'Trivy Operator Scan' --hash_code_only"

valentijnscholten avatar Apr 28 '25 15:04 valentijnscholten

Thanks @valentijnscholten, I'm a colleague of phuget. This seems to be working. I had actually found this before your reply by reading through other issues on GitHub and following linked markdown files. Might I suggest adding this information to the official documentation, in the deduplication section here: https://docs.defectdojo.com/en/working_with_findings/finding_deduplication/about_deduplication/

We had trouble understanding what parsers do, how they are connected to Tests, and how hash codes are involved. It was not obvious that the key used for each parser corresponds to the "Test Type". I assumed it was a typo, since spaces in the keys of key-value mappings are rare. We configured the HASHCODE_FIELDS_PER_SCANNER value for "Trivy Operator Scan" without the "description" field and regenerated the hash_codes.
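To illustrate why dropping "description" helps, here is a minimal sketch of hashing a configurable subset of fields. It is not DefectDojo's actual implementation: the `sketch_hash_code` and `make_finding` helpers, the separator, and the second resource-name suffix are made up for illustration. Only the two field lists and the sample values come from this thread.

```python
import hashlib

# Field list quoted in this thread for "Trivy Operator Scan",
# and the same list with "description" removed.
DEFAULT_FIELDS = ["title", "severity", "vulnerability_ids", "description"]
REDUCED_FIELDS = ["title", "severity", "vulnerability_ids"]


def sketch_hash_code(finding: dict, fields: list) -> str:
    """Hypothetical helper: hash only the selected fields of a finding.

    An illustration of field-based hashing, NOT DefectDojo's real code.
    """
    joined = "|".join(str(finding[name]) for name in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def make_finding(resource_name: str) -> dict:
    """Build a sample finding whose description embeds the resource name."""
    return {
        "title": "CVE-2024-7254 com.google.protobuf:protobuf-java 3.25.4",
        "severity": "high",
        "vulnerability_ids": "CVE-2024-7254",
        "description": f"resource.kind: ReplicaSet\nresource.name: {resource_name}",
    }


first = make_finding("Testrun-5b66c55585")
second = make_finding("Testrun-7c9d8f6b4a")  # made-up suffix for a later deploy

# With "description" in the field list, the changing suffix changes the hash.
with_description_match = (
    sketch_hash_code(first, DEFAULT_FIELDS) == sketch_hash_code(second, DEFAULT_FIELDS)
)
# Without it, both findings hash to the same value.
without_description_match = (
    sketch_hash_code(first, REDUCED_FIELDS) == sketch_hash_code(second, REDUCED_FIELDS)
)

print("match with description:", with_description_match)
print("match without description:", without_description_match)
```

In this sketch the two findings only match once "description" is left out, which mirrors what we saw after changing the field list and regenerating the hash_codes.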

All of this was not mentioned or linked in the documentation linked above.

We found the information we needed in this document and the subsequent chapters: https://github.com/DefectDojo/django-DefectDojo/blob/master/docs/content/en/open_source/archived_docs/usage/features.md#deduplication-algorithms My problem with the location is that it is part of the "archived_docs" folder, where I would assume the information to be outdated.

All in all, we spent about 2-3 hours searching for this.

MPritsch avatar Apr 29 '25 06:04 MPritsch

copying in @paulOsinski

valentijnscholten avatar Apr 29 '25 06:04 valentijnscholten

@MPritsch How well does the new hash_code configuration for Trivy Operator Scan work?

Tuning the deduplication settings is a bit of a double-edged sword. It's nice to have the flexibility, but if all users change these settings it becomes hard to provide support, especially if users don't tell us they changed them.

But I do think it's a good idea to document it better, which is what we did in https://github.com/DefectDojo/django-DefectDojo/pull/13464

valentijnscholten avatar Oct 21 '25 05:10 valentijnscholten