django-DefectDojo icon indicating copy to clipboard operation
django-DefectDojo copied to clipboard

Deduplication for SARIF import

Open kdyck-cb opened this issue 3 years ago • 7 comments

DefectDojo's SARIF parser does not take into account values provided in the partialFingerprints property for deduplication. It would be nice if it did. The legacy deduplication algorithm used by the SARIF parser includes the line number of the finding, making it sensitive to unrelated code changes.

As a first cut, it would be helpful if the SARIF parser hashed all of the partialFingerprints into the finding's finding.unique_id_from_tool value and switch to using the DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL_OR_HASH_CODE algorithm (falling back on the legacy fields if no fingerprints are provided).

In a more sophisticated implementation, it would be helpful to be able to have the caller of the import API to specify which fingerprints in the SARIF file they would like to include in the hash, as we have seen that the quality of fingerprints varies within a single tool. For example, one tool provides two fingerprint for every result, one of which is sensitive to line numbers.

Some tools, contrary to the SARIF specification, put their fingerprinting information in the fingerprints property instead of the partialFingerprints one. It would be helpful if the SARIF parser could handle this case, too, but it isn't clear what the general solution for this might be.

The SARIF spec gives some guidelines for how to make use of fingerprinting information generated by tools in Appendix B. (Normative) Use of fingerprints by result management systems.

kdyck-cb avatar Jul 25 '22 13:07 kdyck-cb

@kdyck-cb I think you're right. The de-duplication should use data from the fingerprints properties, will take a look to how to do that. The only pb with this is that some tools doesn't provide fingerprints it's not aligned with how Dojo works :/ One way to do it could be to use the finger printing feature when it's here and set attribute finding.unique_id_from_tool This way, the administrator can customize by tools in the settings directly.

damiencarol avatar Jul 25 '22 14:07 damiencarol

@kdyck-cb I'm doing some experiments today. but I can say that the data model of SARIF and the one for dojo is not aligned. So your idea of putting a hash of Fingerprints data in unique_id_from_tool make sense.

damiencarol avatar Aug 10 '22 08:08 damiencarol

One other bit of complexity that might be worth keeping in mind here is that those fingerprint properties in the SARIF format are meant to be versioned. So the default behaviour we probably want is to include only the latest version of each fingerprint in that hash.

kdyck-cb avatar Aug 10 '22 11:08 kdyck-cb

Exactly what I through, this what I implemented. I tested a lot of tools (more than 12) and I didn't found a report with multiple version.

damiencarol avatar Aug 10 '22 16:08 damiencarol

@kdyck-cb FYI: #6687

damiencarol avatar Aug 10 '22 16:08 damiencarol

I left a comment on the PR (a little late, it seems, as it has already been merged). It looks like the implementation is making a couple assumptions when parsing the hierarchical key strings that may not be true in all situations. I suppose we can wait and see what happens in practice.

kdyck-cb avatar Aug 12 '22 11:08 kdyck-cb

Will read your comments but I can share that there is unit tests with reports from more than 10 different tools. So it should cover most of real data use cases.

damiencarol avatar Aug 13 '22 14:08 damiencarol

I guess this can be closed as well @mtesauro

manuel-sommer avatar Jan 15 '24 20:01 manuel-sommer