django-DefectDojo
django-DefectDojo copied to clipboard
Deduplication for SARIF import
DefectDojo's SARIF parser does not take into account values provided in the partialFingerprints property for deduplication. It would be nice if it did. The legacy deduplication algorithm used by the SARIF parser includes the line number of the finding, making it sensitive to unrelated code changes.
As a first cut, it would be helpful if the SARIF parser hashed all of the partialFingerprints into the finding's finding.unique_id_from_tool value and switch to using the DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL_OR_HASH_CODE algorithm (falling back on the legacy fields if no fingerprints are provided).
In a more sophisticated implementation, it would be helpful to be able to have the caller of the import API to specify which fingerprints in the SARIF file they would like to include in the hash, as we have seen that the quality of fingerprints varies within a single tool. For example, one tool provides two fingerprint for every result, one of which is sensitive to line numbers.
Some tools, contrary to the SARIF specification, put their fingerprinting information in the fingerprints property instead of the partialFingerprints one. It would be helpful if the SARIF parser could handle this case, too, but it isn't clear what the general solution for this might be.
The SARIF spec gives some guidelines for how to make use of fingerprinting information generated by tools in Appendix B. (Normative) Use of fingerprints by result management systems.
@kdyck-cb I think you're right. The de-duplication should use data from the fingerprints properties, will take a look to how to do that.
The only pb with this is that some tools doesn't provide fingerprints it's not aligned with how Dojo works :/
One way to do it could be to use the finger printing feature when it's here and set attribute finding.unique_id_from_tool
This way, the administrator can customize by tools in the settings directly.
@kdyck-cb I'm doing some experiments today. but I can say that the data model of SARIF and the one for dojo is not aligned.
So your idea of putting a hash of Fingerprints data in unique_id_from_tool make sense.
One other bit of complexity that might be worth keeping in mind here is that those fingerprint properties in the SARIF format are meant to be versioned. So the default behaviour we probably want is to include only the latest version of each fingerprint in that hash.
Exactly what I through, this what I implemented. I tested a lot of tools (more than 12) and I didn't found a report with multiple version.
@kdyck-cb FYI: #6687
I left a comment on the PR (a little late, it seems, as it has already been merged). It looks like the implementation is making a couple assumptions when parsing the hierarchical key strings that may not be true in all situations. I suppose we can wait and see what happens in practice.
Will read your comments but I can share that there is unit tests with reports from more than 10 different tools. So it should cover most of real data use cases.
I guess this can be closed as well @mtesauro