scancode.io
Update D2D purldb matching to pick best match
In the D2D pipeline, when we match a Resource or Directory against purldb, we return every package that matches it. This should be improved so that we choose the best package match instead, which would cut down on extraneous results and make the results more accurate.
After talking with @pombredanne, we should start by figuring out how to pick the best package match when we are matching package archives. One heuristic would be to treat the oldest package that was matched to an archive as the definitive match.
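As a rough illustration of the archive heuristic, here is a minimal sketch that keeps only the oldest match. It assumes each match is a dict with `purl` and `release_date` keys. Those field names, the function name, and the example purls and dates are all hypothetical and are not taken from the purldb API.

```python
from datetime import date


def pick_oldest_match(matches):
    """Return the match with the earliest release date, or None.

    Assumption: each match is a dict with hypothetical "purl" and
    "release_date" keys. Matches without a date are ignored because
    they cannot be ordered.
    """
    dated = [m for m in matches if m.get("release_date") is not None]
    if not dated:
        return None
    return min(dated, key=lambda m: m["release_date"])


# Made-up data, for illustration only.
ARCHIVE_MATCHES = [
    {"purl": "pkg:maven/example.repackaged/lib@1.0", "release_date": date(2015, 6, 1)},
    {"purl": "pkg:maven/example/lib@1.0", "release_date": date(2006, 4, 22)},
    {"purl": "pkg:maven/example.unknown/lib@1.0", "release_date": None},
]

best_archive_match = pick_oldest_match(ARCHIVE_MATCHES)
```

When several matches share the earliest date, `min` returns the first one it encounters, so a real implementation would probably want an explicit tie-breaker.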
For individual class files, when we get multiple matches on a class file, we should consider the length (number of classpath segments) of the package namespace. The longer the Maven package namespace, the more likely it is that we have run into a package that has been repackaged. For example, say we have a class file that was matched to these two packages:
- https://repo1.maven.org/maven2/org/apache/axis/axis/1.4/axis-1.4.jar
  - This would have a purl of pkg:maven/org.apache.axis/axis@1.4
- https://repo1.maven.org/maven2/com/liferay/org.apache.axis/1.4.LIFERAY-PATCHED-7/org.apache.axis-1.4.LIFERAY-PATCHED-7.jar
  - This would have a purl of pkg:maven/com.liferay.org.apache.axis/[email protected]

We would consider pkg:maven/org.apache.axis/axis@1.4 to be the best match for the class file over pkg:maven/com.liferay.org.apache.axis/[email protected], since its namespace has fewer segments.
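The namespace-length heuristic for class files could be sketched as follows. This is only an illustration. It parses purls with plain string operations rather than a purl library, the function names are hypothetical, the first purl is written as `axis@1.4` as inferred from its download URL, and the name and version in the second purl are placeholders.

```python
def maven_namespace(purl):
    """Extract the namespace from a purl such as pkg:maven/<ns>/<name>@<version>.

    Simplified parsing for illustration: qualifiers and subpaths are
    dropped, and everything between the type and the final
    name@version segment is treated as the namespace.
    """
    path = purl.split("?", 1)[0].split("#", 1)[0]
    _scheme, _, rest = path.partition(":")
    segments = rest.strip("/").split("/")
    return "/".join(segments[1:-1])


def namespace_segment_count(purl):
    """Count the dot-separated segments of the namespace."""
    namespace = maven_namespace(purl)
    return len(namespace.split(".")) if namespace else 0


def pick_shortest_namespace(purls):
    """Prefer the purl whose namespace has the fewest segments."""
    return min(purls, key=namespace_segment_count)


CLASS_FILE_MATCHES = [
    # Placeholder name and version; only the namespace matters here.
    "pkg:maven/com.liferay.org.apache.axis/placeholder-name@0.0.0",
    "pkg:maven/org.apache.axis/axis@1.4",
]

best_class_match = pick_shortest_namespace(CLASS_FILE_MATCHES)
```

Here `org.apache.axis` has three segments and `com.liferay.org.apache.axis` has five, so the first package is preferred. Ties fall back to input order, which a real implementation might replace with a second criterion such as package age.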