scancode.io

Multi-level matching and lookup for package data

Open pombredanne opened this issue 1 year ago • 0 comments

Assuming that there are multiple sources of curated, corrected, or reviewed package data, I would like to have a pipeline that works with both the PurlDB and these other sources of curated data.

  • A first pipeline step would scan with ScanCode TK or match packages against the PurlDB.
  • A second step would then look up each PURL in a source of curated package data and, if the PURL is found, replace the package data with the curated data.
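The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: packages are plain dicts keyed by `purl`, the curated source is an in-memory mapping, and the names `CURATED`, `lookup_curated` and `apply_curations` are hypothetical, not existing ScanCode.io or PurlDB APIs.

```python
# Hypothetical curated source: maps a PURL string to curated package data.
# In practice this could be backed by ABOUT files or another service.
CURATED = {
    "pkg:maven/antlr/antlr@2.7.5": {
        "purl": "pkg:maven/antlr/antlr@2.7.5",
        "declared_license_expression": "antlr-pd",
    },
}


def lookup_curated(purl, source):
    """Return the curated package data for this PURL, or None if absent."""
    return source.get(purl)


def apply_curations(packages, source):
    """
    Second pipeline step: for each package found by the first step (scan or
    PurlDB match), replace its data with the curated data when the PURL is
    found in the curated source. Packages without a curation are kept as-is.
    """
    curated_packages = []
    for package in packages:
        curated = lookup_curated(package["purl"], source)
        if curated is None:
            curated_packages.append(dict(package))
        else:
            curated_packages.append(dict(curated))
    return curated_packages


# Output of the (assumed) first step: package data from a scan or PurlDB match.
first_step_packages = [
    {"purl": "pkg:maven/antlr/antlr@2.7.5", "declared_license_expression": None},
    {"purl": "pkg:pypi/example@1.0", "declared_license_expression": "mit"},
]

final_packages = apply_curations(first_step_packages, CURATED)
```

Here the curated record replaces the matched record wholesale, as described in the second step; a finer-grained merge is a separate design choice.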

It may be possible to invert the steps, for instance if the curated data source supports some form of matching. When using ABOUT files as a source, we can match based on paths or checksums. In this case, it may be useful to perform this step first, before matching against the PurlDB.
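A minimal sketch of the inverted order, assuming the curated source (for example, data loaded from ABOUT files) has already been indexed by SHA1 checksum and by path. The function name, the resource dict layout and the two indexes are assumptions for illustration only. Resources left unmatched would then go on to PurlDB matching.

```python
def match_curated_first(resources, curated_by_sha1, curated_by_path):
    """
    Try the curated source before the PurlDB: match each resource by
    checksum first, then by path. Return (matched, unmatched) where matched
    is a list of (resource path, curated data) pairs and unmatched is the
    list of resources that still need PurlDB matching.
    """
    matched = []
    unmatched = []
    for resource in resources:
        curated = curated_by_sha1.get(resource.get("sha1"))
        if curated is None:
            curated = curated_by_path.get(resource["path"])
        if curated is None:
            unmatched.append(resource)
        else:
            matched.append((resource["path"], curated))
    return matched, unmatched


# Hypothetical indexes built from a curated source such as ABOUT files.
by_sha1 = {"aaa111": {"name": "antlr", "license_expression": "antlr-pd"}}
by_path = {"lib/foo.jar": {"name": "foo", "license_expression": "mit"}}

resources = [
    {"path": "lib/antlr-2.7.5.jar", "sha1": "aaa111"},
    {"path": "lib/foo.jar", "sha1": "bbb222"},
    {"path": "lib/unknown.jar", "sha1": "ccc333"},
]

matched, unmatched = match_curated_first(resources, by_sha1, by_path)
```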

With this proposed approach, it will be possible to keep the PurlDB as a reference and the data source for matching and have a way to override, complete, correct or provide preferences (such as a license choice) from a curated data source.

As an example, the packages at https://repo1.maven.org/maven2/antlr/antlr/2.7.5/ have no license. An ABOUT file or another data source may have a proper license, such as antlr-pd, found from research. This approach would make it possible to access this data and fill in the gaps.
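Filling in gaps, as opposed to replacing a whole record, could look something like the sketch below. It assumes package data is a flat dict and that an empty or missing value counts as a gap. The `complete_package` helper and its `override` parameter are hypothetical, and the field names are used for illustration only.

```python
def complete_package(reference, curated, override=()):
    """
    Merge curated data into reference data (for example, from the PurlDB).
    By default only gaps are filled: a curated value is used when the
    reference value is missing or empty. Fields listed in `override` always
    take the curated value, which allows corrections or preferences.
    """
    merged = dict(reference)
    for key, value in curated.items():
        if key in override or not merged.get(key):
            merged[key] = value
    return merged


# Reference data with no license, as in the antlr 2.7.5 example.
reference = {
    "purl": "pkg:maven/antlr/antlr@2.7.5",
    "name": "antlr",
    "declared_license_expression": None,
}
# Curated data, for example from an ABOUT file, with a researched license.
curated = {"name": "ANTLR", "declared_license_expression": "antlr-pd"}

completed = complete_package(reference, curated)
corrected = complete_package(reference, curated, override=("name",))
```

With the default arguments the reference name is kept and only the missing license is filled in. Listing a field in `override` lets the curated source win for that field.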

pombredanne • Nov 01 '23 22:11