package-feeds Aggregation of distro and pkg data sets to create a searchable DB

Background: As more vulnerabilities continue to be discovered in packages and libraries that are present in various distributions, practitioners working across their organizations need a single place to query for a particular dependency, package, or other component and discover which distributions, and which versions of them, contain it (or vice versa).

Comparable Queries: The following tools or resources offer some of the functionality of the desired search tool (in order of best match to the use case described in the background):

Proposal: Ideally there'd be a single system which supports libraries.io and pkgs.org. pkgs.org API access requires a membership and may be worth OpenSSF funding in order to query both APIs and bring them into a single tool (or some other financial offset that would allow the pkgs.org API to be free). We should look to include as many distros as possible in this central tool.

Why this issue on package-feeds?
It is unclear which group is best placed to tackle this project. Since package-feeds appears to have some initial functionality, this issue is being submitted here as the best available match, either as a home for the project or as something it could extend.

For more information on the discussion that sparked this issue: https://openssf.slack.com/archives/C019M98JSHK/p1657119043352399

Jul 06 '22 16:07 TheFoxAtWork

Commented in the thread, but repeating here:

I think a solution which incorporates existing systems, rather than building a new package-finding system from scratch, is definitely the ideal. It would also enable us to cast the widest net in supporting many different platforms (both OS and language package managers). That said, imagining a tool which queries multiple sources, we'd want to be clear in the UI where information is being sourced. So if a result for a package comes from, say, libraries.io, the end-user should be informed.
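To make the provenance point concrete, here is a minimal sketch of a fan-out search that tags every result with the source it came from. Everything in it is an assumption for illustration: `PackageHit`, `search`, and the two stub backends are hypothetical names, and the stubs return made-up data rather than calling the real libraries.io or pkgs.org APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class PackageHit:
    """One search result, always carrying the name of the source that produced it."""
    name: str
    version: str
    ecosystem: str
    source: str


# Hypothetical backend shape: takes a package name, yields (ecosystem, version) pairs.
Backend = Callable[[str], Iterable[Tuple[str, str]]]


def search(name: str, backends: Dict[str, Backend]) -> List[PackageHit]:
    """Query every backend and label each hit with the backend's name."""
    hits: List[PackageHit] = []
    for source, lookup in backends.items():
        for ecosystem, version in lookup(name):
            hits.append(PackageHit(name, version, ecosystem, source))
    return hits


# Stub backends with made-up data; a real tool would call the services' APIs here.
def fake_libraries_io(name: str) -> List[Tuple[str, str]]:
    return [("pypi", "1.0.0")] if name == "example-pkg" else []


def fake_pkgs_org(name: str) -> List[Tuple[str, str]]:
    return [("debian", "1.0.0-2")] if name == "example-pkg" else []


BACKENDS: Dict[str, Backend] = {
    "libraries.io": fake_libraries_io,
    "pkgs.org": fake_pkgs_org,
}

if __name__ == "__main__":
    for hit in search("example-pkg", BACKENDS):
        print(f"{hit.name} {hit.version} ({hit.ecosystem}) via {hit.source}")
```

Because `source` is a required field on every hit, a UI built on top of this cannot display a result without knowing where it came from.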

Jul 06 '22 16:07 alilleybrinker

💯

Jul 06 '22 16:07 TheFoxAtWork

Only tangentially related, https://github.com/ossf/wg-securing-critical-projects/issues/41. There is some overlap with the component/threat intelligence elements in certain commercial vendors/offerings, so it'd be interesting to ask members and commercial entities more broadly about this, too. Also https://ossindex.sonatype.org/, https://deps.dev/, and here's a list of by-hash links I collected a while ago over at https://github.com/bureado/awesome-software-supply-chain-security#dependency-intelligence:

Online services that help understand what a specific dependency is, or at least whether it's known (usually feeding it a package identifier, such as purl, CPE or another form of ecosystem:name:version, or alternatively via hash):
- NSRL: hashes for COTS software, well-integrated in tooling from sleuthkit/hfind to nsrllookup
- A source that can be queried via a public API (HTTP and DNS!) and can be more open source-aware is CIRCL hashlookup
- Repology has legendary coverage for Linux packages across multiple distributions; its repology-updater and other infrastructure pieces are open source. It provides an updater for WikiData, which also has properties of interest for the supply chain security domain.
- Tidelift's libraries.io provides an API and supports over 30 package ecosystems
- WhiteSource's Unified Agent also offers some sophisticated file matching abilities
- The Software Heritage Project has massive ingestion capabilities and offers an API which can efficiently check whether a hash is known, and provide certain information on the file if so
- hashdd - Known Good Cryptographic Hashes
- ClearlyDefined provides licensing information for open source components, given their coordinates
- LGTM - Code Analysis Platform to Find and Prevent Vulnerabilities allows manually searching by GitHub repo
- Binary Transparency offers an API that allows searching packages by hash and other attributes
  - A somewhat related read is the second half of How Cloudflare verifies the code WhatsApp Web serves to users

It'd be good to model the query keys. Should we expect to pass a string, or a purl and it'll give us CPEs? Or a hash, or a filename and it gives us purls? Or will it help us normalize a partial search? See https://github.com/repology/repology-rules. And what kind of information about a package? For example, I don't think Repology would give us e.g., debtags or buildinfo files that we could bring in from dedicated Debian infrastructure, or even what the UDD does (I'm sure there are similar data sources for OBS, Koji, etc.)
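As a starting point for that modelling, here is a rough sketch that sorts a raw query string into one of the key kinds mentioned above. The names `KeyKind`, `QueryKey`, and `classify` are hypothetical, and the rules are deliberately simple heuristics: purls use the `pkg:` scheme, CPE strings start with `cpe:`, and a bare hex string of 32, 40, or 64 characters is assumed to be an MD5, SHA-1, or SHA-256 digest. Filenames and partial-name normalization are not handled here.

```python
import re
from dataclasses import dataclass
from enum import Enum


class KeyKind(Enum):
    PURL = "purl"
    CPE = "cpe"
    HASH = "hash"
    NAME = "name"  # fallback: free-text package name


@dataclass(frozen=True)
class QueryKey:
    kind: KeyKind
    value: str


# Hex digests of the lengths produced by MD5 (32), SHA-1 (40), and SHA-256 (64).
_HEX_DIGEST = re.compile(r"^(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64})$")


def classify(raw: str) -> QueryKey:
    """Guess what kind of key the user typed; heuristic, not a validator."""
    text = raw.strip()
    if text.startswith("pkg:"):
        return QueryKey(KeyKind.PURL, text)
    if text.startswith("cpe:"):
        return QueryKey(KeyKind.CPE, text)
    if _HEX_DIGEST.match(text):
        # Normalize digests to lowercase so lookups are case-insensitive.
        return QueryKey(KeyKind.HASH, text.lower())
    return QueryKey(KeyKind.NAME, text)


if __name__ == "__main__":
    for raw in ("pkg:deb/debian/curl@7.50.3-1", "openssl"):
        print(raw, "->", classify(raw).kind.value)
```

One known weakness of this heuristic is that a package whose name happens to be a 32-, 40-, or 64-character hex string would be treated as a hash, so a real tool would probably let the caller state the key kind explicitly.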

Edit: forgot https://artifacthub.io/docs/topics/repositories/

Jul 06 '22 23:07 bureado

If helpful, you're welcome to leverage the logic (or implementation) we built into https://github.com/Microsoft/OSSGadget, which handles at least some of this abstraction.

Jul 08 '22 17:07 scovetta
