
Briefing on dev dependency tracking

Open ccerv1 opened this issue 1 year ago • 1 comments

What is it?

How do we get all the dependency data and link it to projects?

ccerv1 avatar Oct 10 '24 20:10 ccerv1

@ccerv1 can you clarify the scope of this issue?

I think we still need an issue for GitHub repo => package dependencies. We only have package => package dependencies atm.

ryscheng avatar Oct 11 '24 14:10 ryscheng

Dependency Tracking Project Update

Objective

The main goal of this project is to create a comprehensive graph that shows:

  • Which developer tools are being used by Optimism projects (both onchain projects and other OSS tools)
  • When these projects started using these tools

From the graph, we can easily analyze downstream impact associated with these tools, including:

  • Number of dependents
  • Downstream gas fees
  • Downstream active developers
  • etc.
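As a rough illustration of how these downstream metrics could be rolled up from the graph, here is a minimal sketch. The edge list, project names, and metric values are all hypothetical placeholders, not OSO data or OSO's actual schema.

```python
from collections import defaultdict

# Hypothetical edges: (dependent project, tool it uses, date first seen using it).
EDGES = [
    ("project-a", "ethers", "2023-01-15"),
    ("project-b", "ethers", "2023-06-01"),
    ("project-b", "viem", "2024-02-10"),
    ("project-c", "viem", "2024-03-05"),
]

# Hypothetical per-project metrics to attribute to each upstream tool.
PROJECT_METRICS = {
    "project-a": {"gas_fees": 12.5, "active_devs": 3},
    "project-b": {"gas_fees": 4.0, "active_devs": 5},
    "project-c": {"gas_fees": 1.25, "active_devs": 2},
}


def downstream_impact(edges, project_metrics):
    """Aggregate dependent counts and downstream metrics per tool."""
    dependents = defaultdict(set)
    for project, tool, _first_seen in edges:
        dependents[tool].add(project)

    impact = {}
    for tool, projects in dependents.items():
        impact[tool] = {
            "num_dependents": len(projects),
            "gas_fees": sum(
                project_metrics.get(p, {}).get("gas_fees", 0.0) for p in projects
            ),
            # Simple sum: developers active in several projects are counted more than once.
            "active_devs": sum(
                project_metrics.get(p, {}).get("active_devs", 0) for p in projects
            ),
        }
    return impact


impact = downstream_impact(EDGES, PROJECT_METRICS)
```

Note that summing active developers per project will double count anyone who contributes to more than one dependent, so a real implementation would likely need developer-level identifiers.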

Requisite Data Sources

  1. deps.dev dataset:

    • Maintained by Google
    • Example: https://deps.dev/npm/ethers/6.13.4
    • Covers various package managers (NPM, Go, Maven, Python, Rust/Cargo)
    • Available in BigQuery (full dataset >40 terabytes)
  2. GitHub data:

    • Software Bill of Materials (SBOM) for relevant projects
    • Example: https://github.com/ethers-io/ethers.js/network/dependents
    • Shows which repositories import specific packages (e.g., Ethers) into their package.json or requirements
  3. Onchain project data:

    • List of projects identified as being part of Optimism (e.g., in OP Atlas)
    • See current collection here
    • Potential to expand to other relevant projects that haven't signed up for Retro Funding (e.g., all of OSO projects)
  4. NPM standard metrics:

    • Downloads
    • Releases
    • Example: https://npm-stat.com/charts.html?package=ethersjs&from=2023-10-15&to=2024-10-15

Progress

  1. Brought deps.dev into OSO

    • See PR here
    • Created a snapshot of the deps.dev dependency graph:
      • Focused on the last month of data; limited to first-level (direct) dependencies; currently only looking at NPM packages
      • Snapshot size: ~35 gigabytes
  2. Developed a preliminary deps.dev query to match Optimism projects with their dependencies:

    • Caveat: this only identifies projects that publish packages which depend on another package, so it doesn't capture applications that simply list the package in their own dependencies without publishing a package themselves
    • Ethers had 60 matches (out of 500+ Optimism projects on OSO)
    • Results include both web3-specific packages (e.g., Ethers, Viem, OpenZeppelin) and general-purpose packages (e.g., Axios, Lodash, React)
    • See spreadsheet here
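The matching logic and its caveat can be sketched roughly as follows. This is not the actual query; the project names, package names, and table shapes are hypothetical, and it is written in plain Python rather than BigQuery SQL purely for illustration.

```python
# Hypothetical mapping of OSO projects to the NPM packages they publish.
PROJECT_PACKAGES = {
    "project-a": ["@project-a/sdk"],
    "project-b": ["project-b-contracts", "project-b-ui"],
    "app-c": [],  # an application that uses ethers but publishes no package
}

# Hypothetical (package, direct dependency) rows, shaped like a dependency snapshot.
DIRECT_DEPS = [
    ("@project-a/sdk", "ethers"),
    ("@project-a/sdk", "axios"),
    ("project-b-contracts", "@openzeppelin/contracts"),
    ("project-b-ui", "ethers"),
    ("unrelated-pkg", "ethers"),
]


def match_projects_to_dependencies(project_packages, direct_deps):
    """Return {dependency: set of projects whose published packages depend on it}."""
    owner = {
        pkg: project
        for project, pkgs in project_packages.items()
        for pkg in pkgs
    }
    matches = {}
    for pkg, dep in direct_deps:
        project = owner.get(pkg)
        if project is not None:  # skip packages not owned by a known project
            matches.setdefault(dep, set()).add(project)
    return matches


matches = match_projects_to_dependencies(PROJECT_PACKAGES, DIRECT_DEPS)
```

In this sketch `app-c` never appears in `matches`, because it has no published package to join on. That is the gap the caveat describes.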

Next Steps

  1. Incorporate GitHub SBOM data:

    • Extract and process SBOM data for relevant projects
    • Identify which repos using specific packages (e.g., Ethers) are associated with the Optimism ecosystem
  2. Improve project identification:

    • Review the matching process to identify more potential projects
    • Consider including projects with active contracts that aren't yet in OSO or OP Atlas
  3. Deduplicate results:

    • Remove duplicate entries pointing to the same GitHub organization (e.g., Ethers and Ethers project)
  4. Integrate NPM metrics:

    • Incorporate download statistics, release information, and other relevant metrics from NPM
  5. Expand data sources:

    • Consider including Python and Rust packages in the analysis
    • Explore ways to efficiently query the full deps.dev dataset for more comprehensive results
  6. Metrics implementation:

    • Create a long list of potential metrics
    • Work with badgeholders to refine list of metrics
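For the deduplication step, one possible approach is to key each result on the GitHub organization parsed from its repo URL. The sketch below assumes each row carries a `repo_url` field. The rows and repo names are hypothetical, and keeping the first row per organization is an arbitrary choice.

```python
from urllib.parse import urlparse


def github_org(repo_url):
    """Extract the lowercase organization (first path segment) from a GitHub URL."""
    path = urlparse(repo_url).path.strip("/")
    return path.split("/")[0].lower() if path else ""


def dedupe_by_org(rows):
    """Keep the first row seen for each GitHub organization."""
    first_by_org = {}
    for row in rows:
        first_by_org.setdefault(github_org(row["repo_url"]), row)
    return list(first_by_org.values())


# Hypothetical rows: two entries that point at the same organization.
rows = [
    {"project": "Ethers", "repo_url": "https://github.com/ethers-io/ethers.js"},
    {"project": "Ethers project", "repo_url": "https://github.com/Ethers-io/other-repo"},
    {"project": "Viem", "repo_url": "https://github.com/wevm/viem"},
]
deduped = dedupe_by_org(rows)
```

Organization names are compared case-insensitively here, since GitHub treats them that way in URLs.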

ccerv1 avatar Oct 16 '24 12:10 ccerv1

Right so we have 2 pathways:

  1. We either try to integrate something like ecosyste.ms, that already has that data
  2. Or we try to get it ourselves from the GitHub API.

I think there's tractable implementation risk for both of these, but high execution risk for both.

For (1), the data may not be as high quality as we want, it might be missing repos, and the replication may not work (we've tried a couple of times already and it's just really big).

For (2), there's a higher implementation cost connecting DLT to this https://docs.github.com/en/rest/dependency-graph/sboms?apiVersion=2022-11-28#export-a-software-bill-of-materials-sbom-for-a-repository--fine-grained-access-tokens and similarly we don't know what's going to be on the other side. I've heard anecdotally that GitHub doesn't detect every edge case correctly.
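To make option (2) a bit more concrete, here is a minimal sketch of what processing one SBOM export might look like. It makes no network call. The sample payload is hand-written and heavily trimmed, and the field names (`sbom.packages[].externalRefs[].referenceLocator` holding a purl) reflect my understanding of the SPDX document GitHub returns, so they should be checked against the docs linked above. The function names are made up for illustration.

```python
def sbom_url(owner, repo):
    """Build the REST URL for exporting a repository's SBOM."""
    return f"https://api.github.com/repos/{owner}/{repo}/dependency-graph/sbom"


def purls_from_sbom(payload):
    """Collect package URLs (purls) from an SPDX-style SBOM payload."""
    purls = []
    for package in payload.get("sbom", {}).get("packages", []):
        for ref in package.get("externalRefs", []):
            if ref.get("referenceType") == "purl":
                purls.append(ref["referenceLocator"])
    return purls


def uses_package(payload, ecosystem, name):
    """True if the SBOM lists the given package in any version."""
    base = f"pkg:{ecosystem}/{name}"
    return any(p == base or p.startswith(base + "@") for p in purls_from_sbom(payload))


# Hand-written sample, assumed to resemble a trimmed API response.
SAMPLE = {
    "sbom": {
        "packages": [
            {
                "name": "ethers",
                "versionInfo": "6.13.4",
                "externalRefs": [
                    {
                        "referenceCategory": "PACKAGE-MANAGER",
                        "referenceType": "purl",
                        "referenceLocator": "pkg:npm/ethers@6.13.4",
                    }
                ],
            },
            {"name": "some-repo", "versionInfo": ""},  # entry with no purl
        ]
    }
}
```

Matching on the purl rather than the `name` field is a deliberate choice in this sketch, since the purl encodes the ecosystem explicitly.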

This is a good example of something where I would not be making representations that this is easy or for sure going to land in a set time frame. We have to start working on de-risking the unknown unknowns.

The nice thing about (2) is that we can tell RF applicants: hey, go look at your own GitHub insights and fix it if it's not being detected. GitHub has extensive docs on how you can submit your own dependencies to their database https://docs.github.com/en/code-security/supply-chain-security/understanding-your-software-supply-chain/using-the-dependency-submission-api

ryscheng avatar Oct 16 '24 15:10 ryscheng