Briefing on dev dependency tracking
What is it?
How do we get all the dependency data and link it to projects?
@ccerv1 can you clarify the scope of this issue?
I think we still need an issue for GitHub repo => package dependencies. We only have package => package dependencies atm.
Dependency Tracking Project Update
Objective
The main goal of this project is to create a comprehensive graph that shows:
- Which developer tools are being used by Optimism projects (both onchain projects and other OSS tools)
- When these projects started using these tools
From the graph, we can easily analyze downstream impact associated with these tools, including:
- Number of dependents
- Downstream gas fees
- Downstream active developers
- etc.
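As a rough illustration of how these downstream metrics could be read off the graph, here is a minimal Python sketch. The edge list, project names, and metric values are all hypothetical placeholders, not OSO data, and the simple sums would double count developers who work on several dependent projects.

```python
from collections import defaultdict

# Hypothetical edge list: (dependent_project, developer_tool).
edges = [
    ("project-a", "ethers"),
    ("project-b", "ethers"),
    ("project-b", "viem"),
    ("project-c", "viem"),
]

# Hypothetical per-project metrics; the numbers are made up for illustration.
project_metrics = {
    "project-a": {"gas_fees": 1.5, "active_devs": 3},
    "project-b": {"gas_fees": 0.5, "active_devs": 2},
    "project-c": {"gas_fees": 2.0, "active_devs": 5},
}


def downstream_impact(edges, project_metrics):
    """Aggregate dependent-project metrics for each developer tool."""
    dependents = defaultdict(set)
    for project, tool in edges:
        dependents[tool].add(project)

    impact = {}
    for tool, projects in dependents.items():
        impact[tool] = {
            "num_dependents": len(projects),
            # Naive sums: a developer active on two dependents is counted twice.
            "gas_fees": sum(project_metrics[p]["gas_fees"] for p in projects),
            "active_devs": sum(project_metrics[p]["active_devs"] for p in projects),
        }
    return impact


impact = downstream_impact(edges, project_metrics)
```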
Requisite Data Sources
- deps.dev dataset:
  - Maintained by Google
  - Example: https://deps.dev/npm/ethers/6.13.4
  - Covers several package ecosystems (NPM, Go, Maven, PyPI, Rust/Cargo)
  - Available in BigQuery (full dataset >40 terabytes)
- GitHub data:
  - Software Bill of Materials (SBOM) for relevant projects
  - Example: https://github.com/ethers-io/ethers.js/network/dependents
  - Shows which repositories import specific packages (e.g., Ethers) in their package.json or requirements files
- Onchain project data:
  - List of projects identified as being part of Optimism (e.g., in OP Atlas)
  - See current collection here
  - Potential to expand to other relevant projects that haven't signed up for Retro Funding (e.g., all of OSO projects)
- NPM standard metrics:
  - Downloads
  - Releases
  - Example: https://npm-stat.com/charts.html?package=ethersjs&from=2023-10-15&to=2024-10-15
Progress
- Brought deps.dev into OSO
  - See PR here
  - Created a snapshot of the deps.dev dependency graph:
    - Focused on the last month of data; limited to first-level (direct) dependencies; currently only looking at NPM packages
    - Snapshot size: ~35 gigabytes
- Developed a preliminary deps.dev query to match Optimism projects with their dependencies:
  - Caveat: this only identifies projects that publish packages which depend on another package, so it doesn't capture applications that simply list the package in their dependencies
  - Ethers had 60 matches (out of 500+ Optimism projects on OSO)
  - Results include both web3-specific packages (e.g., Ethers, Viem, OpenZeppelin) and general-purpose packages (e.g., Axios, Lodash, React)
  - See spreadsheet here
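The matching step described above can be sketched roughly as follows. This is not the actual OSO query: the row fields (`system`, `package`, `dependency`, `depth`) and the `package_owner` mapping are hypothetical stand-ins for the snapshot schema and the project-to-package links. It also shows the caveat: a project only matches if it owns a published package.

```python
# Hypothetical snapshot rows: one row per dependency edge.
snapshot = [
    {"system": "NPM", "package": "@proj-a/sdk", "dependency": "ethers", "depth": 1},
    {"system": "NPM", "package": "@proj-a/sdk", "dependency": "lodash", "depth": 1},
    {"system": "NPM", "package": "@proj-a/sdk", "dependency": "ws", "depth": 2},
    {"system": "NPM", "package": "@proj-b/cli", "dependency": "viem", "depth": 1},
    {"system": "CARGO", "package": "proj-c", "dependency": "serde", "depth": 1},
]

# Hypothetical mapping from published package to the project that owns it.
# Applications that publish no package never appear here, so they never match.
package_owner = {"@proj-a/sdk": "project-a", "@proj-b/cli": "project-b"}


def match_projects(snapshot, package_owner, system="NPM"):
    """Map each dependency to the set of projects whose packages use it directly."""
    matches = {}
    for row in snapshot:
        # Keep only direct (first-level) dependencies in the chosen ecosystem.
        if row["system"] != system or row["depth"] != 1:
            continue
        owner = package_owner.get(row["package"])
        if owner is None:
            continue
        matches.setdefault(row["dependency"], set()).add(owner)
    return matches


matches = match_projects(snapshot, package_owner)
```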
Next Steps
- Incorporate GitHub SBOM data:
  - Extract and process SBOM data for relevant projects
  - Identify which repos using specific packages (e.g., Ethers) are associated with the Optimism ecosystem
- Improve project identification:
  - Review the matching process to identify more potential projects
  - Consider including projects with active contracts that aren't yet in OSO or OP Atlas
- Deduplicate results:
  - Remove duplicate entries pointing to the same GitHub organization (e.g., Ethers and Ethers project)
- Integrate NPM metrics:
  - Incorporate download statistics, release information, and other relevant metrics from NPM
- Expand data sources:
  - Consider including Python and Rust packages in the analysis
  - Explore ways to efficiently query the full deps.dev dataset for more comprehensive results
- Metrics implementation:
  - Create a long list of potential metrics
  - Work with badgeholders to refine the list of metrics
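For the deduplication step, one possible approach is to key every entry on its GitHub organization and keep the first entry seen. The sketch below assumes each result row carries a `url` field pointing at GitHub; the field names and sample entries are illustrative only.

```python
from urllib.parse import urlparse


def github_org(url):
    """Return the lowercased GitHub owner/organization in a URL, or None."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    return parts[0].lower() if parts else None


def dedupe_by_org(entries):
    """Keep the first entry for each GitHub organization."""
    seen = set()
    unique = []
    for entry in entries:
        org = github_org(entry["url"])
        # Fall back to the raw URL when no organization can be extracted.
        key = org if org is not None else entry["url"]
        if key in seen:
            continue
        seen.add(key)
        unique.append(entry)
    return unique


# Illustrative entries: the first two point at the same organization.
entries = [
    {"name": "Ethers", "url": "https://github.com/ethers-io/ethers.js"},
    {"name": "Ethers project", "url": "https://github.com/ethers-io"},
    {"name": "Viem", "url": "https://github.com/wevm/viem"},
]

unique = dedupe_by_org(entries)
```

Keeping the first entry is an arbitrary choice; a real pipeline might instead merge the duplicates or prefer the entry with more metadata.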
Right, so we have 2 pathways:
1. We either try to integrate something like ecosyste.ms, which already has that data
2. Or we try to get it ourselves from the GitHub API.
I think there's tractable implementation risk for both of these, but high execution risk for both.
For (1), the data may not be as high quality as we want: it might be missing repos, and the replication may not work (we've tried a couple of times already and it's just really big).
For (2), there's a higher implementation cost connecting DLT to this https://docs.github.com/en/rest/dependency-graph/sboms?apiVersion=2022-11-28#export-a-software-bill-of-materials-sbom-for-a-repository--fine-grained-access-tokens and similarly we don't know what's going to be on the other side. I've heard anecdotally that GitHub doesn't detect every edge case correctly.
This is a good example of something where I would not be making representations that this is easy or for sure going to land in a set time frame. We have to start working on de-risking the unknown unknowns.
The nice thing about (2) is that we can tell RF applicants: hey, go look at your own GitHub insights and fix it if it's not being detected. They have extensive docs on how you can submit your own dependencies to their database: https://docs.github.com/en/code-security/supply-chain-security/understanding-your-software-supply-chain/using-the-dependency-submission-api
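To make pathway (2) a bit more concrete, here is a minimal sketch of pulling package URLs out of the SBOM export linked above. It uses only the standard library rather than DLT, the function names are hypothetical, and the sample payload is a hand-trimmed assumption about the SPDX-style response, so the real output should be checked against the GitHub docs before relying on it.

```python
import json
import urllib.request


def sbom_url(owner, repo):
    """Build the REST URL for a repository's SBOM export."""
    return f"https://api.github.com/repos/{owner}/{repo}/dependency-graph/sbom"


def fetch_sbom(owner, repo, token):
    """Fetch the SBOM for one repo (needs network access and a valid token)."""
    request = urllib.request.Request(
        sbom_url(owner, repo),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def purls_from_sbom(payload):
    """Collect package URLs (purls) from the externalRefs of each SBOM package."""
    purls = []
    for package in payload.get("sbom", {}).get("packages", []):
        for ref in package.get("externalRefs", []):
            if ref.get("referenceType") == "purl":
                purls.append(ref["referenceLocator"])
    return purls


# Hand-trimmed sample payload (assumed shape, not a real API response).
sample = {
    "sbom": {
        "packages": [
            {"name": "com.github.example/repo"},  # entry with no purl
            {
                "name": "npm:ethers",
                "versionInfo": "6.13.4",
                "externalRefs": [
                    {
                        "referenceCategory": "PACKAGE-MANAGER",
                        "referenceType": "purl",
                        "referenceLocator": "pkg:npm/ethers@6.13.4",
                    }
                ],
            },
        ]
    }
}

purls = purls_from_sbom(sample)
```

Parsing is kept separate from fetching so the extraction logic can be tested offline while the unknowns on the API side are being de-risked.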