Repository clustering analysis for EF
What is it?
Context
We built an initial Hex dashboard that identifies repos in the Ethereum ecosystem.
We received the following feedback:
I was wondering if you already had thoughts on how to cluster repos in some unsupervised manner and then label them, or if you think we need to start with categories and find the repos that belong in them.
For the former, especially if you are using dependency graphs from some set of 'root repos' (like you did with deep funding?), I've seen these 'dependency structure matrices' which may be of some help: https://docs.lattix.com/lattix/modelingComplexSystems/ModelingComplexSystems.html
If we are starting with a set of repos I thought it would be good to discuss what a good set of categories would be, and how we would manage the list going into the future.
Next Steps
We should do the following:
- Experiment with using pyoso in Colab with Gemini AI
- Implement a variety of clustering approaches and share results with EF
- Create a tutorial for others to use pyoso in Colab for ML explorations
Notebook: https://colab.research.google.com/drive/1GbesuJLalkTUCHQlHuRQFyGjvvfWql5h?authuser=2
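As a starting point for the unsupervised direction raised in the feedback, here is a minimal sketch of one possible clustering approach: link repos whose dependency sets overlap, then take the connected components. It is not the method used in the notebook. The repo names, package names, and the 0.4 threshold are all hypothetical, and a real run would load dependencies from a query instead of a hard-coded dict.

```python
from itertools import combinations

# Hypothetical toy data: repo -> set of packages it depends on.
# Real inputs would come from a dependency query, not from this dict.
DEPS = {
    "repo-a": {"ethers", "hardhat", "openzeppelin"},
    "repo-b": {"ethers", "hardhat", "chai"},
    "repo-c": {"libp2p", "leveldb", "blst"},
    "repo-d": {"libp2p", "blst", "snappy"},
    "repo-e": {"react", "vite"},
}


def jaccard(a, b):
    """Share of dependencies two repos have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_similarity(deps, threshold=0.4):
    """Single-linkage clustering: link repos whose Jaccard similarity
    meets the threshold, then return the connected components."""
    parent = {repo: repo for repo in deps}

    def find(x):
        # Follow parent pointers to the component root (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for r1, r2 in combinations(sorted(deps), 2):
        if jaccard(deps[r1], deps[r2]) >= threshold:
            parent[find(r1)] = find(r2)

    groups = {}
    for repo in sorted(deps):
        groups.setdefault(find(repo), []).append(repo)
    return sorted(groups.values())


clusters = cluster_by_similarity(DEPS)
for i, members in enumerate(clusters):
    print(f"cluster {i}: {members}")
```

The clusters come out unlabeled, so a labeling pass (manual or AI-assisted) would still be needed afterwards. Single linkage is only one option; the threshold and the similarity measure are the obvious knobs to vary when comparing approaches.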
I still have to:
- Clean up my local code and push it to the insights repo
- Formalize a conclusion
Per conversation with Carl, I'm going to create an initial version of a dependency structure matrix (DSM), as described at https://docs.lattix.com/lattix/modelingComplexSystems/ModelingComplexSystems.html, using the Devtooling categories.
PR (clustering repos by DSM): https://github.com/opensource-observer/insights/pull/174
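For readers unfamiliar with DSMs, the idea can be sketched as follows: a square matrix with one row and one column per repo, where a 1 marks "row repo depends on column repo", and rows are ordered by category so that within-category dependencies form blocks along the diagonal. This is an illustrative sketch under stated assumptions, not the code in the PR. The repo names, category labels, and edges below are hypothetical and are not the actual Devtooling categories.

```python
# Hypothetical toy inputs; repo names, categories, and edges are illustrative only.
CATEGORY = {
    "wallet-sdk": "Libraries",
    "contracts-kit": "Libraries",
    "test-runner": "Testing",
    "fuzzer": "Testing",
    "compiler": "Languages",
}

# (dependent, dependency) pairs: the first repo depends on the second.
EDGES = [
    ("wallet-sdk", "contracts-kit"),
    ("test-runner", "compiler"),
    ("fuzzer", "compiler"),
    ("fuzzer", "test-runner"),
    ("contracts-kit", "compiler"),
]


def build_dsm(category, edges):
    """Return (ordered repos, matrix) where matrix[i][j] == 1 means
    the repo in row i depends on the repo in column j."""
    # Order by category first so each category forms a contiguous block.
    order = sorted(category, key=lambda r: (category[r], r))
    index = {repo: i for i, repo in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for dependent, dependency in edges:
        matrix[index[dependent]][index[dependency]] = 1
    return order, matrix


def category_block_counts(category, order, matrix):
    """Count dependency edges for each (row category, column category) pair."""
    counts = {}
    for i, src in enumerate(order):
        for j, dst in enumerate(order):
            if matrix[i][j]:
                key = (category[src], category[dst])
                counts[key] = counts.get(key, 0) + 1
    return counts


order, dsm = build_dsm(CATEGORY, EDGES)
blocks = category_block_counts(CATEGORY, order, dsm)

for repo, row in zip(order, dsm):
    print(f"{repo:14s} {' '.join(str(v) for v in row)}")
print(blocks)
```

The block counts give a compact view of which categories lean on which others. Off-diagonal blocks with many edges may suggest categories that are tightly coupled or poorly separated.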
Per call yesterday:
- Want something more like the AI categorizations that we did for Optimism
- Main pain point is understanding trends for the top X builders across each category
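One way the "top X builders per category" view could be computed is sketched below. Everything here is an assumption made for illustration: what counts as a "builder", the activity metric, the monthly granularity, and all of the sample records are hypothetical. The sketch only shows the shape of the calculation.

```python
from collections import defaultdict

# Hypothetical records: (category, builder, month, activity count).
# The metric and all values are made up for illustration.
ACTIVITY = [
    ("Libraries", "alice", "2025-01", 40), ("Libraries", "alice", "2025-02", 50),
    ("Libraries", "bob", "2025-01", 30), ("Libraries", "bob", "2025-02", 15),
    ("Libraries", "carol", "2025-01", 5), ("Libraries", "carol", "2025-02", 6),
    ("Testing", "dave", "2025-01", 20), ("Testing", "dave", "2025-02", 20),
    ("Testing", "erin", "2025-01", 10), ("Testing", "erin", "2025-02", 25),
]


def top_builder_trends(records, top_x=2):
    """For each category, pick the top_x builders by total activity and
    report their change between the first and last month observed."""
    series = defaultdict(dict)  # (category, builder) -> {month: count}
    for category, builder, month, count in records:
        months = series[(category, builder)]
        months[month] = months.get(month, 0) + count

    by_category = defaultdict(list)
    for (category, builder), months in series.items():
        ordered = sorted(months)  # "YYYY-MM" strings sort chronologically
        by_category[category].append({
            "builder": builder,
            "total": sum(months.values()),
            "change": months[ordered[-1]] - months[ordered[0]],
        })

    # Rank by total activity (descending), breaking ties by name.
    return {
        category: sorted(rows, key=lambda r: (-r["total"], r["builder"]))[:top_x]
        for category, rows in by_category.items()
    }


trends = top_builder_trends(ACTIVITY)
for category, rows in sorted(trends.items()):
    for row in rows:
        print(category, row["builder"], row["total"], f"{row['change']:+d}")
```

Ranking by total activity and reporting first-to-last change is the simplest choice. A rolling average or a fitted slope would be less sensitive to a single noisy month.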