Create a training set of labeled repos for DDP
We'll iterate on these with DDP, but here are some initial ideas for labels:
- "abandoned" - project was started and quickly abandoned
- "duplicate" - project is double-counted or is a fork that does not deviate from the main project
- "false positive" - project does not seem like it belongs in the dataset
- "spammy" - project has a lot of bot-like activity or other signs of manufactured activity
- "high quality missing from OSO" - high-quality projects that are missing from OSSD
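
To make the label list above concrete, here is a minimal sketch of how a labeled training set could be represented and exported. The record fields (`repo_url`, `label`, `note`), the CSV layout, and the names `RepoLabel`, `LabeledRepo`, and `to_csv` are all assumptions for illustration; they are not an agreed-upon schema with DDP, and the label set itself is expected to change as we iterate.

```python
import csv
import io
from dataclasses import dataclass
from enum import Enum


class RepoLabel(str, Enum):
    """Initial label ideas from this issue; values match the label text above."""
    ABANDONED = "abandoned"
    DUPLICATE = "duplicate"
    FALSE_POSITIVE = "false positive"
    SPAMMY = "spammy"
    HIGH_QUALITY_MISSING = "high quality missing from OSO"


@dataclass(frozen=True)
class LabeledRepo:
    """One training example. Field names are hypothetical, not a fixed schema."""
    repo_url: str
    label: RepoLabel
    note: str = ""  # optional free-text justification from the labeler


def to_csv(rows):
    """Serialize labeled repos to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["repo_url", "label", "note"])
    for row in rows:
        writer.writerow([row.repo_url, row.label.value, row.note])
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder URL only; not a real labeled project.
    example = [LabeledRepo("https://example.com/org/repo", RepoLabel.SPAMMY, "placeholder")]
    print(to_csv(example), end="")
```

Keeping the labels in an enum means a typo in a label fails at load time (`RepoLabel("spamy")` raises `ValueError`) instead of silently creating a new class in the training set.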