spamtoberfest
Add score-based spam detection alongside the blacklist
-
I'm submitting a ...

- [ ] spammer report
- [ ] bug report
- [X] feature request
- [ ] question about the decisions made in the repository
- [ ] question about how to use this project
-
Summary
It seems that this tool is currently only a simple blacklist, but I think some kind of negative scoring system could be introduced.
- Other information (e.g. detailed explanation, stack traces, related issues, suggestions on how to fix, links for us to have context, e.g. StackOverflow, personal fork, etc.)
I believe we can score PRs negatively (and positively) and mark a PR as spam once a defined threshold is met. For example, some things that could deduct score (a rough code sketch follows at the end of this comment):
- Changes only in text files (.md, .html),
- Changes only in one file (or removal of a single file),
- Changes only in one line,
- Changes consisting of the words "awesome" or "amazing" ;) (i.e. blacklisting words in commit messages and in the diffs themselves),
- Empty PR descriptions,
- "patch-1" as the name of the remote branch.
Of course, it's not the best solution, as it won't be 100% bulletproof, but what do you think?
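To make the idea concrete, here is a minimal sketch of such a scoring function. Everything in it is hypothetical: the `PullRequestInfo` shape, the word list and the `SPAM_THRESHOLD` value are made up for illustration, and TypeScript is just an assumed language.

```ts
// Hypothetical model of the PR data we would score; not an existing type in this repo.
interface PullRequestInfo {
  changedFiles: string[];   // paths touched by the PR
  additions: number;        // lines added
  deletions: number;        // lines removed
  body: string;             // PR description
  headBranch: string;       // name of the remote branch
  commitMessages: string[];
}

const TEXT_ONLY = /\.(md|html)$/i;
const BLACKLISTED_WORDS = /\b(awesome|amazing)\b/i;
const SPAM_THRESHOLD = 3; // configurable: how many checks must match

function spamScore(pr: PullRequestInfo): number {
  let score = 0;
  if (pr.changedFiles.every((f) => TEXT_ONLY.test(f))) score++;          // only text files
  if (pr.changedFiles.length === 1) score++;                             // a single file
  if (pr.additions + pr.deletions <= 1) score++;                         // a one-line change
  if (pr.commitMessages.some((m) => BLACKLISTED_WORDS.test(m))) score++; // blacklisted words
  if (pr.body.trim() === '') score++;                                    // empty description
  if (pr.headBranch === 'patch-1') score++;                              // GitHub's default branch name for web edits
  return score;
}

function isProbablySpam(pr: PullRequestInfo): boolean {
  return spamScore(pr) >= SPAM_THRESHOLD;
}
```

The threshold is what controls how aggressive the detection is.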
That is not really possible. Someone changing four words with actual typos can be making a sincere pull request, and if I made a sincere pull request and got labelled as spam right away, I'm not sure I would spend my time on a project like that.
In my mind, most (or all, or a configurable number of) checks must be met for a PR to be marked as spam, so legitimately correcting typos shouldn't trigger anything.
Thank you for your contribution @ktos! As said by @StefanJanssen95, we need to refine the criteria to separate sincere PRs from spam PRs as accurately as possible.
As planned in #1, if we detach the blacklist from the build and make it an external JSON file, we can absolutely add more details and indicators attached to a user.
This implies defining a new, more complete model based on the criteria we would use to compute a trust score. In addition, it would be nice to let users configure their own minimum allowed threshold in the GitHub Action. Feel free to make suggestions, propose some code and open some PRs.
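As a starting point for discussion, here is a sketch of what that external file and the trust-score computation could look like. This assumes TypeScript, and the field names (`FlaggedUser`, `indicators`) and penalty weights are placeholders, not an agreed-on schema:

```ts
// Hypothetical shape for the externalized JSON file discussed in #1,
// extended with per-user indicators.
interface FlaggedUser {
  login: string;        // GitHub username
  reports: number;      // how many times the user was reported
  indicators: string[]; // e.g. "empty-description", "patch-1-branch"
}

interface SpamList {
  updatedAt: string;    // ISO date of the last update
  users: FlaggedUser[];
}

// Trust score in [0, 1]: starts at 1 and decreases with each indicator and report.
// The weights here are arbitrary placeholders to be tuned.
function trustScore(user: FlaggedUser): number {
  const penalty = 0.1 * user.indicators.length + 0.2 * user.reports;
  return Math.max(0, 1 - penalty);
}
```

The minimum allowed trust score could then be exposed as a regular input of the GitHub Action, so each repository can tune it in its own workflow file.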
I think this is a cool idea, but hard to implement. Someone could train a neural network on the database to find relations between this information and the spamminess of a PR.
If implemented, the intelligent algorithm should not close PRs automatically but instead label them as "possibly not following the standards".
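If we go the labeling route, something like the following could work inside the Action. This is a sketch assuming `@actions/github` is used; the label text and the function name are only examples:

```ts
import * as github from '@actions/github';

// Hypothetical helper: label a suspicious PR instead of closing it.
async function flagSuspiciousPr(token: string, prNumber: number): Promise<void> {
  const octokit = github.getOctokit(token);
  const { owner, repo } = github.context.repo;

  // PRs are issues as far as labels are concerned, so addLabels accepts a PR number.
  await octokit.rest.issues.addLabels({
    owner,
    repo,
    issue_number: prNumber,
    labels: ['possibly not following the standards'],
  });
}
```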