Find a way to distinguish regular users from bots
We can take a rule-based approach as a benchmark: the email contains a bot keyword or "no-reply". However, some bot emails, like [email protected], are hard to catch this way, so ML should be applied to find them. Commit time-series features can be used.
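A minimal sketch of such a rule-based email baseline (the keyword list is an illustrative assumption, and, as the examples further down show, rules like this are too crude on their own):

```python
import re

# Naive baseline: flag an identity as a bot if the email contains
# an obvious marker such as "bot" or "no-reply"/"noreply".
# Note: this is exactly the kind of rule that misfires on real
# surnames containing "bot" (see the examples further down).
EMAIL_BOT_PATTERN = re.compile(r"bot|no-?reply", re.IGNORECASE)

def looks_like_bot(email: str) -> bool:
    return EMAIL_BOT_PATTERN.search(email) is not None

print(looks_like_bot("noreply@github.com"))    # True
print(looks_like_bot("jane.doe@example.com"))  # False
```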
@warenlg made an MVP of this feature (https://src-d.slack.com/archives/C7USX021L/p1563778058004300):
Why don't we remove bots, CI, automated stuff, etc. from the identity matching table with a simple regexp? Right now, I might have 10% of bots in the cloudfoundry identities I'm working with for the demo, e.g.,
["log cache ci", "metric store ci", "loggregator ci",
"pivotal publication toolsmiths", "cf-infra-bot",
"cloud foundry buildpacks team robot",
"garden windows", "final release builder",
"pipeline", "flintstone ci", "capi ci",
"container networking bot", "cf mega bot",
"routing-ci", "cf bpm", "uaa identity bot",
"pcf backup & restore ci",
"ci bot", "cfcr ci bot", "cfcr"]
I removed 2.5k rows out of 15k in total by excluding name identities matching [^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing
@EgorBu as discussed, I am assigning this issue to you.
Regardless of the approach you choose, please create a list of the filtered bots so we can also review them by eye and confirm that we do not filter out anything unrelated.
Thanks K for filing the issue.
Related to https://github.com/src-d/eee-identity-matching/issues/30
Current pattern: r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing"
Problems with regexp:
('cici jiayi shen', '[email protected]'),
('daniel adrian bohbot', '[email protected]'),
('horaci macias', '[email protected]'),
('melvindebot', '[email protected]'),
("daniel obot", "[email protected]")
Some French and Chinese names/surnames may look like bots to the regexp; see the check below.
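A quick way to see these misfires is to run the current pattern over a few of the names above (a standalone check, not part of the pipeline):

```python
import re

# The current labeling pattern quoted above.
PATTERN = re.compile(
    r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing"
)

names = [
    "cici jiayi shen",       # false positive: the second "ci" in "cici" is followed by a space
    "daniel adrian bohbot",  # false positive: surname ends with "bot"
    "horaci macias",         # false positive: "horaci" ends with "ci" + space
    "cf mega bot",           # true positive
]
for name in names:
    print(name, "->", bool(PATTERN.search(name)))  # all print True
```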
@EgorBu Regarding French and Chinese, GitHub profiles often contain the country code. You can take the "users" table from GHTorrent and remove "bots" which have any country assigned.
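A sketch of that filter, assuming the GHTorrent users dump is available locally as CSV and has a country_code column (per the GHTorrent schema):

```python
import pandas as pd

# Hypothetical file names: the GHTorrent "users" table and our
# regexp-labeled bot candidates, both keyed by GitHub login.
users = pd.read_csv("ghtorrent_users.csv")       # columns: login, country_code, ...
bots = pd.read_csv("regexp_bot_candidates.csv")  # column: login

# Real people sometimes fill in a country in their profile; bots
# should not, so drop "bots" that have any country assigned.
has_country = set(users.loc[users["country_code"].notna(), "login"])
bots = bots[~bots["login"].isin(has_country)]
```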
Ideas:
- use the regexp to find highly probable bots (19k found among 1300M rows with fields author.date, author.email, author.name, committer.date, committer.email, committer.name)
- calculate the author/committer fraction - distributions for normal users and bots may differ
- contribution activity - time & counts & repositories - it may show that distributions for normal users and bots are different
- entropy of commit messages - the idea is that bots heavily reuse a few patterns (see the sketch after this list)
- intersection of the name & the repository contributed to most
- pretrained (or trained on our dataset) NN model to extract message embeddings + clustering of messages - if a user's messages always come from 1-3 clusters, it could be a signal of a bot
- pretrained (or trained on our dataset) NN model to extract email/name embeddings + classification/clustering - it could be a good approach because we have quite a lot of bot names
- use statistical features, messages, and emails/names as input to a NN that makes embeddings (triplet loss to pull bot embeddings closer to each other) + k-nearest-neighbor search / classification
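For the entropy idea above, a minimal sketch: Shannon entropy of the distribution over a user's distinct commit messages, which is close to zero when a user keeps repeating a few templates (the message data is made up for illustration):

```python
import math
from collections import Counter

def message_entropy(messages):
    """Shannon entropy (bits) of the distribution over distinct messages."""
    counts = Counter(messages)
    total = len(messages)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# A bot reuses a couple of templates; a human writes varied messages.
bot_msgs = ["Update dependency foo to v1.2.3"] * 48 + ["Bump version"] * 2
human_msgs = ["Fix race in worker pool", "Refactor config loading",
              "Add retry logic", "Update README", "Handle empty input"]
print(message_entropy(bot_msgs))    # ~0.24 bits: almost one template
print(message_entropy(human_msgs))  # ~2.32 bits: every message distinct
```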
Updates:
- launched the pipeline to extract statistics for bots - it is slow (should take ~50 hours)
- downloaded the commit message dataset; reading about entropy measurements and other possible approaches
- reading and thinking about ideas, coding
Next steps:
- rewrite the pipeline to use Spark - the task matches the map-reduce paradigm (see the sketch after this list)
- resave datasets as parquet/CSV
- launch the pipeline for statistics
- launch the pipeline for entropy
- intersection of the name & the repository contributed to most
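A sketch of the Spark version of the statistics pipeline (file paths and column names are assumptions; the commit fields are the ones listed above, assumed here as flat columns):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("bot-stats").getOrCreate()

# Hypothetical parquet with the commit fields listed above plus a repository column.
commits = spark.read.parquet("commits.parquet")

# Map-reduce style: group commits per identity, reduce to activity stats.
stats = commits.groupBy("author_email").agg(
    F.count("*").alias("n_commits"),
    F.countDistinct("repository").alias("n_repos"),
    F.min("author_date").alias("first_commit"),
    F.max("author_date").alias("last_commit"),
)
stats.write.mode("overwrite").parquet("bot_stats.parquet")
```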
There are several problems that may affect the quality:
- Noisy labels:
  - false positives from the regexp, like `abbot`, `julia jenkins`, and so on
  - false negatives - undetected bots (`gardener@tensorflow`, for example)
- Model input doesn't contain the required info to make a correct prediction:
  - false negatives - the email doesn't contain bot-related info, but the name does. Ex: `[email protected]` / `Egor's bot for deployment`
- The name doesn't contain the required information to label it as a bot:
  - false positives - the email contains bot-related info, but the name doesn't. Ex: `[email protected]` / `Egorka` -> it will be labeled as not a bot, while the email tells us that it is a bot
- Metrics. Deduplication:
  - deduplication is done by several fields - if the `repository` name is included, the quality (reported in https://gist.github.com/EgorBu/a333409dfc12f89ac5fa1dc71461a3c0) is higher than the current one - probably because standard bot names are much more frequent, and in most cases standard names are detected with high quality (see the sketch below)
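For reference, a sketch of how the field choice changes the deduplication (pandas, hypothetical column names):

```python
import pandas as pd

df = pd.read_csv("identities.csv")  # hypothetical: name, email, repository, ...

# With the repository included, a frequent standard bot name is kept once
# per repository it touches, so such easy samples dominate the metric;
# without it, each identity contributes a single row.
per_repo = df.drop_duplicates(subset=["name", "email", "repository"])
per_identity = df.drop_duplicates(subset=["name", "email"])
print(len(per_repo), len(per_identity))
```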
- Metrics. Usage:
  - we still don't have a clear understanding of how the model should be applied (per commit, per identity, etc.) - metrics should be selected based on usage
- Dataset:
  - another possible reason the quality was higher there is some issue with the dataset
Hypotheses to check:
- metrics - clarify how to measure quality
- Dataset:
  - select a row in the dataset
  - split the dataset into 2 parts: before that row and after it
  - assign labels (0 - before, 1 - after)
  - train a classifier - if the quality is better than random, something is fishy with the dataset (see the sketch below)
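A sketch of that sanity check, in the spirit of adversarial validation (the model is a placeholder, and random features stand in for ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder features in the original row order of the dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))

# Label each row by its position relative to a chosen split row.
split_row = 5_000
y = (np.arange(len(X)) >= split_row).astype(int)

# If "before vs. after" is predictable with AUC well above 0.5,
# row order correlates with the features - something is fishy.
scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                         X, y, cv=5, scoring="roc_auc")
print(scores.mean())  # ~0.5 here, since the placeholder features are random
```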
- "false positives" (email contains bot-related info, the name doesn't) and "false negatives" (email doesn't contain bot-related info, the name does) - labels & predictions should be computed:
  - extract features separately from names & emails
  - find nearest neighbors by name
  - find nearest neighbors by email
  - several situations are possible (see the sketch below):
    - labels & predictions are the same among the nearest neighbors for names & emails - perfect
    - labels among the nearest neighbors for names are not the same - possible regexp mistakes?
    - predictions are not the same among the nearest neighbors for emails - check it
    - labels & predictions are not the same among the nearest neighbors for names & emails - possible regexp mistake?
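A sketch of the neighbor check, with character n-gram TF-IDF as a stand-in for the real name/email features (the names and labels are a tiny made-up sample):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

names = ["cf mega bot", "capi ci", "daniel adrian bohbot",
         "melvindebot", "jane doe"]
labels = [1, 1, 1, 1, 0]  # regexp labels; the two human surnames are mislabeled

# Character n-grams as a stand-in for the real features.
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(names)

nn = NearestNeighbors(n_neighbors=3).fit(X)
_, idx = nn.kneighbors(X)
for i, neighbors in enumerate(idx):
    # Label/prediction disagreement among a sample's neighbors flags it
    # for manual review as a possible regexp mistake.
    print(names[i], "->", [(names[j], labels[j]) for j in neighbors[1:]])
```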
- the model overfits to the regexp's mistakes:
  - hypothesis - the number of mistakes is not that big
  - train several models on different chunks of data - it will reduce the number of mistakes in each chunk
  - make the models vote when predicting (see the sketch below)
  - focus on samples where predictions and labels differ
  - focus on samples where the models' predictions differ
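A sketch of the chunked-training and voting idea (the model is a placeholder; X and y are assumed to be numpy arrays):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_chunk_ensemble(X, y, n_chunks=5, seed=0):
    """One model per disjoint chunk: each label mistake lands in a single
    chunk, so it can mislead at most one of the models."""
    order = np.random.default_rng(seed).permutation(len(X))
    return [LogisticRegression(max_iter=1000).fit(X[chunk], y[chunk])
            for chunk in np.array_split(order, n_chunks)]

def vote(models, X):
    preds = np.stack([m.predict(X) for m in models])
    majority = (preds.mean(axis=0) > 0.5).astype(int)
    disagreement = preds.min(axis=0) != preds.max(axis=0)
    return majority, disagreement

# Samples where the models disagree, or where the majority vote
# contradicts the regexp label, go to manual review.
```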
- features are not good enough:
  - BPE could extract features from `abot` as `[a, bot]` - this will make it almost impossible for the model to differentiate one class from the other
  - use a token splitter to split `[email protected]` into `[victor, abot, fr]` (see the sketch below)
    - add a feature that highlights whether something is in the exception list
    - don't extract BPE features from exceptions
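A sketch of the token-splitter plus exception-list idea (the exception list and the sample email are hypothetical):

```python
import re

# Hypothetical exception list: real surnames that BPE would mangle.
EXCEPTIONS = {"abot", "bohbot", "obot"}

def split_tokens(identity):
    """Split an email/name on delimiters instead of letting BPE cut inside
    words (BPE would turn "abot" into ["a", "bot"])."""
    return [t for t in re.split(r"[@.\s_\-+]+", identity.lower()) if t]

def featurize(identity):
    tokens = split_tokens(identity)
    in_exceptions = any(t in EXCEPTIONS for t in tokens)     # extra feature
    bpe_input = [t for t in tokens if t not in EXCEPTIONS]   # skip exceptions
    return tokens, in_exceptions, bpe_input

print(featurize("victor.abot@example.fr"))
# (['victor', 'abot', 'example', 'fr'], True, ['victor', 'example', 'fr'])
```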