Find a way to distinguish regular users from bots
We can take a rule-based approach as a benchmark: the email contains a bot keyword or "no-reply". However, some bot emails, like [email protected], are hard to catch this way, so ML should be applied to find them. Commit time-series features can be used.
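A minimal sketch of such a rule-based email baseline (the keyword list is an illustrative assumption, and, as the examples further down show, rules like this are too crude on their own):

```python
import re

# Naive baseline: flag an identity as a bot if the email contains
# an obvious marker such as "bot" or "no-reply"/"noreply".
# Note: this is exactly the kind of rule that misfires on real
# surnames containing "bot" (see the examples further down).
EMAIL_BOT_PATTERN = re.compile(r"bot|no-?reply", re.IGNORECASE)

def looks_like_bot(email: str) -> bool:
    return EMAIL_BOT_PATTERN.search(email) is not None

print(looks_like_bot("noreply@github.com"))    # True
print(looks_like_bot("jane.doe@example.com"))  # False
```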
@warenlg made an MVP of this feature (https://src-d.slack.com/archives/C7USX021L/p1563778058004300):
Why don't we remove bots, CI, automated stuff, etc. from the identity matching table with a simple regexp? Right now, I might have 10% of bots in the cloudfoundry identities I'm working with for the demo, e.g.,
["log cache ci", "metric store ci", "loggregator ci",
"pivotal publication toolsmiths", "cf-infra-bot",
"cloud foundry buildpacks team robot",
"garden windows", "final release builder",
"pipeline", "flintstone ci", "capi ci",
"container networking bot", "cf mega bot",
"routing-ci", "cf bpm", "uaa identity bot",
"pcf backup & restore ci",
"ci bot", "cfcr ci bot", "cfcr"]
I removed 2.5k rows out of 15k in total by excluding name identities matching [^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing
@EgorBu as discussed, I am assigning this issue to you.
Regardless of the approach you choose, please create a list of the filtered bots so we can also review them by eye and confirm that we do not filter out anything unrelated.
Thanks K for filing the issue.
Related to https://github.com/src-d/eee-identity-matching/issues/30
Current pattern: r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing"
Problems with regexp:
('cici jiayi shen', '[email protected]'),
('daniel adrian bohbot', '[email protected]'),
('horaci macias', '[email protected]'),
('melvindebot', '[email protected]'),
("daniel obot", "[email protected]")
Some French and Chinese names/surnames may look like bots to the regexp; see the check below.
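A quick way to see these misfires is to run the current pattern over a few of the names above (a standalone check, not part of the pipeline):

```python
import re

# The current labeling pattern quoted above.
PATTERN = re.compile(
    r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing"
)

names = [
    "cici jiayi shen",       # false positive: the second "ci" in "cici" is followed by a space
    "daniel adrian bohbot",  # false positive: surname ends with "bot"
    "horaci macias",         # false positive: "horaci" ends with "ci" + space
    "cf mega bot",           # true positive
]
for name in names:
    print(name, "->", bool(PATTERN.search(name)))  # all print True
```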
@EgorBu Regarding French and Chinese, GitHub profiles often contain the country code. You can take the "users" table from GHTorrent and remove "bots" which have any country assigned.
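A sketch of that filter, assuming the GHTorrent users dump is available locally as CSV and has a country_code column (per the GHTorrent schema):

```python
import pandas as pd

# Hypothetical file names: the GHTorrent "users" table and our
# regexp-labeled bot candidates, both keyed by GitHub login.
users = pd.read_csv("ghtorrent_users.csv")       # columns: login, country_code, ...
bots = pd.read_csv("regexp_bot_candidates.csv")  # column: login

# Real people sometimes fill in a country in their profile; bots
# should not, so drop "bots" that have any country assigned.
has_country = set(users.loc[users["country_code"].notna(), "login"])
bots = bots[~bots["login"].isin(has_country)]
```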
Ideas:
- use the regexp to find highly probable bots (19k found among 1300M rows with fields author.date, author.email, author.name, committer.date, committer.email, committer.name)
- calculate the author/committer fraction - distributions for normal users and bots may differ
- contribution activity - time & counts & repositories - it may show that distributions for normal users and bots are different
- entropy of commit messages - the idea is that bots heavily reuse a few patterns (see the sketch after this list)
- intersection of the name & the repository contributed to most
- pretrained (or trained on our dataset) NN model to extract message embeddings + clustering of messages - if a user's messages always come from 1-3 clusters, it could be a signal of a bot
- pretrained (or trained on our dataset) NN model to extract email/name embeddings + classification/clustering - it could be a good approach because we have quite a lot of bot names
- use statistical features, messages, and emails/names as input to a NN that makes embeddings (triplet loss to pull bot embeddings closer to each other) + k-nearest-neighbor search / classification
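For the entropy idea above, a minimal sketch: Shannon entropy of the distribution over a user's distinct commit messages, which is close to zero when a user keeps repeating a few templates (the message data is made up for illustration):

```python
import math
from collections import Counter

def message_entropy(messages):
    """Shannon entropy (bits) of the distribution over distinct messages."""
    counts = Counter(messages)
    total = len(messages)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# A bot reuses a couple of templates; a human writes varied messages.
bot_msgs = ["Update dependency foo to v1.2.3"] * 48 + ["Bump version"] * 2
human_msgs = ["Fix race in worker pool", "Refactor config loading",
              "Add retry logic", "Update README", "Handle empty input"]
print(message_entropy(bot_msgs))    # ~0.24 bits: almost one template
print(message_entropy(human_msgs))  # ~2.32 bits: every message distinct
```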
Updates:
- launched the pipeline to extract statistics for bots - it is slow (should take ~50 hours)
- downloaded the commit message dataset; reading about entropy measurements and other possible approaches
- reading and thinking about ideas, coding
Next steps:
- rewrite the pipeline to use Spark - the task matches the map-reduce paradigm (see the sketch after this list)
- resave datasets as parquet/CSV
- launch the pipeline for statistics
- launch the pipeline for entropy
- intersection of the name & the repository contributed to most
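A sketch of the Spark version of the statistics pipeline (file paths and column names are assumptions; the commit fields are the ones listed above, assumed here as flat columns):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("bot-stats").getOrCreate()

# Hypothetical parquet with the commit fields listed above plus a repository column.
commits = spark.read.parquet("commits.parquet")

# Map-reduce style: group commits per identity, reduce to activity stats.
stats = commits.groupBy("author_email").agg(
    F.count("*").alias("n_commits"),
    F.countDistinct("repository").alias("n_repos"),
    F.min("author_date").alias("first_commit"),
    F.max("author_date").alias("last_commit"),
)
stats.write.mode("overwrite").parquet("bot_stats.parquet")
```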
There are several problems that may affect the quality:
- Noisy labels:
  - false positives from the regexp, like `abbot`, `julia jenkins`, and so on
  - false negatives - undetected bots (`gardener@tensorflow`, for example)
- Model input doesn't contain the required info to make a correct prediction:
  - false negatives - the email doesn't contain bot-related info, but the name does. Ex: `[email protected]` / `Egor's bot for deployment`
- The name doesn't contain the required information to label it as a bot:
  - false positives - the email contains bot-related info, but the name doesn't. Ex: `[email protected]` / `Egorka` -> it will be labeled as not a bot, while the email tells us that it is a bot
- Metrics. Deduplication:
  - deduplication is done by several fields - if the `repository` name is included, the quality (reported in https://gist.github.com/EgorBu/a333409dfc12f89ac5fa1dc71461a3c0) is higher than the current one - probably because standard bot names are much more frequent, and in most cases standard names are detected with high quality (see the sketch below)
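For reference, a sketch of how the field choice changes the deduplication (pandas, hypothetical column names):

```python
import pandas as pd

df = pd.read_csv("identities.csv")  # hypothetical: name, email, repository, ...

# With the repository included, a frequent standard bot name is kept once
# per repository it touches, so such easy samples dominate the metric;
# without it, each identity contributes a single row.
per_repo = df.drop_duplicates(subset=["name", "email", "repository"])
per_identity = df.drop_duplicates(subset=["name", "email"])
print(len(per_repo), len(per_identity))
```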
- Metrics. Usage:
  - we still don't have a clear understanding of how the model should be applied (per commit, per identity, etc.) - metrics should be selected based on usage
- Dataset:
  - another possible reason the quality was higher there is some issue with the dataset
Hypotheses to check:
- metrics - clarify how to measure quality
- Dataset:
  - select a row in the dataset
  - split the dataset into 2 parts: before that row and after it
  - assign labels (0 - before, 1 - after)
  - train a classifier - if the quality is better than random, something is fishy with the dataset (see the sketch below)
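A sketch of that sanity check, in the spirit of adversarial validation (the model is a placeholder, and random features stand in for ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder features in the original row order of the dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))

# Label each row by its position relative to a chosen split row.
split_row = 5_000
y = (np.arange(len(X)) >= split_row).astype(int)

# If "before vs. after" is predictable with AUC well above 0.5,
# row order correlates with the features - something is fishy.
scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                         X, y, cv=5, scoring="roc_auc")
print(scores.mean())  # ~0.5 here, since the placeholder features are random
```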
- "false positives" (email contains bot-related info, the name doesn't) and "false negatives" (email doesn't contain bot-related info, the name does) - labels & predictions should be computed:
  - extract features separately from names & emails
  - find nearest neighbors by name
  - find nearest neighbors by email
  - several situations are possible (see the sketch below):
    - labels & predictions are the same among the nearest neighbors for names & emails - perfect
    - labels among the nearest neighbors for names are not the same - possible regexp mistakes?
    - predictions are not the same among the nearest neighbors for emails - check it
    - labels & predictions are not the same among the nearest neighbors for names & emails - possible regexp mistake?
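A sketch of the neighbor check, with character n-gram TF-IDF as a stand-in for the real name/email features (the names and labels are a tiny made-up sample):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

names = ["cf mega bot", "capi ci", "daniel adrian bohbot",
         "melvindebot", "jane doe"]
labels = [1, 1, 1, 1, 0]  # regexp labels; the two human surnames are mislabeled

# Character n-grams as a stand-in for the real features.
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(names)

nn = NearestNeighbors(n_neighbors=3).fit(X)
_, idx = nn.kneighbors(X)
for i, neighbors in enumerate(idx):
    # Label/prediction disagreement among a sample's neighbors flags it
    # for manual review as a possible regexp mistake.
    print(names[i], "->", [(names[j], labels[j]) for j in neighbors[1:]])
```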
- the model overfits to the regexp's mistakes:
  - hypothesis - the number of mistakes is not that big
  - train several models on different chunks of data - it will reduce the number of mistakes in each chunk
  - make the models vote when predicting (see the sketch below)
  - focus on samples where predictions and labels differ
  - focus on samples where the models' predictions differ
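A sketch of the chunked-training and voting idea (the model is a placeholder; X and y are assumed to be numpy arrays):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_chunk_ensemble(X, y, n_chunks=5, seed=0):
    """One model per disjoint chunk: each label mistake lands in a single
    chunk, so it can mislead at most one of the models."""
    order = np.random.default_rng(seed).permutation(len(X))
    return [LogisticRegression(max_iter=1000).fit(X[chunk], y[chunk])
            for chunk in np.array_split(order, n_chunks)]

def vote(models, X):
    preds = np.stack([m.predict(X) for m in models])
    majority = (preds.mean(axis=0) > 0.5).astype(int)
    disagreement = preds.min(axis=0) != preds.max(axis=0)
    return majority, disagreement

# Samples where the models disagree, or where the majority vote
# contradicts the regexp label, go to manual review.
```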
- features are not good enough:
  - BPE could extract features from `abot` as `[a, bot]` - this will make it almost impossible for the model to differentiate one class from the other
  - use a token splitter to split `[email protected]` into `[victor, abot, fr]` (see the sketch below)
    - add a feature that highlights whether something is in the exception list
    - don't extract BPE features from exceptions
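A sketch of the token-splitter plus exception-list idea (the exception list and the sample email are hypothetical):

```python
import re

# Hypothetical exception list: real surnames that BPE would mangle.
EXCEPTIONS = {"abot", "bohbot", "obot"}

def split_tokens(identity):
    """Split an email/name on delimiters instead of letting BPE cut inside
    words (BPE would turn "abot" into ["a", "bot"])."""
    return [t for t in re.split(r"[@.\s_\-+]+", identity.lower()) if t]

def featurize(identity):
    tokens = split_tokens(identity)
    in_exceptions = any(t in EXCEPTIONS for t in tokens)     # extra feature
    bpe_input = [t for t in tokens if t not in EXCEPTIONS]   # skip exceptions
    return tokens, in_exceptions, bpe_input

print(featurize("victor.abot@example.fr"))
# (['victor', 'abot', 'example', 'fr'], True, ['victor', 'example', 'fr'])
```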