
Find a way to distinguish regular users from bots

Open zurk opened this issue 6 years ago • 7 comments

We can take a rule-based approach as a benchmark: the email contains the word "bot" or "no-reply". However, there are emails like [email protected] that are hard to find this way, so some ML should be applied to catch them. Commit time-series features can be used.
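A minimal sketch of such a rule-based baseline (the regexp and the helper name are illustrative, not the project's actual filter):

```python
import re

# Illustrative rule: flag emails whose local part contains the word
# "bot" or a no-reply marker. A baseline only; real bots like the
# redacted example above will slip through.
BOT_EMAIL_RULE = re.compile(r"\bbot\b|no-?reply", re.IGNORECASE)

def looks_like_bot_email(email: str) -> bool:
    local_part = email.split("@", 1)[0]
    return bool(BOT_EMAIL_RULE.search(local_part))

print(looks_like_bot_email("noreply@github.com"))       # True
print(looks_like_bot_email("cf-infra-bot@example.com")) # True
print(looks_like_bot_email("jane.doe@example.com"))     # False
```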

zurk avatar Jul 08 '19 14:07 zurk

@warenlg did MVP of this feature (https://src-d.slack.com/archives/C7USX021L/p1563778058004300):

Why don't we remove bots, CI, automated stuff, etc. from the identity matching table with a simple regexp? Right now, I might have 10% bots among the cloudfoundry identities I'm working with for the demo, e.g.,

["log cache ci", "metric store ci", "loggregator ci",
 "pivotal publication toolsmiths", "cf-infra-bot",
 "cloud foundry buildpacks team robot",
 "garden windows", "final release builder",
 "pipeline", "flintstone ci", "capi ci",
 "container networking bot", "cf mega bot",
 "routing-ci", "cf bpm", "uaa identity bot",
 "pcf backup & restore ci",
 "ci bot", "cfcr ci bot", "cfcr"]

I removed 2.5k rows out of 15k in total by excluding name identities matching [^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing
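Applied to names, that filter could look roughly like this (a sketch; only the regexp comes from the message above, and the padding trick and helper name are mine; the "ci" alternative needs non-letters on both sides, so names are padded and bot$ is checked separately because padding would defeat the anchor):

```python
import re

# Alternatives that can match anywhere in a padded name.
NAME_PATTERN = re.compile(r"[^a-zA-Z]ci[^a-zA-Z]|pipeline|release|routing")
# Anchored alternative, checked against the unpadded name.
BOT_SUFFIX = re.compile(r"bot$")

def is_probable_bot(name: str) -> bool:
    name = name.lower()
    # Pad so a leading/trailing "ci" is surrounded by non-letters.
    return bool(NAME_PATTERN.search(f" {name} ") or BOT_SUFFIX.search(name))

print(is_probable_bot("flintstone ci"))  # True
print(is_probable_bot("cf-infra-bot"))   # True
print(is_probable_bot("routing-ci"))     # True
print(is_probable_bot("alice smith"))    # False
```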

@EgorBu, as discussed, I am assigning this issue to you.

Regardless of the approach you choose, please create a list of the filtered bots so we can also review them by eye and check that we do not filter out anything unrelated.

zurk avatar Jul 22 '19 11:07 zurk

Thanks K for filing the issue

warenlg avatar Jul 22 '19 13:07 warenlg

Related to https://github.com/src-d/eee-identity-matching/issues/30

vmarkovtsev avatar Jul 25 '19 15:07 vmarkovtsev

Current pattern: r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing"

Problems with the regexp:

('cici jiayi shen', '[email protected]'),
('daniel adrian bohbot', '[email protected]'),
('horaci macias', '[email protected]'),
('melvindebot', '[email protected]'),
('daniel obot', '[email protected]')

Some French and Chinese names/surnames can look like bot names to the regexp
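These false positives are easy to reproduce (a sketch; only the pattern is taken from the comment above):

```python
import re

# The current pattern quoted above.
PATTERN = re.compile(r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|"
                     r"bot$|pipeline|release|routing")

false_positives = [
    "cici jiayi shen",       # "ci " inside "cici "
    "daniel adrian bohbot",  # surname ends with "bot"
    "horaci macias",         # "ci " inside "horaci "
    "melvindebot",           # name ends with "bot"
    "daniel obot",           # surname "obot" ends with "bot"
]

# All of these are legitimate users, yet every one is flagged.
for name in false_positives:
    assert PATTERN.search(name) is not None
```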

EgorBu avatar Jul 31 '19 08:07 EgorBu

@EgorBu Regarding French and Chinese names, GitHub profiles often contain a country code. You can take the "users" table from GHTorrent and remove from the bot list any "bots" that have a country assigned.

vmarkovtsev avatar Jul 31 '19 08:07 vmarkovtsev

Ideas:

  • use a regexp to find highly probable bots (19k found among 1300M rows of author.date, author.email, author.name, committer.date, committer.email, committer.name)
  • calculate the author/committer fraction - the distributions for normal users and bots may differ
  • contribution activity - times & counts & repositories - the distributions for normal users and bots may differ
  • entropy of commit messages - the idea being that bots heavily reuse a few patterns
  • intersection of the name and the repository contributed to most
  • a pretrained NN model (or one trained on the dataset) to extract message embeddings + clustering of messages - if a user's messages always come from 1-3 clusters, it could be a signal of a bot
  • a pretrained NN model (or one trained on the dataset) to extract email/name embeddings + classification/clustering - this could work well because we have quite a lot of bot names
  • use statistical features, messages, and emails/names as input to an NN that produces embeddings (triplet loss to pull bot embeddings closer to each other) + k-nearest-neighbor search / classification
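The commit-message entropy idea above could be measured as the Shannon entropy of an author's message-token distribution: bots that repeat a few templates should score low (a sketch; the sample messages are made up):

```python
import math
from collections import Counter

def message_entropy(messages):
    """Shannon entropy (in bits) of the token distribution over all messages."""
    counts = Counter(tok for msg in messages for tok in msg.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A bot-like author repeating one template scores lower than a human
# author with varied messages.
bot_msgs = ["bump version to 1.%d" % i for i in range(10)]
human_msgs = ["fix race in scheduler", "refactor parser tests",
              "add retry logic to client", "document config options"]

print(message_entropy(bot_msgs) < message_entropy(human_msgs))  # True
```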

Updates:

  • launched the pipeline for extracting statistics about bots - it's slow (should take ~50 hours)
  • downloaded the message dataset; reading about entropy measures and other possible approaches
  • reading and thinking about ideas, coding

Next steps:

  • I will rewrite the pipeline to use Spark - the task matches the map-reduce paradigm
  • re-save the datasets as parquet/csv
  • launch the pipeline for statistics
  • launch the pipeline for entropy
  • compute the intersection of the name and the repository contributed to most

EgorBu avatar Aug 01 '19 08:08 EgorBu

There are at least several problems that may affect the quality:

  1. Noisy labels
    • false positives from the regexp - like abbot, julia jenkins, and so on
    • false negatives - undetected bots (gardener@tensorflow, for example)
  2. The model input doesn't contain the info required to make a correct prediction
    • false negatives - the email doesn't contain bot-related info but the name does. Ex: [email protected] / Egor's bot for deployment
  3. The name doesn't contain the info required to label it as a bot
    • false positives - the email contains bot-related info but the name doesn't. Ex: [email protected] / Egorka -> it will be labeled as not a bot while the email tells us it is one
  4. Metrics. Deduplication:
    • deduplication is done by several fields - if the repository name is included, the resulting quality can be seen here: https://gist.github.com/EgorBu/a333409dfc12f89ac5fa1dc71461a3c0
    • it's higher than the current one - this probably means that standard bot names are much more frequent, and in most cases standard names are detected with high quality
  5. Metrics. Usage
    • we still don't have a clear understanding of how this should be applied (per commit, per identity, etc.) - the metrics should be selected based on the usage
  6. Dataset
    • another possible reason the quality was higher here is some issue with the dataset
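Points 2 and 3 can be surfaced automatically by labeling the name and the email independently and flagging identities where the two signals disagree (a sketch; both regexps and all the example identities are hypothetical):

```python
import re

# Hypothetical, deliberately simple rules for each field.
NAME_RULE = re.compile(r"bot$|pipeline|jenkins")
EMAIL_RULE = re.compile(r"\bbot\b|no-?reply|jenkins")

def signals(name: str, email: str):
    """Return (bot-by-name, bot-by-email) for one identity."""
    by_name = bool(NAME_RULE.search(name.lower()))
    by_email = bool(EMAIL_RULE.search(email.split("@", 1)[0].lower()))
    return by_name, by_email

# Disagreements are exactly the cases described in points 2 and 3.
identities = [
    ("deploy bot", "deploy-bot@example.com"),   # both agree: bot
    ("Egorka", "deploy-bot@example.com"),       # email says bot, name doesn't
    ("Egor's deploy bot", "egor@example.com"),  # name says bot, email doesn't
]
for name, email in identities:
    by_name, by_email = signals(name, email)
    if by_name != by_email:
        print("review:", name, "/", email)
```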

Hypotheses to check

  • metrics - clarify how to measure quality
  • Dataset
    • select a row in the dataset
    • split the dataset into 2 parts: before that row and after it
    • assign labels (0 - before, 1 - after)
    • train a classifier - if its quality is better than random, something is fishy with the dataset
  • false positives - the email contains bot-related info but the name doesn't - and false negatives - the email doesn't contain bot-related info but the name does
    • labels & predictions should be computed
    • extract features separately from names & emails
    • find the nearest neighbors by name
    • find the nearest neighbors by email
    • several situations are possible:
      • labels & predictions are the same among the nearest neighbors for names & emails - perfect
      • labels among the nearest neighbors for names are not the same - possible regexp mistakes?
      • predictions are not the same among the nearest neighbors for emails - check it
      • labels & predictions are not the same among the nearest neighbors for names & emails - possible regexp mistakes?
  • the model overfits to the regexp's mistakes
    • hypothesis - the number of mistakes is not that big
    • train several models on different chunks of the data - this reduces the number of mistakes in each chunk
    • make the models vote when predicting
    • focus on samples where predictions and labels differ
    • focus on samples where the models' predictions differ
  • features are not good enough
    • BPE could split abot into [a, bot] - making it almost impossible for the model to differentiate one class from the other
      • use a token splitter to split [email protected] into [victor, abot, fr]
      • add a feature that flags whether something is in the exception list
      • don't extract BPE features from exceptions
Papers:

EgorBu avatar Oct 11 '19 08:10 EgorBu