identity-matching
identity-matching copied to clipboard
source{d} extension to match Git signatures to real people.
Identity Matching source{d} Extension
Match different identities of the same person using 🤖. Extension for source{d}.
Overview • How To Use • Science • Contributions • License
Overview
People are using different e-mails and names (aka identities) when they commit their work to git. E-mails can be corporate, personal, special like users.noreply.github.com, etc. Names can be with Surname or without, with typos, no name, etc. Thus to get precise information about developer it is required to gather their identities and separate them from another person identities. That's what we call Identity Matching.
How To Use
Right now no pre-built binaries are available. Please refer to How to build from source code section to build an executable.
Run match-identities --help to see all the parameters that you can configure.
There are two use cases supported for match-identities.
- With gitbase
- Without gitbase
In both cases, the output identity table is saved as a Parquet file. Read more in the Output format section.
Use with gitbase
match-identities is supposed to be used with gitbase.
First of all, make sure you have a gitbase instance running with all the repositories you are going to analyze.
Please refer to the gitbase documentation to get more information.
Usage example:
match-identities --output matched_identities.parquet
The credentials can be configured with the --host, --port, --user and --password flags.
For example, the following SQL gitbase query will return the identities of each commit author:
SELECT DISTINCT repository_id, commit_author_name, commit_author_email
FROM commits;
If you want to cache the gitbase output you can use the --cache flag.
After the identities are fetched from gitbase, the matching process is run.
Read Science section to learn more.
Use without gitbase
If you run match-identities with the --cache option enabled you get a csv file with the cached gitbase output.
Besides, if you already have a list of identities it is possible to run match-identities without gitbase involved.
Create a CSV file with the columns repo, email and name, then feed it to the --cache parameter.
Usage Example:
match-identities \
--cache path/to/csv/file.csv \
--output matched_identities.parquet
Output format
Once the algorithm finishes to merge identities, you get a table with 4 columns:
id(int64) -- unique identifier of the person with the corresponding identity.email(utf8) -- e-mail of the identity.name(utf8) -- name of the identity.repo(utf8) -- repository of the commit.
The columns email, name and repo may contain empty values which means no constraints.
For example, let's consider this output identity table:
id,email,name,repo
1,[email protected],"",""
1,"",alice,""
2,[email protected],"",""
2,"",bob,""
2,[email protected],"",""
2,"",no-name,bob/bobs-project
There are two developers.
Let's name them Alice (with id 1) and Bob (with id 2).
When we analyze a commit with [email protected] as author email, then the author is Alice.
The repository and author name are ignored since the author email is the most reliable way to define an identity.
On the other hand, when we analyze a commit with alice as an author name, then the author is Alice for whatever combination of email and repository.
Same for Bob, although he uses two different email addresses [email protected] and [email protected].
If we come across a commit with the no-name author name in bob/bobs-project repository then it is Bob's.
Convert parquet to CSV
It is possible to convert the output parquet file to CSV using the python script in the research directory:
python3 ./research/parquet2csv.py matched_identities.parquet
The result will be saved as matched_identities.csv.
Please note that pyspark must be installed.
External matching option
If the organization is using GitHub, Gitlab or Bitbucket, it is possible to use their API to match identities by emails. In that case, 2 columns are added and filled for every email in the table: the External id provider and the External id itself.
How to build
git clone https://github.com/src-d/identity-matching
cd identity-matching
make build
You'll see two directories with Linux and Macos binaries inside the build directory.
Science
There are two stages to match identities. The first is the precomputation which is run once on the whole dataset and remains unchanged during the subsequent steps. The second is the matching itself.
- Precomputation:
- Gather 2 lists of the most popular names and emails (by frequencies) on the whole dataset.
- Gather 2 lists of emails and names that will be ignored (aka blacklists) on the whole dataset. They are non-human identities and usually related to CI, bots, etc.
- Analysis:
- Gather the list of triplets
{email, name, repository}from all the commits using gitbase. - Remove any triplet whose name or email belongs to the blacklists.
- Merge identities with the same e-mail if it doesn't belong to the list of popular emails created in 1.1.
- Merge identities with the same name if it doesn't belong to the list of popular names created in 1.1.
When the name belongs to this list we replace it with the following tuple
(name, repository). - Save the resulting identity table in the desired output format.
- Gather the list of triplets
There is a Design Document (or a Blueprint, or whatever else you are used to call project documentation) which goes into more detail: link.
Contributions
...are welcome! See CONTRIBUTING and code of conduct.
License
GPL 3.0, see LICENSE. Y u no Apache/MIT? Read here.