
Some contributors appear several times under a different name

Open vhoulbreque opened this issue 6 years ago • 14 comments

I tried this project on https://github.com/vinzeebreak/ironcar

What I did:

git-of-theseus-analyze ironcar
git-of-theseus-stack-plot authors.json

And I get this:

stack_plot

But, several authors are the same person (and they appear under only one name in github's list of commits):

  • Houlbreque, Vincent Houlbrèque, Vinzeebreak and vinzeebreak
  • Hugo Masclet, Hugoo, Masclet Hugo

Shouldn't they appear under the same name?

vhoulbreque avatar May 05 '18 16:05 vhoulbreque

I have the same issue. I tried working with the .mailmap file, but there is no difference.

andilar avatar Nov 29 '18 11:11 andilar

weird, i thought .mailmap would do the trick

feel free to investigate

erikbern avatar Nov 29 '18 13:11 erikbern

OK, thanks. What I found is that if you have just one entry in your .mailmap, it is recognized. Also, my output from git shortlog -sne comes out correctly with a full-blown .mailmap.
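To check which identities git itself still treats as separate after applying a .mailmap, the `git shortlog -sne` output can be parsed. The following is a minimal sketch, assuming the usual "count, tab, Name <email>" line layout. The helper names (`parse_shortlog`, `names_per_email`) and the example address are made up for illustration.

```python
import re
from typing import Dict, List, NamedTuple


class ShortlogEntry(NamedTuple):
    commits: int
    name: str
    email: str


# Each line of `git shortlog -sne` looks like "    42\tName <email>".
_LINE = re.compile(r"^\s*(\d+)\s+(.*?)\s*<([^>]*)>\s*$")


def parse_shortlog(text: str) -> List[ShortlogEntry]:
    """Parse the text printed by `git shortlog -sne` into entries."""
    entries = []
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:
            entries.append(ShortlogEntry(int(m.group(1)), m.group(2), m.group(3)))
    return entries


def names_per_email(entries: List[ShortlogEntry]) -> Dict[str, List[str]]:
    """Group author names by lowercased email address."""
    grouped: Dict[str, set] = {}
    for entry in entries:
        grouped.setdefault(entry.email.lower(), set()).add(entry.name)
    return {email: sorted(names) for email, names in grouped.items()}


# Hypothetical sample output; the address is a placeholder.
sample = "    12\tHugo Masclet <hugo@example.com>\n     3\tHugoo <hugo@example.com>\n"
entries = parse_shortlog(sample)
```

Any email that still maps to more than one name in `names_per_email(entries)` is a candidate for another .mailmap entry.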

andilar avatar Nov 29 '18 14:11 andilar

weird, maybe gitpython doesn't parse .mailmap?

erikbern avatar Nov 29 '18 15:11 erikbern

No, they don't: https://github.com/gitpython-developers/GitPython/issues/764 But they also propose a solution...
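Independently of whatever that GitPython issue proposes, one possible workaround is to ask the git command line for the author, since the `%aN` and `%aE` format placeholders respect .mailmap. This is only a sketch, not how git-of-theseus works today. `mapped_author` and its `run` parameter are hypothetical names, and `rev` is assumed to point at a commit.

```python
import subprocess
from typing import Callable, List, Optional, Tuple


def mapped_author(
    repo_path: str,
    rev: str,
    run: Optional[Callable[[List[str]], str]] = None,
) -> Tuple[str, str]:
    """Return (name, email) of a commit's author after .mailmap is applied.

    `run` executes a command and returns its stdout as text. It is
    injectable so the function can be exercised without a real repository.
    """
    if run is None:
        run = lambda cmd: subprocess.check_output(cmd, text=True)
    # %aN / %aE are the mailmap-aware author name and email; %x00 emits a
    # NUL byte, used here as an unambiguous separator.
    cmd = ["git", "-C", repo_path, "log", "-1", "--format=%aN%x00%aE", rev]
    out = run(cmd)
    name, email = out.rstrip("\n").split("\x00", 1)
    return name, email
```

Calling `mapped_author("ironcar", "HEAD")` inside a checkout would spawn one git process per commit, so a real integration would probably want to batch or cache the lookups.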

tveon avatar Jun 06 '19 07:06 tveon

feel free to commit a fix for this!

erikbern avatar Jun 06 '19 14:06 erikbern

Does this problem persist? Is there any solution? I didn't understand whether the .mailmap must be added to the git repo or can be used at the plot generation step.

martinib77 avatar Dec 31 '19 14:12 martinib77

Pretty sure the problem still exists, so feel free to try to fix it!

erikbern avatar Jan 01 '20 18:01 erikbern

Workaround: Use this JavaScript script to fix the authors.json file:

fix-authors.js

const fs = require("fs");
const authors = JSON.parse(fs.readFileSync("./authors.json"));

const labels = authors.labels;

const output = {
  ...authors,
};

const mailMap = {
  Houlbreque: "Vincent Houlbr",
  "Hugo Masclet": "Hugo Masclet",
  Hugoo: "Hugo Masclet",
  "Masclet Hugo": "Hugo Masclet",
  "Vincent Houlbr\u00e8que": "Vincent Houlbr",
  Vinzeebreak: "Vincent Houlbr",
  adizout: "adizout",
  mathrb: "mathrb",
  srdadian: "srdadian",
  vinzeebreak: "Vincent Houlbr",
};

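const memo = {};
let memoIndex = 0;

// For each original label, compute the index of its merged author.
const map = labels.map((name) => {
  // Names missing from mailMap keep their own label.
  const toName = mailMap[name] ?? name;

  // Compare against undefined: index 0 is falsy, so a truthiness
  // check would hand the first author a new index on every repeat.
  if (memo[toName] === undefined) {
    memo[toName] = memoIndex++;
  }
  return memo[toName];
});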

output.y = output.y.reduce((output, item, index) => {
  const toMap = map[index];

  item.forEach((value, i2) => {
    output[toMap] = output[toMap] || [];
    output[toMap][i2] = output[toMap][i2] || 0;
    output[toMap][i2] += value;
  });

  return output;
}, []);

output.labels = Object.keys(memo);

fs.writeFileSync("./authors.out.json", JSON.stringify(output, null, 4));

Then you can plot with:

git-of-theseus-stack-plot authors.out.json --out stack.authors.png

dht avatar Mar 26 '22 17:03 dht

I tried @dht 's script, but ended up with some authors getting mixed up.

I wrote a comparable script in Python that could probably be converted into a PR without too much effort (I just ran out of time to figure out how to integrate file paths with the CLI and the complexities of the analyze function).

"""
Aggregates contribution data from the `authors.json` file generated
by the `git-of-theseus` tool using an `authors_map.json` file.

The `authors_map.json` file must have the following format:
{
    "authorA": ["aliasA", "aliasA2", ...],
    "authorB": ["aliasB", "aliasB2", ...],
}
"""
import json


def read_authors_map(path):
    with open(path, "r") as f:
        authors_map = json.load(f)
    return authors_map


def read_authors_json(path):
    with open(path, "r") as aj:
        authors_json = json.load(aj)
    return authors_json


def parse_raw_contributions(authors_json):
    """
    The `authors.json` has the following format
    {
        "y": [
            [<line_count1>, <line_count2>, ...],
            [<line_count1>, <line_count2>, ...],
            ...
        ],
        "ts": ["date1", "date2", ...],
        "labels": ["aliasA", "aliasB", ...]
    }

    Each author's line count over time is stored separately
    from the author list. The association is made by index.

    This function parses the `authors.json` into the following
    format:
    {
        "aliasA": [<line_count1>, <line_count2>, ...],
        "aliasB": [<line_count1>, <line_count2>, ...],
        ...
    }
    """
    raw_contributions = {}
    for idx, alias in enumerate(authors_json["labels"]):
        raw_contributions[alias] = authors_json["y"][idx]
    return raw_contributions


def aggregate_contributions(authors_map, raw_contributions):
    """
    Aggregates the contribution data from each `alias` in the
    `raw_contributions` based on the `authors_map`.

    Returns a dictionary of the following format:
    {
        "authorA": [<line_count1>, <line_count2>, ...],
        "authorB": [<line_count1>, <line_count2>, ...],
    }
    where the values of each `author` are the sum of the contribution
    data for each author's corresponding aliases in the `authors_map`.

    For example, if the author `authorA` has aliases `aliasA` and `aliasA2`,
    and the `raw_contributions` data looks like this:
    {
        "aliasA": [10, 20],
        "aliasA2": [5, 20],
    }
    then the aggregated contribution data will look like this:
    {
        "authorA": [15, 40],
    }
    """
    contributions = {}
    for author, aliases in authors_map.items():
        alias_contributions = [
            raw_contributions[a] for a in aliases if a in raw_contributions
        ]
        if len(alias_contributions) > 0:
            contributions[author] = [
                sum(ac[idx] for ac in alias_contributions)
                for idx in range(len(alias_contributions[0]))
            ]

    return contributions


def format_new_authors_json(authors_map, authors_json, contributions):
    """
    Formats the `contributions` data into the `authors.json` format.
    """
    return {
        "y": [
            contributions[author]
            for author in authors_map.keys()
            if author in contributions
        ],
        "ts": authors_json["ts"],
        "labels": [author for author in authors_map.keys() if author in contributions],
    }


def write_authors_json(path, authors_json):
    with open(path, "w") as f:
        json.dump(authors_json, f)


if __name__ == "__main__":
    authors_map = read_authors_map("authors_map.json")
    authors_json = read_authors_json("authors.json")
    raw_contributions = parse_raw_contributions(authors_json)
    contributions = aggregate_contributions(authors_map, raw_contributions)
    new_authors_json = format_new_authors_json(authors_map, authors_json, contributions)
    write_authors_json("authors.out.json", new_authors_json)
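Note that the script above only keeps labels that are listed as an alias in `authors_map.json`; any other label is left out of the result. A small helper along these lines could generate a starter map to edit by hand and report labels that a given map would drop. This is a sketch, and `build_starter_map` and `find_unmapped` are hypothetical names, not part of git-of-theseus.

```python
from typing import Dict, List


def build_starter_map(labels: List[str]) -> Dict[str, List[str]]:
    """Map every label to itself, as a starting point for manual merging."""
    return {label: [label] for label in labels}


def find_unmapped(labels: List[str], authors_map: Dict[str, List[str]]) -> List[str]:
    """Return labels that appear in no alias list and would be dropped."""
    known = {alias for aliases in authors_map.values() for alias in aliases}
    return [label for label in labels if label not in known]


# Example with labels taken from this thread:
labels = ["Hugo Masclet", "Hugoo", "Masclet Hugo", "mathrb"]
starter = build_starter_map(labels)
merged = {"Hugo Masclet": ["Hugo Masclet", "Hugoo", "Masclet Hugo"]}
missing = find_unmapped(labels, merged)
```

Here `missing` contains `mathrb`, which would need its own entry in the map to stay in the plot.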

thehale avatar Jul 08 '22 20:07 thehale

I think a mailmap file might resolve it, but I'm not sure

erikbern avatar Jul 08 '22 21:07 erikbern

@erikbern I tried:

  • adding a .mailmap file
  • checking it in (not certain this would be a requirement)
  • re-running git-of-theseus-analyze (not certain this would be a requirement)

But, the created graphs still don't disambiguate between authors using what is specified in .mailmap. I.e., it doesn't seem to work.

Whathecode avatar Jul 14 '22 14:07 Whathecode

@Whathecode It doesn't look like git-of-theseus currently considers a .mailmap when computing author statistics. I understood erikbern's comment to mean that he would prefer a solution based on parsing a .mailmap over my proposed solution which uses a custom JSON format.
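A .mailmap-based approach could look roughly like the sketch below. It covers the four entry forms described in the gitmailmap documentation and nothing more. `parse_mailmap` and `resolve` are hypothetical names, lowercasing for matching is a simplification of this sketch, and the addresses in the example are placeholders.

```python
import re
from typing import Dict, Optional, Tuple

# "Proper Name <email>" optionally followed by "Commit Name <commit email>".
_ENTRY = re.compile(
    r"^(?P<n1>[^<]*)<(?P<e1>[^>]*)>\s*(?:(?P<n2>[^<]*)<(?P<e2>[^>]*)>)?\s*$"
)

Key = Tuple[Optional[str], str]  # (commit name or None, commit email)
Value = Tuple[Optional[str], Optional[str]]  # (proper name, proper email)


def parse_mailmap(text: str) -> Dict[Key, Value]:
    """Parse .mailmap text into a lookup table keyed by commit identity."""
    mapping: Dict[Key, Value] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        m = _ENTRY.match(line)
        if not m:
            continue  # ignore lines this sketch does not understand
        n1 = m.group("n1").strip() or None
        e1 = m.group("e1").strip()
        if m.group("e2") is None:
            # "Proper Name <commit@email>": only the name is replaced.
            mapping[(None, e1.lower())] = (n1, None)
        else:
            n2 = (m.group("n2") or "").strip() or None
            e2 = m.group("e2").strip()
            mapping[(n2.lower() if n2 else None, e2.lower())] = (n1, e1)
    return mapping


def resolve(mapping: Dict[Key, Value], name: str, email: str) -> Tuple[str, str]:
    """Return the canonical (name, email), preferring name+email matches."""
    hit = mapping.get((name.lower(), email.lower())) or mapping.get((None, email.lower()))
    if hit is None:
        return name, email
    proper_name, proper_email = hit
    return proper_name or name, proper_email or email


mm = parse_mailmap(
    "# placeholder entries\n"
    "Hugo Masclet <hugo@example.com> Hugoo <hugoo@example.com>\n"
    "<new@example.com> <old@example.com>\n"
)
```

Because the lookup needs the commit email as well as the name, this would have to run while commits are analyzed rather than on the finished authors.json, which only stores names.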

thehale avatar Jul 14 '22 16:07 thehale

I thought .mailmap would maybe work through the git library that git-of-theseus uses

I guess not? Would be nice to support .mailmap files!

Thanks for checking @Whathecode – really appreciate it!

erikbern avatar Jul 14 '22 21:07 erikbern

I also just ran into this. The .mailmap issue is still unresolved at GitPython and apparently that repo is now in maintenance mode and no longer actively maintained.

Not sure if that means that dependency will ultimately need to be swapped out, although I have no idea how big that job would be or what alternatives exist.

owenlamont avatar Feb 08 '23 11:02 owenlamont

@owenlamont The maintainer of GitPython actively responds to PRs, including PRs for new features (I had one merged in a few months ago). If someone contributed .mailmap support to GitPython I'm reasonably confident it would be accepted.

thehale avatar Feb 08 '23 16:02 thehale

Good to know, cheers. I kind of got mixed messages from the README as to how much it was still supported. I'll try to have a look at what is involved.

owenlamont avatar Feb 09 '23 11:02 owenlamont