incubator-devlake [Bug] [Database] author_id column in commits table of database contains an email instead of an id

Search before asking

[X] I had searched in the issues and found no similar issues.

What happened

Exploring the structure of the database and the content of the tables once we scanned few repositories, we realized that the values of the column author_id in the table commits, contains the same value of column author_email. We compared this with another table and in pull_requests table there is a column also called author_id, but its content is the unique GitHub id of the user that created the pull request.

What you expected to happen

We expected that the content of author_id in Commits had the same value as author_id in pull_requests (ex: github:GithubUser:11111111). The design of the table commits contains two columns with the same values which means had duplicated values in the table, and it is not an ideal design. Have this unique identifier in the table commits will help to create dashboards in Grafana to obtain more exactly metrics.

Without an unique author_id on commits table, it's not possible obtain all the commits of an individual as now each combination of email + display name makes a different commiter.

Actual behavior:

author_name	author_email	author_id
Jon Doe	[email protected]	[email protected]
Jon Doe	[email protected]	[email protected]

(Both emails belong to the same person)

Expected behavior:

author_name	author_email	author_id
Jon Doe	[email protected]	github:GithubUser:11111111
Jon Doe	[email protected]	github:GithubUser:11111111

How to reproduce

Have a GitHub connection configured with a token
Go to Pipelines > Create Pipeline Run.
Click on Create Pipeline Run.
Scroll down to until 'Github' is shown in Data Providers list.
Toggle on GitHub Data provider
Enter repository owner and name for a repository that contains few commits and pull requests created by different users.
Click on 'Run Pipeline'
Once the register of the repository have finished, go to Pipelines > Create Pipeline Run.
Click on Create Pipeline Run
Scroll down to until 'Advanced Mode' option at the bottom appears.
Click on 'Advanced Mode'.
Create a task in the task editor to launch a GitHub Extractor Task.
Use the following JSON: [ [ { "Plugin": "gitextractor", "Options": { "url": "Url of the repository registered in the previous step ended with .git", "repoId": "Github repository id. It looks like -> github:GithubRepo:384111310", "user": "Name of the user who is the owner of the GitHub Token", "password": "GitHub Token" } } ]
Click on 'Run Pipeline'
Once the scan have finished, connect to the database
Use the table commits
Execute the query: SELECT author_email, author_id FROM lake.commits;

Anything else

Not sure about the exact version of lake we are using, because we are working with the fork of MericoDev.

Version

0.10.0

Are you willing to submit PR?

[ ] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Jun 08 '22 09:06 lukasgomez

Hi, Thanks for your input. There is a catch here, gitextractor collect author_id based on git repo itself, so it wouldn't know about this github user id at all. And because git is a distributed system, one can fork repo multiple times to different platforms (like github to gitlab to bitbucket), say if we collect these folk repos from different platforms, it is hard to determine which platform userid should a commit belongs to. We are currently working on a People entity to try to solve this kind of problem.

Jun 09 '22 09:06 klesh

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

Aug 01 '22 00:08 github-actions[bot]

Bump, I don't think this issue should be closed.

Aug 01 '22 13:08 marcemv90

Hi, @marcemv90 , but the behavior you expected is not viable. As I said, in reality, a commit may or may not have its author_id pointing to github user id. Relying on GithubUserID is very limited and won't be supported.

I suggest that you take a look at the Team Feature, it allows you to connect multiple email addresses to a Unified Identity.

Aug 01 '22 14:08 klesh

The team feature @klesh mentioned is supported in v0.12.0, which will be released soon. Stay tuned.

Aug 01 '22 14:08 Startrekzky

@lukasgomez @marcemv90 Thanks for putting together such a detailed bug report, really appreciate it!

Like @klesh mentioned, the commits table is created by the `gitextractor plugin which directly extracts data from a git repository and it's impossible to tell a git author's GitHub id using the git repo alone.

If you would like to filter/group commits by a specific GitHub user, I'd also recommend looking into the team configuration feature that's going to be shipped in v0.12.0. The release candidate images are already available on docker hub, we're just going through Apache's formal voting process for releases. If you'd like to get a taste now, feel free to try the images below :-)

Aug 01 '22 21:08 hezyin

Hi, We will try this new feature to try to filter commits by GitHub user and test the release candidate images. Hope this solution helps us to achieve the desired behavior. Thanks for your response and your time 😄 .

Aug 02 '22 10:08 lukasgomez

@lukasgomez Hi Lukas, we'd love to support you guys in implementing DevLake for your team and get your feedback. Would you be open to a quick Zoom conversation to get properly connected? Connecting with users is incredibly helpful for us to keep improving DevLake.

Aug 04 '22 01:08 hezyin

Sure! We will discuss with our managers which topics we can raise and then we can try to find time to schedule a meeting. We will be in touch 🙂

Aug 04 '22 13:08 lukasgomez

@lukasgomez Hi Lukas, just checking in on this issue and see if you guys need any further assistance. Also, did you get the chance to discuss with your managers? I'm still up for a chat and any feature request or feedback for DevLake is greatly appreciated.

Aug 24 '22 01:08 hezyin

@lukasgomez Hi Lukas, I'm going to close the issue for now, feel free to reach out if you need more assistance.

Aug 29 '22 17:08 hezyin

@hezyin @klesh Sorry to pull up an old issue but can you provide some details on how the Teams functionality can help with this problem. When implementing Teams it seems that having a record in the accounts table is key to the process so that it an then be mapped through the mapping process. gitextractor does not create any accounts in this table so mapping seems problematic.

Sep 28 '23 00:09 richard-fletcher

@richard-fletcher I think you are right, the gitextractor should fill the accounts table. @Startrekzky what do you think?

Oct 09 '23 09:10 klesh

@klesh I think the gitextractor plugin already populated the accounts table with the email value rather than the platform id? Do you mean that we should populate with the platform's account id?

Oct 09 '23 10:10 Startrekzky

@Startrekzky I think you are correct. Let me test this out and confirm back if I'm still having issues.

Oct 09 '23 15:10 richard-fletcher