[Bug] [Database] author_id column in commits table of database contains an email instead of an id
Search before asking
- [X] I had searched in the issues and found no similar issues.
What happened
Exploring the structure of the database and the content of the tables once we scanned few repositories, we realized that the values of the column author_id in the table commits, contains the same value of column author_email. We compared this with another table and in pull_requests table there is a column also called author_id, but its content is the unique GitHub id of the user that created the pull request.
What you expected to happen
We expected that the content of author_id in Commits had the same value as author_id in pull_requests (ex: github:GithubUser:11111111). The design of the table commits contains two columns with the same values which means had duplicated values in the table, and it is not an ideal design. Have this unique identifier in the table commits will help to create dashboards in Grafana to obtain more exactly metrics.
Without an unique author_id on commits table, it's not possible obtain all the commits of an individual as now each combination of email + display name makes a different commiter.
Actual behavior:
| author_name | author_email | author_id |
|---|---|---|
| Jon Doe | [email protected] | [email protected] |
| Jon Doe | [email protected] | [email protected] |
(Both emails belong to the same person)
Expected behavior:
| author_name | author_email | author_id |
|---|---|---|
| Jon Doe | [email protected] | github:GithubUser:11111111 |
| Jon Doe | [email protected] | github:GithubUser:11111111 |
How to reproduce
- Have a GitHub connection configured with a token
- Go to Pipelines > Create Pipeline Run.
- Click on Create Pipeline Run.
- Scroll down to until 'Github' is shown in Data Providers list.
- Toggle on GitHub Data provider
- Enter repository owner and name for a repository that contains few commits and pull requests created by different users.
- Click on 'Run Pipeline'
- Once the register of the repository have finished, go to Pipelines > Create Pipeline Run.
- Click on Create Pipeline Run
- Scroll down to until 'Advanced Mode' option at the bottom appears.
- Click on 'Advanced Mode'.
- Create a task in the task editor to launch a GitHub Extractor Task.
- Use the following JSON:
[ [ { "Plugin": "gitextractor", "Options": { "url": "Url of the repository registered in the previous step ended with .git", "repoId": "Github repository id. It looks like -> github:GithubRepo:384111310", "user": "Name of the user who is the owner of the GitHub Token", "password": "GitHub Token" } } ] - Click on 'Run Pipeline'
- Once the scan have finished, connect to the database
- Use the table
commits - Execute the query:
SELECT author_email, author_id FROM lake.commits;
Anything else
Not sure about the exact version of lake we are using, because we are working with the fork of MericoDev.
Version
0.10.0
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Hi, Thanks for your input.
There is a catch here, gitextractor collect author_id based on git repo itself, so it wouldn't know about this github user id at all.
And because git is a distributed system, one can fork repo multiple times to different platforms (like github to gitlab to bitbucket), say if we collect these folk repos from different platforms, it is hard to determine which platform userid should a commit belongs to.
We are currently working on a People entity to try to solve this kind of problem.
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
Bump, I don't think this issue should be closed.
Hi, @marcemv90 , but the behavior you expected is not viable.
As I said, in reality, a commit may or may not have its author_id pointing to github user id.
Relying on GithubUserID is very limited and won't be supported.
I suggest that you take a look at the Team Feature, it allows you to connect multiple email addresses to a Unified Identity.
The team feature @klesh mentioned is supported in v0.12.0, which will be released soon. Stay tuned.
@lukasgomez @marcemv90 Thanks for putting together such a detailed bug report, really appreciate it!
Like @klesh mentioned, the commits table is created by the `gitextractor plugin which directly extracts data from a git repository and it's impossible to tell a git author's GitHub id using the git repo alone.
If you would like to filter/group commits by a specific GitHub user, I'd also recommend looking into the team configuration feature that's going to be shipped in v0.12.0. The release candidate images are already available on docker hub, we're just going through Apache's formal voting process for releases. If you'd like to get a taste now, feel free to try the images below :-)
Hi, We will try this new feature to try to filter commits by GitHub user and test the release candidate images. Hope this solution helps us to achieve the desired behavior. Thanks for your response and your time 😄 .
@lukasgomez Hi Lukas, we'd love to support you guys in implementing DevLake for your team and get your feedback. Would you be open to a quick Zoom conversation to get properly connected? Connecting with users is incredibly helpful for us to keep improving DevLake.
Sure! We will discuss with our managers which topics we can raise and then we can try to find time to schedule a meeting. We will be in touch 🙂
@lukasgomez Hi Lukas, just checking in on this issue and see if you guys need any further assistance. Also, did you get the chance to discuss with your managers? I'm still up for a chat and any feature request or feedback for DevLake is greatly appreciated.
@lukasgomez Hi Lukas, I'm going to close the issue for now, feel free to reach out if you need more assistance.
@hezyin @klesh Sorry to pull up an old issue but can you provide some details on how the Teams functionality can help with this problem. When implementing Teams it seems that having a record in the accounts table is key to the process so that it an then be mapped through the mapping process. gitextractor does not create any accounts in this table so mapping seems problematic.
@richard-fletcher I think you are right, the gitextractor should fill the accounts table.
@Startrekzky what do you think?
@klesh I think the gitextractor plugin already populated the accounts table with the email value rather than the platform id? Do you mean that we should populate with the platform's account id?
@Startrekzky I think you are correct. Let me test this out and confirm back if I'm still having issues.