[Bug][gitlab] The data in the commits table is different from the data in GitLab
Search before asking
- [X] I had searched in the issues and found no similar issues.
What happened
While attempting to use the commits table for a code volume analysis, I encountered an unexpected scenario: a portion of the additions in the commits table did not match the additions displayed on the GitLab page, especially following merge operations. To confirm whether this was an issue with the token, I used the same token to write a Python script for verification. The results showed that indeed, there are occasional discrepancies between the data in the commits table and the data on GitLab.
What do you expect to happen
I hope these discrepancies can be minimized as much as possible, since I'm not sure whether they might one day skew the code volume statistics for an individual contributor.
How to reproduce
First, create a feature branch from the main branch, then commit some code on the feature branch. After that, switch back to the main branch and commit changes to a different file (ensuring there are no merge conflicts). Finally, merge the feature branch back into the main branch. After the merge, click the "Collect Data" button to retrieve the commits data from the commits table and compare it with the data on GitLab.
Anything else
No response
Version
v1.0.0-beta7@2cc08c7
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
To address the discrepancies between the commits table and GitLab data, especially after merge operations, it's crucial to dive deeper into the specifics of your situation. Here's how we can proceed:
- Share the Python Script: The script you used for verification could provide insights into how data is being fetched and compared. This will help in understanding the methodology behind your findings.
- Identify Patterns: Knowing if these discrepancies follow any specific patterns, such as occurring more with larger commits or certain file types, can help in pinpointing the cause.
- Provide Examples: Examples of commit hashes with their discrepancies in additions count between the commits table and GitLab would be invaluable for a detailed investigation.
- Scope of the Issue: Clarifying whether this issue is isolated to specific repositories or is more widespread will aid in diagnosing the problem more effectively.
This information will be instrumental in uncovering the root cause of the discrepancies you're observing.
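For collecting the examples requested above, a minimal Python sketch along these lines may help. It assumes the GitLab side comes from the repository commits API requested with `with_stats=true` (so each commit carries a `stats.additions` field), and that the commits table has already been exported into a plain `sha -> additions` mapping. The function names and the sample SHAs are hypothetical and only for illustration.

```python
import json
from typing import Dict, List, Tuple


def gitlab_additions(payload: str) -> Dict[str, int]:
    """Map commit SHA -> additions from a GitLab commits API response body.

    Assumes the list was requested with with_stats=true, so every item
    has a "stats" object with an "additions" count.
    """
    return {c["id"]: c["stats"]["additions"] for c in json.loads(payload)}


def find_mismatches(
    table: Dict[str, int], gitlab: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (sha, table_additions, gitlab_additions) for every commit
    that appears in both sources with a different additions count."""
    shared = sorted(table.keys() & gitlab.keys())
    return [(sha, table[sha], gitlab[sha]) for sha in shared if table[sha] != gitlab[sha]]


if __name__ == "__main__":
    # Made-up sample data; real SHAs and counts would come from your instance.
    sample_payload = json.dumps([
        {"id": "aaa111", "stats": {"additions": 10, "deletions": 2, "total": 12}},
        {"id": "bbb222", "stats": {"additions": 7, "deletions": 0, "total": 7}},
    ])
    table_rows = {"aaa111": 10, "bbb222": 9}
    for sha, in_table, in_gitlab in find_mismatches(table_rows, gitlab_additions(sample_payload)):
        print(f"{sha}: table={in_table} gitlab={in_gitlab}")
```

Posting the resulting list of SHAs with both counts makes it easier to see whether the mismatches cluster around merge commits.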
Thanks for reporting.
It appears the libgit2 library we're using applies a different diff algorithm than GitLab does. Unfortunately, this is likely expected behavior rather than a bug, and it wouldn't be easily configurable.
Hi, we checked the logic. The difference between additions from GitLab APIs and DevLake is because:
- DevLake doesn't use GitLab APIs, but the gitextractor plugin to collect commits.
- The gitextractor plugin uses the libgit2 library to calculate the commits' additions and deletions.
- The logic of calculating the additions in gogit and GitLab APIs are slightly different.
Thus, this is a problem that cannot be addressed in DevLake for now. We can't switch gitextractor to the GitLab, GitHub, Bitbucket, or Azure DevOps APIs to collect commits (Git) data, because that would increase the collection time by 10x.
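To illustrate how two valid ways of diffing the same change can report different additions, here is a minimal Python sketch. It is not the algorithm used by libgit2, gitextractor, or GitLab; both strategies, the function names, and the sample lines are hypothetical and chosen only to show the effect.

```python
from typing import List, Tuple


def lcs_length(old: List[str], new: List[str]) -> int:
    """Length of the longest common subsequence of two lists of lines."""
    prev = [0] * (len(new) + 1)
    for a in old:
        curr = [0]
        for j, b in enumerate(new, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def minimal_diff_stats(old: List[str], new: List[str]) -> Tuple[int, int]:
    """(additions, deletions) for a minimal line diff: only lines outside
    the longest common subsequence are counted as changed."""
    common = lcs_length(old, new)
    return len(new) - common, len(old) - common


def prefix_suffix_diff_stats(old: List[str], new: List[str]) -> Tuple[int, int]:
    """(additions, deletions) for a cruder diff that trims the common prefix
    and suffix and treats everything in between as replaced."""
    limit = min(len(old), len(new))
    p = 0
    while p < limit and old[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return len(new) - p - s, len(old) - p - s


if __name__ == "__main__":
    old = ["a", "b", "c", "d", "e"]
    new = ["a", "x", "c", "y", "e"]
    print(minimal_diff_stats(old, new))        # (2, 2)
    print(prefix_suffix_diff_stats(old, new))  # (3, 3)
```

Both results describe the same change, and the net line change (additions minus deletions) is identical, but the additions count alone differs. This is why comparing net change per commit can be more stable across tools than comparing additions directly.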
This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in the next 7 days if no further activity occurs.
This issue has been closed because it has been inactive for a long time. You can reopen it if you encounter a similar problem in the future.