incubator-devlake

[Bug][gitlab] The data in the commits table is different from the data in GitLab

Open Shikanor opened this issue 1 year ago • 4 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

While attempting to use the commits table for a code volume analysis, I encountered an unexpected scenario: a portion of the additions in the commits table did not match the additions displayed on the GitLab page, especially following merge operations. To confirm whether this was an issue with the token, I used the same token to write a Python script for verification. The results showed that indeed, there are occasional discrepancies between the data in the commits table and the data on GitLab.

What do you expect to happen

I hope these discrepancies can be minimized as much as possible, as I'm not sure whether they might one day skew the code volume statistics for an individual contributor.

How to reproduce

First, create a feature branch from the main branch, then commit some code on the feature branch. After that, switch back to the main branch and commit changes to a different file (ensuring there are no merge conflicts). Finally, merge the feature branch back into the main branch. After the merge, click the "Collect Data" button to retrieve the commits data from the commits table and compare it with the data on GitLab.
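The comparison in the last step can be sketched in Python roughly as follows. This is not the reporter's original script, which is not included in the thread. It assumes the DevLake commits table exposes `sha`, `additions`, and `deletions` columns, and it reads GitLab's numbers from the `stats` object returned by the single-commit endpoint (`GET /api/v4/projects/:id/repository/commits/:sha`). The function names and parameters are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def fetch_gitlab_stats(base_url, project, sha, token):
    """Return (additions, deletions) for one commit as reported by GitLab."""
    # The project may be a numeric ID or a URL-encoded "group/name" path.
    project_id = urllib.parse.quote(str(project), safe="")
    url = f"{base_url}/api/v4/projects/{project_id}/repository/commits/{sha}"
    req = urllib.request.Request(url, headers={"PRIVATE-TOKEN": token})
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)["stats"]
    return stats["additions"], stats["deletions"]


def find_mismatches(devlake_rows, gitlab_stats):
    """List commits whose DevLake numbers differ from GitLab's.

    devlake_rows: iterable of (sha, additions, deletions) tuples, e.g. the
                  result of a SELECT on the commits table (assumed columns).
    gitlab_stats: dict mapping sha -> (additions, deletions) from GitLab.
    Commits missing on the GitLab side are skipped rather than reported.
    """
    mismatches = []
    for sha, additions, deletions in devlake_rows:
        expected = gitlab_stats.get(sha)
        if expected is not None and expected != (additions, deletions):
            mismatches.append((sha, (additions, deletions), expected))
    return mismatches
```

Keeping the network call separate from the comparison makes it easy to list the affected commit hashes together with both sets of numbers.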

Anything else

No response

Version

v1.0.0-beta7@2cc08c7

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

Shikanor avatar May 17 '24 04:05 Shikanor

To address the discrepancies between the commits table and GitLab data, especially after merge operations, it's crucial to dive deeper into the specifics of your situation. Here's how we can proceed:

  1. Share the Python Script: The script you used for verification could provide insights into how data is being fetched and compared. This will help in understanding the methodology behind your findings.

  2. Identify Patterns: Knowing if these discrepancies follow any specific patterns, such as occurring more with larger commits or certain file types, can help in pinpointing the cause.

  3. Provide Examples: Examples of commit hashes with their discrepancies in additions count between the commits table and GitLab would be invaluable for a detailed investigation.

  4. Scope of the Issue: Clarifying whether this issue is isolated to specific repositories or is more widespread will aid in diagnosing the problem more effectively.

This information will be instrumental in uncovering the root cause of the discrepancies you're observing.

About Dosu This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

dosubot[bot] avatar May 17 '24 04:05 dosubot[bot]

Thanks for reporting.

Startrekzky avatar May 20 '24 12:05 Startrekzky

It appears the libgit2 library we're using has a different diff algorithm compared to GitLab. Unfortunately, this is likely an expected behavior rather than a bug and wouldn't be easily configurable.

klesh avatar May 20 '24 14:05 klesh

Hi, we checked the logic. The difference between additions from GitLab APIs and DevLake is because:

  • DevLake doesn't use GitLab APIs to collect commits; it uses the gitextractor plugin.
  • The gitextractor plugin uses the libgit2 library to calculate each commit's additions and deletions.
  • The logic for calculating additions in libgit2 differs slightly from that of the GitLab APIs.

Thus, this is a problem that cannot be addressed in DevLake for now. We can't switch gitextractor to the GitLab, GitHub, Bitbucket, or Azure DevOps APIs to collect commit (Git) data, as that would increase the collection time roughly 10x.
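To illustrate why two tools can report different additions for the same change, here is a minimal Python sketch. Neither function is the actual libgit2 or GitLab algorithm; both are simplified stand-ins chosen only to show that more than one valid line diff exists for the same pair of files.

```python
def lcs_length(old, new):
    """Length of the longest common subsequence of two lists of lines."""
    prev = [0] * (len(new) + 1)
    for a in old:
        curr = [0]
        for j, b in enumerate(new, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def minimal_stats(old, new):
    """(additions, deletions) of a shortest possible line diff."""
    common = lcs_length(old, new)
    return len(new) - common, len(old) - common


def trimmed_stats(old, new):
    """(additions, deletions) of a cruder diff: strip the common prefix and
    suffix, then treat everything in between as fully replaced."""
    start = 0
    while start < len(old) and start < len(new) and old[start] == new[start]:
        start += 1
    end = 0
    while (end < len(old) - start and end < len(new) - start
           and old[-1 - end] == new[-1 - end]):
        end += 1
    return len(new) - start - end, len(old) - start - end


old = ["a", "b", "c", "d", "e"]
new = ["a", "x", "c", "y", "e"]
print(minimal_stats(old, new))  # (2, 2)
print(trimmed_stats(old, new))  # (3, 3)
```

Both results describe the same change correctly. Only the difference between additions and deletions is fixed by the file contents, so the individual counts depend on which diff strategy a tool uses.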

Startrekzky avatar May 21 '24 09:05 Startrekzky

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in the next 7 days if no further activity occurs.

github-actions[bot] avatar Jul 21 '24 00:07 github-actions[bot]

This issue has been closed because it has been inactive for a long time. You can reopen it if you encounter a similar problem in the future.

github-actions[bot] avatar Jul 28 '24 00:07 github-actions[bot]