incubator-devlake
[Feature][GitExtractor] Support incremental sync when collecting Git data
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Use case
As a user who has large repos with more than 100,000 commits, I'd like to have incremental sync when collecting Git data.
Currently, every pipeline takes more than 5 hours to collect data. That makes it difficult to utilize DevLake in my org.
Description
Support incremental sync in the GitExtractor plugin. Specifically,
| Entity | Sync Mode | Cursor Field |
|---|---|---|
| repos | Full refresh. There's no need to be incremental | N/A |
| refs | Full refresh. There's no create/update date of the ref as far as I know | N/A |
| commits | Incremental | committed_date. It seems to make more sense than commits.authored_date |
| commit_files | Incremental | committed_date. Update the commit_files of the new commits. |
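A minimal sketch of the cursor idea in the table above, assuming the cursor is simply the latest `committed_date` seen in the previous run. The `Commit` type, `SelectNewCommits`, and the `day` helper are hypothetical names for illustration and are not part of DevLake or GitExtractor.

```go
package main

import "time"

// Commit is a hypothetical, simplified commit record (not DevLake's model).
type Commit struct {
	SHA           string
	CommittedDate time.Time
}

// day builds an example timestamp: the n-th day of January 2024, UTC.
func day(n int) time.Time {
	return time.Date(2024, time.January, n, 0, 0, 0, 0, time.UTC)
}

// SelectNewCommits returns the commits whose CommittedDate is strictly after
// cursor, together with the cursor value to store for the next run.
// If nothing is newer than cursor, the cursor is returned unchanged.
func SelectNewCommits(all []Commit, cursor time.Time) ([]Commit, time.Time) {
	next := cursor
	var fresh []Commit
	for _, c := range all {
		if c.CommittedDate.After(cursor) {
			fresh = append(fresh, c)
			if c.CommittedDate.After(next) {
				next = c.CommittedDate
			}
		}
	}
	return fresh, next
}
```

Under this assumption, `commit_files` would be refreshed only for the commits returned in `fresh`. Note that a strict "after" comparison skips a commit that shares the exact cursor timestamp, which is one reason a real implementation may prefer a different cursor.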
Related issues
https://github.com/apache/incubator-devlake/issues/6138
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
GitExtractor currently collects all of a repo's commits every time it is executed. I'll update it, using go-git, so that it only collects the commits added since the last run.
However, GitExtractor will ignore the project's sync policy, such as FullSync or TimeAfter, for the following reasons:
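One way to picture "only collect new commits after last run" is a history walk that stops at commits already stored. The sketch below uses a plain in-memory parent map instead of go-git, so it makes no claim about go-git's API; `commitGraph` and `newCommitsSince` are hypothetical names.

```go
package main

// commitGraph maps a commit SHA to its parent SHAs. It is a hypothetical
// in-memory stand-in for a real repository.
type commitGraph map[string][]string

// newCommitsSince walks from head toward the root commits and returns the
// SHAs that are not in known. The walk does not descend past a known commit,
// assuming all of its ancestors were collected in an earlier run.
func newCommitsSince(g commitGraph, head string, known map[string]bool) []string {
	var fresh []string
	seen := map[string]bool{}
	stack := []string{head}
	for len(stack) > 0 {
		sha := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[sha] || known[sha] {
			continue
		}
		seen[sha] = true
		fresh = append(fresh, sha)
		stack = append(stack, g[sha]...)
	}
	return fresh
}
```

With a history `a <- b <- c <- d` where `a` and `b` are already stored, calling `newCommitsSince` with head `d` yields only `d` and `c`, so the cost of a run scales with the new commits rather than the whole history.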
- When the `components` config is updated, all data in `commit_file_components` should be recalculated. But there is no entry point for updating it in Config UI, so supporting `FullSync` is unnecessary so far.
- In a repository, commit ids may change when a `rebase` happens. After a rebase, the old commits become dangling commits; if they still exist in the database, they have no side effects. The new commit ids will be collected in the next run.
In summary, GitExtractor will not support FullSync for now.
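To illustrate the rebase case, here is a small sketch, again on an in-memory parent map rather than a real repository, that lists stored commits no longer reachable from any ref. `reachableFrom` and `danglingStored` are hypothetical helpers; the comment above only says such rows are harmless, not that GitExtractor detects or removes them.

```go
package main

import "sort"

// reachableFrom returns every SHA reachable from heads by following parent links.
func reachableFrom(parents map[string][]string, heads []string) map[string]bool {
	seen := map[string]bool{}
	stack := append([]string(nil), heads...)
	for len(stack) > 0 {
		sha := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[sha] {
			continue
		}
		seen[sha] = true
		stack = append(stack, parents[sha]...)
	}
	return seen
}

// danglingStored lists, in sorted order, the stored SHAs that no ref can
// reach any more, for example the pre-rebase versions of rewritten commits.
func danglingStored(stored []string, parents map[string][]string, heads []string) []string {
	live := reachableFrom(parents, heads)
	var dangling []string
	for _, sha := range stored {
		if !live[sha] {
			dangling = append(dangling, sha)
		}
	}
	sort.Strings(dangling)
	return dangling
}
```

If `a <- b <- c` is rebased into `a <- b2 <- c2`, the database still holds `b` and `c`, and `danglingStored` reports exactly those two, while `b2` and `c2` would be picked up as new commits on the next run.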
I want to implement this feature with go-git after #6701 is merged. However, the go-git benchmarks I ran in #6701 are discouraging: when collecting commit details, go-git is about 5-6 times slower than libgit2 (46 min vs. 6 hours on clickhouse), which is unacceptable. So I have to delay development until go-git can reach similar performance.
cc @Startrekzky @klesh
This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.
Closed by #7319