[Feature][GitExtractor] Support incremental sync when collecting Git data

Open Startrekzky opened this issue 1 year ago • 2 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Use case

As a user who has large repos with more than 100,000 commits, I'd like to have incremental sync when collecting Git data.

Currently, every pipeline takes more than 5 hours to collect data. That makes it difficult to utilize DevLake in my org.

Description

Support incremental sync in the GitExtractor plugin. Specifically,

| Entity | Sync Mode | Cursor Field |
| --- | --- | --- |
| repos | Full refresh. There's no need to be incremental | N/A |
| refs | Full refresh. There's no create/update date of the ref as far as I know | N/A |
| commits | Incremental | committed_date. It seems to make more sense than commits.authored_date |
| commit_files | Incremental | committed_date. Update the commit_files of the new commits. |
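
To make the proposal concrete, here is a minimal sketch of cursor-based incremental sync keyed on committed_date. It is not GitExtractor's code: the `Commit` class and the `incremental_commits` function are hypothetical names used only for illustration, and the commit list stands in for whatever the Git library returns.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    # Hypothetical record; only the fields named in the table above.
    sha: str
    committed_date: datetime
    authored_date: datetime


def incremental_commits(
    all_commits: Iterable[Commit], cursor: Optional[datetime]
) -> Tuple[List[Commit], Optional[datetime]]:
    """Return commits strictly newer than the cursor, plus the advanced cursor.

    A cursor of None means "first run", so every commit is returned.
    """
    new = [c for c in all_commits if cursor is None or c.committed_date > cursor]
    new.sort(key=lambda c: c.committed_date)
    # Keep the old cursor when nothing new arrived.
    next_cursor = new[-1].committed_date if new else cursor
    return new, next_cursor


def day(d: int) -> datetime:
    return datetime(2024, 1, d, tzinfo=timezone.utc)


history = [
    Commit("a", committed_date=day(1), authored_date=day(1)),
    Commit("b", committed_date=day(3), authored_date=day(2)),
    Commit("c", committed_date=day(5), authored_date=day(4)),
]

# First run collects everything; the second run only sees commits after the cursor.
first_batch, cursor = incremental_commits(history, None)
second_batch, cursor = incremental_commits(history, day(3))
```

commit_files would then be collected only for the commits in the returned batch. One caveat of the strict comparison: a commit whose committed_date equals the stored cursor would be skipped, so a real implementation might compare with `>=` and de-duplicate by sha.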

Related issues

https://github.com/apache/incubator-devlake/issues/6138

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

Startrekzky avatar Jan 19 '24 12:01 Startrekzky

GitExtractor currently collects all of a repo's commits every time it is executed. I'll update it so that GitExtractor uses go-git to collect only the commits added since the last run. However, GitExtractor will ignore the project's sync policy, such as FullSync or TimeAfter, for the following reasons:

  1. When the components config is updated, all data in commit_file_components should be recalculated. But there is no entry point for updating it in Config UI, so supporting FullSync is unnecessary so far.
  2. In a repository, commit ids may change when a rebase happens. After a rebase, the old commits become dangling commits; if they still exist in the database, they have no side effects, and the new commit ids will be collected in the next run.

In summary, GitExtractor will not support FullSync for now.
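
The rebase argument in point 2 can be illustrated with a small sketch. It assumes, purely for illustration, that new commits are found by walking parents from the branch head and stopping at commit ids already stored; the comment above does not say GitExtractor works this way, and `collect_new_commits` and the toy commit graph are hypothetical.

```python
from typing import Dict, List, Set


def collect_new_commits(
    parents: Dict[str, List[str]], head: str, known: Set[str]
) -> List[str]:
    """Walk history from head and return commit ids not yet stored."""
    new: List[str] = []
    seen: Set[str] = set()
    stack = [head]
    while stack:
        sha = stack.pop()
        if sha in known or sha in seen:
            continue  # already stored or already visited: stop descending here
        seen.add(sha)
        new.append(sha)
        stack.extend(parents.get(sha, []))
    return new


# Before the rebase the database holds a <- b <- c.
stored = {"a", "b", "c"}

# After the rebase, b and c are rewritten as b2 and c2 on top of a.
graph_after_rebase = {"a": [], "b2": ["a"], "c2": ["b2"]}

collected = collect_new_commits(graph_after_rebase, head="c2", known=stored)
stored |= set(collected)

# The old ids b and c stay in the database as dangling commits,
# unreachable from the new head but otherwise harmless.
dangling = stored - set(graph_after_rebase)
```

In this toy run the next sync picks up only c2 and b2, while b and c remain stored as dangling rows, which matches the behavior described in point 2.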

d4x1 avatar Feb 06 '24 13:02 d4x1

I want to implement this feature with the go-git package after #6701 is merged. But in #6701 I ran some benchmarks with go-git, and the results are discouraging. When collecting commit details, go-git is about 5-6 times slower than libgit2 (46 min vs 6 hours on clickhouse), which is unacceptable. So I have to delay development until we can get similar performance with go-git.

cc @Startrekzky @klesh

d4x1 avatar Feb 21 '24 09:02 d4x1

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Apr 22 '24 00:04 github-actions[bot]

Closed by #7319

klesh avatar Apr 24 '24 06:04 klesh