
[Question][Module Name] Question title

Open FrankBai811 opened this issue 10 months ago • 8 comments

I'm currently using DevLake to collect data from GitLab, and there are a lot of projects to gather. It usually takes five or six days to complete a full collection cycle. Are there any distributed solutions that could help reduce the collection time?

FrankBai811 avatar May 22 '25 06:05 FrankBai811

The bottleneck is on GitLab’s side, not with Apache DevLake. A distributed solution wouldn’t address this issue.

Also, why is full data collection necessary? Is there a specific issue with incremental collection?

klesh avatar May 26 '25 07:05 klesh

I would correct the bottleneck statement. We have a large-scale enterprise DevLake installation with GitLab/Jira/Sonar connections: 80+ projects with 10-100 repositories each. A full collection cycle takes approximately 2 weeks, where "full collection" != "full data gathering"; it's just a sequential run of all blueprints, with incremental data collection where applicable. It may still take 2-12 hours for large projects. RPS tweaks, increased database resources, and other optimizations helped a little, but the problem remains. From my point of view, the main bottleneck is the single runner, which just walks through all projects one by one. Parallel runners may be more effective.

p1ne avatar May 26 '25 07:05 p1ne

I see. Typically, how many Jira boards, GitLab projects, and Sonar projects does a DevLake project have in your setup?

klesh avatar May 26 '25 07:05 klesh

We have 80+ DevLake projects; each project has one board with up to 30,000 issues and approximately 10-100 repos. The number of Sonar packages is up to 30.

p1ne avatar May 26 '25 07:05 p1ne

Are those repos belonging to the same DevLake Project actually related? DevLake Project is primarily designed for calculating DORA metrics. Typically, a DevLake project includes a few repositories: some for the main codebases, like backend/frontend, and others for DevOps manifests or deployment configurations.

klesh avatar May 26 '25 08:05 klesh

Yes, they are related. Some systems are very large-scale, with a long history and big distributed teams.

p1ne avatar May 26 '25 08:05 p1ne

How about deploying multiple DevLake instances and distributing the projects across them?

klesh avatar May 26 '25 10:05 klesh

Not an option, because developers are shared across teams, and DevLake is used to evaluate their performance/technology KPIs for all projects they're involved in. That is, we aim to use DevLake as the company-wide source of metrics. Multi-project dashboards are also heavily used. In other words, we are in a situation where DevLake, in its current state as a perfectly polished technology concept, meets enterprise volume requirements (and admin tooling complexity requirements too, but that is out of scope for this particular ticket).

p1ne avatar May 26 '25 10:05 p1ne

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in the next 7 days if no further activity occurs.

github-actions[bot] avatar Jul 26 '25 00:07 github-actions[bot]

This issue has been closed because it has been inactive for a long time. You can reopen it if you encounter a similar problem in the future.

github-actions[bot] avatar Aug 02 '25 00:08 github-actions[bot]
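
For readers weighing the single-runner point raised earlier in the thread, here is a rough back-of-envelope sketch of how wall-clock time might change if several runners pulled blueprints from one queue instead of one runner walking them in order. It is not DevLake's scheduler and does not use any DevLake API. The `makespan` function, the per-project durations, and the runner count of 4 are all hypothetical; the durations were only chosen to fall roughly within the 2-12 hour range and to total about two weeks. The model also assumes runners do not slow each other down, which would not hold if GitLab rate limits or the database are the real bottleneck.

```python
import heapq


def makespan(durations_hours, workers):
    """Estimate wall-clock hours when `workers` runners take projects
    from a single queue, in order, as soon as a runner is free.

    Hypothetical model: assumes runs are independent and that adding
    runners does not hit API rate limits or database contention.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Min-heap of the times at which each runner becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for duration in durations_hours:
        start = heapq.heappop(free_at)  # earliest available runner
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Assumed workload: 80 projects taking 2, 3, 4, or 7 hours each
# (320 hours in total, a little over 13 days if run one by one).
durations = [2, 3, 4, 7] * 20

sequential = makespan(durations, 1)  # one runner, one project at a time
parallel = makespan(durations, 4)    # four runners sharing the queue

print(f"1 runner:  {sequential:.0f} h ({sequential / 24:.1f} days)")
print(f"4 runners: {parallel:.0f} h ({parallel / 24:.1f} days)")
```

Under these assumptions the estimate with four runners lands close to a quarter of the sequential time. In practice the gain would be smaller whenever the upstream API or the database limits total throughput, which is the concern behind the first reply in this thread.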