[Question] Are there distributed solutions to reduce GitLab data collection time?
I'm currently using DevLake to collect data from GitLab, and there are many projects to collect. It usually takes five or six days to complete a full collection cycle. Are there any distributed solutions that could help reduce the collection time?
The bottleneck is on GitLab’s side, not with Apache DevLake. A distributed solution wouldn’t address this issue.
Also, why is full data collection necessary? Is there a specific issue with incremental collection?
I would correct the bottleneck statement. We have a large-scale enterprise DevLake installation with GitLab/Jira/Sonar connections: 80+ projects with 10-100 repositories each. A full collection cycle takes approximately 2 weeks, where "full collection" != "full data gathering" -- it's just a sequential run of all blueprints, with incremental data collection where applicable. It can still take 2-12 hours for large projects. RPS tweaks, database resource increases, and other optimizations helped a little, but not enough. From my point of view, the main bottleneck is the single runner, which just walks all projects one by one. Parallel runners could be more effective.
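To illustrate what parallel runners could look like, here is a minimal Go sketch of a bounded worker pool that collects several projects concurrently instead of walking them one by one. All names here (`collectProject`, `collectAll`) are hypothetical stand-ins, not DevLake's actual pipeline API; this only demonstrates the concurrency pattern being proposed.

```go
package main

import (
	"fmt"
	"sync"
)

// collectProject is a hypothetical stand-in for running one project's
// blueprint (the expensive, mostly I/O-bound step).
func collectProject(name string) string {
	return "collected " + name
}

// collectAll runs collection for all projects with at most `workers`
// runners in flight, instead of a single sequential walk.
func collectAll(projects []string, workers int) []string {
	sem := make(chan struct{}, workers) // buffered channel as a semaphore
	results := make([]string, len(projects))
	var wg sync.WaitGroup
	for i, p := range projects {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a runner slot
			defer func() { <-sem }() // release it when done
			results[i] = collectProject(p)
		}(i, p)
	}
	wg.Wait()
	return results
}

func main() {
	projects := []string{"proj-a", "proj-b", "proj-c"}
	for _, r := range collectAll(projects, 2) {
		fmt.Println(r)
	}
}
```

With I/O-bound collection, even a small worker count like this would overlap waiting on the GitLab API across projects; the pool size would still need to respect upstream rate limits.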
I see. Typically, how many Jira boards, GitLab projects, and Sonar projects does a DevLake project have in your setup?
We have 80+ DevLake projects; each project has one board with up to 30,000 issues and approximately 10-100 repos. The number of Sonar projects is up to 30.
Are those repos belonging to the same DevLake Project actually related? DevLake Project is primarily designed for calculating DORA metrics. Typically, a DevLake project includes a few repositories: some for the main codebases, like backend/frontend, and others for DevOps manifests or deployment configurations.
Yes, they are related. Some systems are very large-scale, with long histories and big distributed teams.
How about deploying multiple DevLake instances and distributing the projects across them?
Not an option, because developers are shared across teams, and DevLake is used to evaluate their performance/technology KPIs across all projects they're involved in. That is, we aim to use DevLake as the whole company's source of metrics, and multi-project dashboards are heavily used. In other words, we have a situation where DevLake, in its current state as a well-polished technology concept, meets enterprise volume requirements (and also admin-tooling complexity requirements, but that is out of scope for this particular ticket).
This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in the next 7 days if no further activity occurs.
This issue has been closed because it has been inactive for a long time. You can reopen it if you encounter a similar problem in the future.