incubator-devlake icon indicating copy to clipboard operation
incubator-devlake copied to clipboard

[Bug][DORA-change_lead_time_calculator] Wrong deployment reference

Open kostas-petrakis opened this issue 1 year ago • 18 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

While analyzing the data gathered from our GitHub integration, I stumbled upon an anomaly in a minor subset of deployments. Oddly enough, some deployments were reporting a deployment of incosistent time (in my example 30 weeks). Given my familiarity with our GitHub operations, I was certain this wasn't accurate, prompting me to dig deeper. My investigation points towards a potential issue with the getDeploymentCommit function, particularly in relation to EnrichPrevSuccessDeploymentCommit.

To illustrate, I've attached an example PR

{
"pull_requests": [
	{
		"id" : "github:GithubPullRequest:4:1806504909",
		"created_at" : "2024-11-12 08:34:13.134",
		"updated_at" : "2024-11-12 08:34:13.134",
		"_raw_data_params" : "{\"ConnectionId\":4,\"Name\":\"********\"}",
		"_raw_data_table" : "_raw_github_api_pull_requests",
		"_raw_data_id" : 4499,
		"_raw_data_remark" : "",
		"base_repo_id" : "github:GithubRepo:4:***********",
		"base_ref" : "main",
		"base_commit_sha" : "a51af0dabd2d9e9bd7b23e432688842cf086c248",
		"head_repo_id" : "github:GithubRepo:4:*******",
		"head_ref" : "feature\***************",
		"head_commit_sha" : "e6a73f3095d1c97d4d4576e3c06719fb07938bd1",
		"merge_commit_sha" : "b297c73092685c4b1a0ee62c2326d77115a86e94",
		"status" : "MERGED",
		"original_status" : "closed",
		"type" : "",
		"component" : "",
		"title" : "****************************",
		"description" : "null",
		"url" : "https:\/\/github.com\/******\/*******\/pull\/1025",
		"author_name" : "***************",
		"author_id" : "github:GithubAccount:4:***********",
		"parent_pr_id" : "",
		"pull_request_key" : 1025,
		"created_date" : "2024-04-04 09:08:03",
		"merged_date" : "2024-04-05 09:00:16",
		"closed_date" : "2024-04-05 09:00:16",
		"additions" : 0,
		"deletions" : 0,
		"merged_by_name" : "",
		"merged_by_id" : "github:GithubAccount:4:0",
		"is_draft" : 0
	}
]}

As you can see this PR was merged 2024-04-05 09:00:16 yet the deployment linked is incorrect, as it belongs to a different deployment.

{
"project_pr_metrics": [
	{
		"id" : "github:GithubPullRequest:4:1806504909",
		"created_at" : "2024-11-12 08:38:29.797",
		"updated_at" : "2024-11-12 08:38:29.797",
		"_raw_data_params" : "",
		"_raw_data_table" : "",
		"_raw_data_id" : 0,
		"_raw_data_remark" : "",
		"project_name" : "*************",
		"first_commit_sha" : "3726ef9fce54e1f95717ab0fbe695403cbf850ff",
		"pr_coding_time" : 1,
		"first_review_id" : "github:GithubPrReview:4:******",
		"pr_pickup_time" : 4,
		"pr_review_time" : 1429,
		"deployment_commit_id" : "github:GithubRun:4:435823930:11612162783:https:\/\/github.com\/*****\/****",
		"pr_deploy_time" : 301201,
		"pr_cycle_time" : 302635,
		"first_commit_authored_date" : "2024-04-04 09:07:51",
		"pr_created_date" : "2024-04-04 09:08:03",
		"first_comment_date" : "2024-04-04 09:11:44",
		"pr_merged_date" : "2024-04-05 09:00:16",
		"pr_deployed_date" : "2024-10-31 13:00:17"
	}
]}

Manually checking in the cicd_deployment_commits for the merge_commit_sha (b297c73092685c4b1a0ee62c2326d77115a86e94) I can see that the deployment is there:

{
"cicd_deployment_commits": [
	{
		"id" : "github:GithubRun:4:435823930:8567339272:https:\/\/github.com\/l*****\/****",
		"created_at" : "2024-11-12 11:17:32.066",
		"updated_at" : "2024-11-12 11:17:33.602",
		"_raw_data_params" : "{\"ConnectionId\":4,\"Name\":\"*****\/****\"}",
		"_raw_data_table" : "_raw_github_api_runs",
		"_raw_data_id" : 2338,
		"_raw_data_remark" : "prev_deployment_commit_id_enricher,",
		"cicd_scope_id" : "github:GithubRepo:4:435823930",
		"cicd_deployment_id" : "github:GithubRun:4:435823930:8567339272",
		"name" : "CI\/CD",
		"result" : "SUCCESS",
		"original_result" : "success",
		"status" : "DONE",
		"original_status" : "completed",
		"environment" : "PRODUCTION",
		"original_environment" : "",
		"created_date" : "2024-04-05 09:00:19",
		"queued_date" : null,
		"queued_duration_sec" : null,
		"started_date" : "2024-04-05 09:00:19",
		"finished_date" : "2024-04-05 09:14:22",
		"duration_sec" : 843.0,
		"commit_sha" : "b297c73092685c4b1a0ee62c2326d77115a86e94",
		"ref_name" : "main",
		"repo_id" : "github:GithubRepo:4:435823930",
		"repo_url" : "https:\/\/github.com\/*****\/****",
		"prev_success_deployment_commit_id" : "",
		"commit_msg" : "",
		"subtask_name" : "dora.generateDeploymentCommits",
		"display_title" : "Merge pull request #1025 from *****\/****",
		"url" : "https:\/\/github.com\/*****\/****/actions\/runs\/8567339272"
	}
]}

Upon inspection, I've noticed that the prev_success_deployment_commit_id field is empty in our data. I suspect this might be influencing the results of the query in the change_lead_time_calculator.go module.

I'm uncertain if the query should return null when faced with this condition, given that prev_success_deployment_commit_id is consistently empty for the first deployment recorded for the cicd_scope_id/repo_url/env grouping, as per code.

Interestingly, the linked "deployment_commit_id" : "github:GithubRun:4:435823930:11612162783 is the final entry in the cicd_deployment_commits table for the cicd_scope_id/repo_url/env grouping.

What do you expect to happen

PR deployment should be linked to the proper run for this example (github:GithubRun:4:435823930:8567339272:https:\/\/github.com\/l*****\/****)

How to reproduce

I will just describe what the status of that repo is, GitHub, deployment through CI/CD workflow, explore for the first PR deployment recorded by Devlake and evaluate the linked deployment. This issue is only related to the first deployment for each cicd_scope_id/repo_url/env grouping.

Anything else

No response

Version

v1.0.1@e061ef2

Are you willing to submit PR?

  • [x] Yes I am willing to submit a PR!

Code of Conduct

kostas-petrakis avatar Nov 12 '24 13:11 kostas-petrakis

Hello @kostas-petrakis! I'm here to help you with any bugs, questions, or contributions you have. Let me know how I can assist you while we wait for a human maintainer.

The issue you're encountering seems to be related to how the getDeploymentCommit function handles cases where the prev_success_deployment_commit_id is empty. This situation typically occurs for the first deployment in a cicd_scope_id/repo_url/env grouping, as the EnrichPrevSuccessDeploymentCommit function resets this field to an empty string for the first deployment in a new set of consecutive deployments [1].

In the change_lead_time_calculator.go module, the logic for selecting the deployment commit might not be correctly handling cases where prev_success_deployment_commit_id is empty, leading to incorrect deployment references being linked. This could explain why the deployment time is inaccurately reported as 30 weeks.

To address this, you might need to adjust the logic in the getDeploymentCommit function to ensure it correctly identifies and links the first deployment in such groupings. This could involve modifying the query to account for cases where prev_success_deployment_commit_id is empty and ensuring it selects the appropriate deployment commit for the first deployment scenario.

Since this issue is specific to the first deployment in each grouping, reviewing and potentially updating the logic in both the getDeploymentCommit and EnrichPrevSuccessDeploymentCommit functions to handle these edge cases more effectively would be a good starting point.

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Nov 12 '24 13:11 dosubot[bot]

Hi @kostas-petrakis , I saw your PR https://github.com/apache/incubator-devlake/pull/8206

One of DevLake's strategies of PR-deployment_commit mapping is to skip the mapping of the very first deployment_commit in DevLake. Why?

Because in many cases,

  1. Users will collect both PRs and deployment_commits within a selected timeframe.
  2. When finding a PR's associated deployment_commit, we need to diff the commits between every consecutive deployment_commits, which means we don't know the precise commits deployed by the very first deployment_commit collected by DevLake.
  3. Hence, we don't know the PRs deployed by the very first deployment_commit.

In your case, the PR github:GithubPullRequest:4:1806504909 might be deployed by the first deployement_commit github:GithubRun:4:435823930:8567339272:https:\/\/github.com\/l*****\/**** for sure.

However, if we allow the first deployment_commit mapped to PRs. If the deploy time between the prior deployment to the first deployment collected by DevLake, many irrelevant PRs merged during this period will be mis-mapped to the deployment commit (some other users had encountered the problem).

Finally, we adopted the above strategy of excluding the very first deployment when mapping to PR.

Startrekzky avatar Nov 27 '24 11:11 Startrekzky

@Startrekzky thanks for the explanation. However, in my specific use case (and I assume to others looking at the upvotes), I've observed that PRs associated with the initial recorded commit are being mapped to an incorrect deployment. This seems to introduce a different kind of mis-mapping issue, which affects the accuracy of our metrics (as explained with the data above). I'm wondering if there might be a way to refactor the PR-deployment mapping process to handle this situation more accurately.

kostas-petrakis avatar Dec 04 '24 14:12 kostas-petrakis

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Feb 03 '25 00:02 github-actions[bot]

We've been tracking this and it's still an issue for us! Although we're transitioning over to Github Actions we still have a lot of teams using Github for commits and an old Jenkins set up for deploy. Where devlake is finding PRs within Github but can't find a corresponding deploy it's choosing (we think) the first and so earliest deploy available to it, creating massive data skews of 10s of weeks.

If there is a PR that says it's been deployed but devlake can't find a deploy job we'd expect it to just not include that PR at all

marktuckcp avatar Feb 07 '25 08:02 marktuckcp

We are in the process of analyzing some of the PRs that have been wrongly mapped to non relevant deployments. This is likely due to the previously mentioned query, which appears to be producing inaccurate results during transformation. We are currently investigating affected PRs.

kostas-petrakis avatar Feb 26 '25 08:02 kostas-petrakis

Appreciate it @kostas-petrakis ! We'll keep an eye on this thread, thanks

marktuckcp avatar Mar 25 '25 14:03 marktuckcp

Hello @kostas-petrakis , we are also keeping track of this ticket since we are affected. Have you found out anything more about your affected PRs? Thanks.

nicolavolpini avatar May 08 '25 07:05 nicolavolpini

Hi @nicolavolpini I have to revisit this issue, as my original tests were performed over older versions of Apache Devlake locally. Since then we deployed v1.0.2-beta4 for "production" use. I will try to run a local experiment at some point and see if this case is still applicable. Could you share which version of Apache Devlake you have deployed?

kostas-petrakis avatar May 08 '25 17:05 kostas-petrakis

Hello Kostas, thanks for the update. We are on v1.0.2-beta7.

nicolavolpini avatar May 09 '25 07:05 nicolavolpini

@Startrekzky I am seeing similar behaviour although a different use case.

We have historic PRs that do not have any deployment commits associated.

During testing I noticed that all historic PRs without an associated deployment can be simultaneously populated by a deployment commit ID that has no link to it.

This doesnt happen all the time so I am investigating what might trigger it.

I thought looking at the GO code that there has to be a merge commit hash match for a deployment commit ID to associate with a pull request in project PR metrics table.

This is hugely annoying for our usecase as it completely skews the median lead time for change.

v1.0.2-beta6

hera-a-glitch avatar May 23 '25 12:05 hera-a-glitch

Following up on this issue now that we've resolved the more urgent missing DORA calculations. I can confirm that we're still seeing the same behavior on some of the newly onboarded services. As @hera-a-glitch pointed out, the behavior isn't entirely consistent—likely due to recalculation of DORA? (need to check) From what I’ve observed, this issue only affects the first batch of collected commits. The current logic attempts to find a previous deployment; if none is found, it links the PR deployment to the latest available one (if I’m recalling the behavior correctly). I'll dig deeper into this and explore potential solutions. Thanks for your patience!

kostas-petrakis avatar May 23 '25 14:05 kostas-petrakis

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Jul 23 '25 00:07 github-actions[bot]

This issue has been closed because it has been inactive for a long time. You can reopen it if you encounter the similar problem in the future.

github-actions[bot] avatar Jul 30 '25 00:07 github-actions[bot]

I am re-opening this as it is still affecting some of our repositories and raising questions from engineering teams. I took some time to think about a possible approach which would possibly reduce the effect. @Startrekzky I am thinking of the following approach:

Adjust getDeploymentCommit to use a two-phase strategy:

  • Direct match: if there exists a successful PRODUCTION deployment in the project whose commit_sha equals the PR’s merge commit, return it. This is precise and does not risk the first-deployment over-mapping problem.
  • Fallback to diff-based mapping: retain the current strategy (including the filter dc.prev_success_deployment_commit_id <> '') to avoid mapping to the first deployment via diffs.

The following is how this translates to the implementation (adding here for discussion before opening a PR):

// getDeploymentCommit takes a merge commit SHA, a repository ID, a list of deployment pairs, and a database connection as input.
// It returns the deployment pair related to the merge commit, or nil if not found.
func getDeploymentCommit(mergeSha string, projectName string, db dal.Dal) (*devops.CicdDeploymentCommit, errors.Error) {
	// Phase 1: direct mapping when the deployment run's commit_sha equals the PR's merge commit.
	// This is precise and safe, and should be allowed even if it's the first deployment in the collected window.
	direct := make([]*devops.CicdDeploymentCommit, 0, 1)
	err := db.All(
		&direct,
		dal.Select("dc.*"),
		dal.From("cicd_deployment_commits dc"),
		dal.Join("LEFT JOIN project_mapping pm ON (pm.table = 'cicd_scopes' AND pm.row_id = dc.cicd_scope_id)"),
		dal.Where("pm.project_name = ? AND dc.commit_sha = ? AND dc.result = ? AND dc.environment = 'PRODUCTION'", projectName, mergeSha, devops.RESULT_SUCCESS),
		dal.Orderby("dc.started_date, dc.id ASC"),
		dal.Limit(1),
	)
	if err != nil {
		return nil, err
	}
	if len(direct) > 0 {
		return direct[0], nil
	}

	// Phase 2: fall back to diff-based mapping (skipping the first deployment to avoid over-mapping).
	deploymentCommits := make([]*devops.CicdDeploymentCommit, 0, 1)
	// do not use `.First` method since gorm would append ORDER BY ID to the query which leads to an error
	err = db.All(
		&deploymentCommits,
		dal.Select("dc.*"),
		dal.From("cicd_deployment_commits dc"),
		dal.Join("LEFT JOIN cicd_deployment_commits p ON (dc.prev_success_deployment_commit_id = p.id)"),
		dal.Join("LEFT JOIN project_mapping pm ON (pm.table = 'cicd_scopes' AND pm.row_id = dc.cicd_scope_id)"),
		dal.Join("INNER JOIN commits_diffs cd ON (cd.new_commit_sha = dc.commit_sha AND cd.old_commit_sha = COALESCE (p.commit_sha, ''))"),
		dal.Where("dc.prev_success_deployment_commit_id <> ''"),
		dal.Where("dc.environment = 'PRODUCTION'"), // TODO: remove this when multi-environment is supported
		dal.Where("pm.project_name = ? AND cd.commit_sha = ? AND dc.RESULT = ?", projectName, mergeSha, devops.RESULT_SUCCESS),
		dal.Orderby("dc.started_date, dc.id ASC"),
		dal.Limit(1),
	)
	if err != nil {
		return nil, err
	}
	if len(deploymentCommits) == 0 {
		return nil, nil
	}
	return deploymentCommits[0], nil
}

petkostas avatar Aug 23 '25 14:08 petkostas

Hi @petkostas , thanks for the proposal. I'll discuss with @klesh to figure out how to optimize this part.

Startrekzky avatar Aug 25 '25 04:08 Startrekzky

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Oct 25 '25 00:10 github-actions[bot]

Will test next week

petkostas avatar Oct 25 '25 08:10 petkostas