incubator-devlake icon indicating copy to clipboard operation
incubator-devlake copied to clipboard

[Refactor]Grafana - Engineering Throughput and Cycle Time] PR review depth and PR size take forever/don't work

Open robgutsopedra opened this issue 2 years ago • 11 comments

What and why to refactor

PR review depth and PR size take forever/don't work at all.

Describe the solution you'd like

I assume they should work as any other panel within Grafana

Related issues

None that I have found

Additional context

I have been focusing my efforst on PR review depth

I was using v0.18.0 query, but also tried query at main branch (which I believe are the same). For context, this:

SELECT
  DATE_ADD(date(pr.created_date), INTERVAL -$interval(date(pr.created_date))+1 DAY) as time,
  count(distinct prc.id)/count(distinct pr.id) as "PR Review Depth"
FROM 
  pull_requests pr
  left join pull_request_comments prc on pr.id = prc.pull_request_id
  join project_mapping pm on pr.base_repo_id = pm.row_id and pm.table = 'repos' 
WHERE
  $__timeFilter(pr.created_date)
  and pm.project_name in ($project)
  and pr.merged_date is not null
group by 1

I also tried a solution proposed by @Startrekzky --

with t1 as(
   select 
       pull_request_id, count(*) as pr_comment_number
   from
       pull_request_comments
   group by 1
)


SELECT
  DATE_ADD(date(pr.created_date), INTERVAL -$interval(date(pr.created_date))+1 DAY) as time,
  sum(t1.pr_comment_number)/count(distinct pr.id) as "PR Review Depth"
FROM 
  pull_requests pr
  left join t1 on t1.pull_request_id = pr.id
  join project_mapping pm on pr.base_repo_id = pm.row_id and pm.table = 'repos' 
WHERE
  $__timeFilter(pr.created_date)
  and pm.project_name in ($project)
  and pr.merged_date is not null
group by 1

Which I think it's working better, but still far from perfect.

In my case, I have around 3600 PRs, 23254 PR comments - 8 * 3600 * 23254 = 669M. I understand it might be big, but I think devLake handles heavier situations than this. Am I the only one getting degraded performance on these queries?

Anyway, I'm far from an SQL expert but I'm more than happy to test/try anything you throw my way!

Thanks a lot!

robgutsopedra avatar Nov 14 '23 11:11 robgutsopedra

Thanks for reporting. I'll look at it soon.

Startrekzky avatar Nov 15 '23 14:11 Startrekzky

If you need anything from me, @Startrekzky , let me know!!

robgutsopedra avatar Nov 15 '23 14:11 robgutsopedra

@robgutsopedra Thanks, I've made further modifications based on the original SQL. Please try it out:

with t1 as(
   select 
       pull_request_id as pr_id, 
       count(*) as pr_comment_number
   from
       pull_request_comments
   group by 1
),

t2 as(
  SELECT
    DATE_ADD(date(pr.created_date), INTERVAL -$interval(date(pr.created_date))+1 DAY) as time,
    pr.id as pr_id
  FROM 
    pull_requests pr
    join project_mapping pm on pr.base_repo_id = pm.row_id and pm.table = 'repos' 
  WHERE
    $__timeFilter(pr.created_date)
    and pm.project_name in ($project)
    and pr.merged_date is not null
)

-- select t2.*, t1.pr_comment_number FROM  t2 left join t1 on t2.pr_id = t1.pr_id
select 
  t2.time,
  t1.pr_comment_number,
  t2.pr_id,
  sum(t1.pr_comment_number)/count(distinct t2.pr_id) as "PR Review Depth"
from 
  t2
  left join t1 on t2.pr_id = t1.pr_id
group by 1

Startrekzky avatar Nov 16 '23 16:11 Startrekzky

Hello @Startrekzky ! Thanks a lot for your help! I have tried your suggestion, but I'm afraid it doesn't seem to work:

image

I tried to do some modifications and test it directly on the DB, and the last "GROUP_BY" failed because:

SQL Error [1055] [42000]: Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 't1.pr_comment_number' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

So, trying to be sueful, I finally went with:

with t1 as(
   select 
       pull_request_id as pr_id, 
       count(*) as pr_comment_number
   from
       pull_request_comments
   group by 1
),

t2 as(
  SELECT
    DATE_ADD(date(pr.created_date), INTERVAL -$interval(date(pr.created_date))+1 DAY) as times,
    pr.id as pr_id
  FROM 
    pull_requests pr
    join project_mapping pm on pr.base_repo_id = pm.row_id and pm.table = 'repos' 
  WHERE
    $__timeFilter(pr.created_date)
    and pm.project_name in ($project)
    and pr.merged_date is not null
)

-- select t2.*, t1.pr_comment_number FROM  t2 left join t1 on t2.pr_id = t1.pr_id
select 
  t2.times,
  t1.pr_comment_number,
  t2.pr_id,
  sum(t1.pr_comment_number)/count(distinct t2.pr_id) as "PR Review Depth"
FROM 
  t2
  LEFT JOIN t1 ON t2.pr_id = t1.pr_id
GROUP BY t2.times, t2.pr_id, t1.pr_comment_number;

But I'm afraid is not working either: image

Any suggestion is super welcomed!

robgutsopedra avatar Nov 27 '23 08:11 robgutsopedra

@robgutsopedra Hmm. It worked when I tested it on my local machine. I'll get back to you soon.

Startrekzky avatar Nov 27 '23 11:11 Startrekzky

Hi @robgutsopedra , can you try out this SQL?

with _pr_commits_data as(
  SELECT
    DATE_ADD(date(pr.created_date), INTERVAL -MONTH(date(pr.created_date))+1 DAY) as time,
    pr.id as pr_id,
    pr.merge_commit_sha,
    sum(c.additions)+sum(c.deletions) as loc
  FROM 
    pull_requests pr
    left join commits c on pr.merge_commit_sha = c.sha
    join project_mapping pm on pr.base_repo_id = pm.row_id and pm.table = 'repos' 
  WHERE
    $__timeFilter(pr.created_date)
    and pm.project_name in ($project)
  group by 1,2,3
)

SELECT 
  time,
  sum(loc)/count(distinct pr_id) as 'PR Size'
FROM _pr_commits_data
GROUP BY 1

Startrekzky avatar Dec 01 '23 13:12 Startrekzky

Just need and "AND" in the where clause, and it works perfectly!!

with _pr_commits_data as(
  SELECT
    DATE_ADD(date(pr.created_date), INTERVAL -MONTH(date(pr.created_date))+1 DAY) as time,
    [pr.id](http://pr.id/) as pr_id,
    pr.merge_commit_sha,
    sum(c.additions)+sum(c.deletions) as loc
  FROM 
    pull_requests pr
    left join commits c on pr.merge_commit_sha = c.sha
    join project_mapping pm on pr.base_repo_id = pm.row_id and pm.table = 'repos' 
  WHERE
    $__timeFilter(pr.created_date) AND
    pm.project_name in ($project)
  group by 1,2,3
)

SELECT 
  time,
  sum(loc)/count(distinct pr_id) as 'PR Size'
FROM _pr_commits_data
GROUP BY 1

It worked almost immediately!!

The only query that's still not working fine is PR size. Could I be over-ambitious and ask for some help on that one?

A million thanks for your help!

robgutsopedra avatar Dec 01 '23 13:12 robgutsopedra

@robgutsopedra You're welcome.

The only query that's still not working fine is PR size. Could I be over-ambitious and ask for some help on that one?

I'm a bit confused. Isn't the above SQL for PR size?

Startrekzky avatar Dec 04 '23 02:12 Startrekzky

Agh, sorry @Startrekzky , I was referring to PR Review Depth, that's the only one not working yet!

Sorry for the confusion!

robgutsopedra avatar Dec 04 '23 08:12 robgutsopedra

Agh, sorry @Startrekzky , I was referring to PR Review Depth, that's the only one not working yet!

Sorry for the confusion!

I see. I'll check it out.

Startrekzky avatar Dec 04 '23 08:12 Startrekzky

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Feb 17 '24 00:02 github-actions[bot]

This issue has been closed because it has been inactive for a long time. You can reopen it if you encounter the similar problem in the future.

github-actions[bot] avatar Feb 25 '24 00:02 github-actions[bot]