bitrise-grafana-api icon indicating copy to clipboard operation
bitrise-grafana-api copied to clipboard

Some builds are showing up with large duration and queue times

Open anuerind opened this issue 3 years ago • 4 comments

There are several builds (mostly those aborted in the queue) that are being reported as having a large execution and queue time. These range from 8694 to 40384 minutes which is impossible since the bitrise elite plan has a limit of 90 minutes. Checking the results via https://api-docs.bitrise.io/#/builds/build-show shows something similar below:

{ "data": { "triggered_at": "2021-03-10T09:57:02Z", "started_on_worker_at": null, "environment_prepare_finished_at": null, "finished_at": "2021-03-10T10:02:16Z", "slug": "13141942932b35cb", "status": 3, "status_text": "aborted", "abort_reason": "User PXXXX requested to abort this build.", "is_on_hold": false,

....

}

And the data that arrived to grafana has these

Status Duration Queued
aborted 8694 8694

anuerind avatar Mar 18 '21 02:03 anuerind

Hi Anuerind,

This is the time the build was waiting not running.

To calculate the running duration use started_on_worker_at instead of triggered_at

DamienBitrise avatar Mar 18 '21 03:03 DamienBitrise

@DamienBitrise thanks for confirming. Would that mean that when visualizing the data the aborted ones should not be included? Shouldnt the Duration field for aborted statuses remain at zero since the Queued column already contains the duration that the build is waiting? Is there a way to distinguish the entries that were aborted either due to timeout or manually while running?

anuerind avatar Mar 18 '21 03:03 anuerind

I saw an outlier as well that was successful but the Duration and Queued values are also very high. The API returns the following:

{ "data": { "triggered_at": "2021-02-18T11:58:36Z", "started_on_worker_at": null, "environment_prepare_finished_at": null, "finished_at": "2021-02-18T12:16:39Z", "slug": "b991a9b3fe69a301", "status": 1, "status_text": "success", "abort_reason": null, "is_on_hold": false,

"triggered_by": "webhook",
"machine_type_id": "elite",
"stack_identifier": "osx-xcode-11.7.x",
...

} }

and the data that is shown in grafana is this:

Build # Status Duration Queued
11059 success 37372.4 37372.4

I think what caused this outlier is that while it was successful (run time is around 18 minutes), the started_on_worker _at field remains set to null.

anuerind avatar Mar 18 '21 03:03 anuerind

Caveat that I am not a developer (for more than 15 years).

I was looking at the builds.js logic and it may be that there should be an initial filtering of completed builds (abort, errored, successful) since the logic in the getBuildTableData function is biased in using the started_on_worker_at field to determine if the build is still active or unserviced. The first if-check on an "aborted but never started" build wilkl then use the current timestamp to determine the queue time. This will then result to a very large false-positive value if the retrieval was done several hours after the build request was aborted.

anuerind avatar Mar 18 '21 03:03 anuerind