bitrise-grafana-api
bitrise-grafana-api copied to clipboard
Some builds are showing up with large duration and queue times
There are several builds (mostly those aborted in the queue) that are being reported as having a large execution and queue time. These range from 8694 to 40384 minutes which is impossible since the bitrise elite plan has a limit of 90 minutes. Checking the results via https://api-docs.bitrise.io/#/builds/build-show shows something similar below:
{ "data": { "triggered_at": "2021-03-10T09:57:02Z", "started_on_worker_at": null, "environment_prepare_finished_at": null, "finished_at": "2021-03-10T10:02:16Z", "slug": "13141942932b35cb", "status": 3, "status_text": "aborted", "abort_reason": "User PXXXX requested to abort this build.", "is_on_hold": false,
....
}
And the data that arrived to grafana has these
Status | Duration | Queued |
---|---|---|
aborted | 8694 | 8694 |
Hi Anuerind,
This is the time the build was waiting not running.
To calculate the running duration use started_on_worker_at
instead of triggered_at
@DamienBitrise thanks for confirming. Would that mean that when visualizing the data the aborted ones should not be included? Shouldnt the Duration field for aborted statuses remain at zero since the Queued column already contains the duration that the build is waiting? Is there a way to distinguish the entries that were aborted either due to timeout or manually while running?
I saw an outlier as well that was successful but the Duration and Queued values are also very high. The API returns the following:
{ "data": { "triggered_at": "2021-02-18T11:58:36Z", "started_on_worker_at": null, "environment_prepare_finished_at": null, "finished_at": "2021-02-18T12:16:39Z", "slug": "b991a9b3fe69a301", "status": 1, "status_text": "success", "abort_reason": null, "is_on_hold": false,
"triggered_by": "webhook", "machine_type_id": "elite", "stack_identifier": "osx-xcode-11.7.x", ...
} }
and the data that is shown in grafana is this:
Build # | Status | Duration | Queued |
---|---|---|---|
11059 | success | 37372.4 | 37372.4 |
I think what caused this outlier is that while it was successful (run time is around 18 minutes), the started_on_worker _at field remains set to null.
Caveat that I am not a developer (for more than 15 years).
I was looking at the builds.js logic and it may be that there should be an initial filtering of completed builds (abort, errored, successful) since the logic in the getBuildTableData function is biased in using the started_on_worker_at field to determine if the build is still active or unserviced. The first if-check on an "aborted but never started" build wilkl then use the current timestamp to determine the queue time. This will then result to a very large false-positive value if the retrieval was done several hours after the build request was aborted.