
[RFE] Mechanism to expose last execution start and end timestamp

Open evallesp opened this issue 2 years ago • 3 comments

It would be great to expose the start and end timestamps of the last execution in the response of the /metrics API. With this, it would be possible to create more specific Prometheus rules that warn when:

  • DLRN gets stuck in old stable branches.
  • Execution time exceeds a maximum duration (possible resource fault).

The Prometheus rules are currently based on the total number of builds processed in the last day. Because of this, old stable branches, which may see few new builds, can produce false positives.

evallesp avatar Mar 24 '22 17:03 evallesp

One idea for a DLRN "liveness" check would be to look at the timestamp on dlrn-logs. Every DLRN run creates a new file in that folder, so its modify time gets updated automatically, and we could have a custom metric in Prometheus taking that value: stat --format=%Y /home/$DLRN_worker/dlrn-logs

apevec avatar Mar 24 '22 18:03 apevec
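
The modify-time idea from the comment above could be sketched roughly as follows. This is only an illustration, not code from DLRN: the metric name `dlrn_last_run_age_seconds`, the `worker` label, and both function names are hypothetical, and it assumes the output would be scraped through some text-based collector.

```python
import os
import time


def seconds_since_last_run(logs_dir, now=None):
    """Return how many seconds ago logs_dir was last modified."""
    now = time.time() if now is None else now
    # st_mtime is the timestamp that `stat --format=%Y` reports,
    # here with sub-second precision.
    mtime = os.stat(logs_dir).st_mtime
    return max(0.0, now - mtime)


def render_metric(logs_dir, worker):
    """Format the age as one line of Prometheus text exposition format."""
    # Metric and label names are hypothetical, chosen for this sketch.
    age = seconds_since_last_run(logs_dir)
    return 'dlrn_last_run_age_seconds{worker="%s"} %d\n' % (worker, age)


if __name__ == "__main__":
    # Example: report on the current directory instead of a real dlrn-logs path.
    print(render_metric(".", "example-worker"), end="")
```

A rule could then alert when the reported age exceeds whatever threshold is considered "stuck" for a given worker, which avoids counting builds at all.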

I've been discussing this with @evallesp, and I'd say that a simple enough implementation would be to make the metrics API return the running time of the current DLRN process, or 0 if none is currently running. Implementing it with the psutil Python module is pretty simple, shouldn't impact API call performance, and doesn't require major changes to base DLRN functionality or the API. WDYT?

amoralej avatar Mar 25 '22 10:03 amoralej
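
A minimal sketch of the psutil idea from the comment above, under stated assumptions: it is not the code from the proposed change, the function name `dlrn_running_seconds` is hypothetical, and matching processes by the substring "dlrn" in the command line is a guess that would need tightening (it could also match the API process itself).

```python
import time


def dlrn_running_seconds(marker="dlrn", processes=None, now=None):
    """Return the running time in seconds of the oldest matching process.

    Returns 0.0 when no process whose command line contains `marker`
    is found. `processes` can be injected for testing; by default the
    live process table is read through psutil.
    """
    if processes is None:
        import psutil  # third-party dependency, imported lazily

        processes = psutil.process_iter(["cmdline", "create_time"])
    now = time.time() if now is None else now

    start_times = []
    for proc in processes:
        # psutil fills unavailable attributes with None (e.g. access denied).
        cmdline = proc.info.get("cmdline") or []
        create_time = proc.info.get("create_time")
        if create_time is not None and any(marker in part for part in cmdline):
            start_times.append(create_time)

    if not start_times:
        return 0.0
    return max(0.0, now - min(start_times))
```

Returning 0 when nothing is running keeps the metric a plain gauge, so a rule only needs to compare it against a maximum allowed duration.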

Here is the proposed change: https://softwarefactory-project.io/r/c/DLRN/+/24432/

I really like your idea @amoralej

evallesp avatar Mar 25 '22 14:03 evallesp

The proposed change https://softwarefactory-project.io/r/c/DLRN/+/24432/ was abandoned with the following justification:

"We are going to change the way to check if the worker gets stuck. No API modification needed."

apevec avatar Oct 24 '23 16:10 apevec