worker A robust way to warn about potentially stuck jobs based on statistical data

Feature description

Arguably, you always want to know when a job gets stuck. Often, deadlocks are invisible, and with a large number of tasks it's not easy to track what got stuck and when.

I suggest we implement a statistics-based monitor of tasks that warns if a task runs longer than its typical runtime. Additionally, since the durations for a given job might vary dramatically, I suggest adding an abstract value called "cost" which could be calculated based on the payload ((payload: Payload) => number).

We would add an average_time column to the jobs table that would contain the average runtime for a given job with the cost of 1.0.

After the first and each subsequent run, graphile worker would set the average_time to the average run time for a given job normalized to the cost of 1.
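A minimal sketch of the cost function and the normalized running average described above. All names here (`CostFn`, `TaskStats`, `recordRun`, `imageCost`) are hypothetical and not part of graphile worker. It also assumes a run counter is stored next to the average so the mean can be updated incrementally, which the proposal does not spell out.

```typescript
// Hypothetical types and helpers; nothing here is graphile worker API.
type CostFn<P> = (payload: P) => number;

interface TaskStats {
  runs: number;        // completed runs observed so far (assumed extra field)
  averageTime: number; // mean runtime in ms, normalized to a cost of 1.0
}

// Fold one finished run into the stats for its task.
function recordRun(
  stats: TaskStats | undefined,
  durationMs: number,
  cost: number,
): TaskStats {
  if (!(cost > 0)) {
    throw new RangeError("cost must be a positive number");
  }
  // Scale the observed duration down to what a cost-1.0 job would have taken.
  const normalized = durationMs / cost;
  if (stats === undefined) {
    return { runs: 1, averageTime: normalized };
  }
  const runs = stats.runs + 1;
  // Incremental mean: avoids storing every past duration.
  const averageTime = stats.averageTime + (normalized - stats.averageTime) / runs;
  return { runs, averageTime };
}

// Example: the cost of an image job grows with its pixel count.
const imageCost: CostFn<{ pixels: number }> = (p) => p.pixels / 1_000_000;

let stats = recordRun(undefined, 2000, imageCost({ pixels: 2_000_000 }));
stats = recordRun(stats, 6000, imageCost({ pixels: 4_000_000 }));
console.log(stats); // { runs: 2, averageTime: 1250 }
```

The first run takes 2000 ms at cost 2 (1000 ms per unit of cost), the second 6000 ms at cost 4 (1500 ms per unit), so the normalized average is 1250 ms. An exponential moving average would be an alternative that needs no run counter.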

In the graphile_worker runloop we would have to check for jobs that belong to the current worker and that are running longer than their expected time (the average time for a cost of one, scaled by the job's cost) + some threshold. If the runtime surpasses the estimated end_time, graphile worker would produce a warning log.
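The check itself could look roughly like the sketch below. It assumes the expected runtime is the normalized average multiplied by the job's cost, plus a fixed threshold in milliseconds. `RunningJob` and `findOverdueJobs` are hypothetical names, and how the run loop would obtain the list of running jobs is left open.

```typescript
// Hypothetical shape of a job currently being executed by this worker.
interface RunningJob {
  id: string;
  taskIdentifier: string;
  startedAt: number; // epoch milliseconds
  cost: number;      // result of the task's cost function for this payload
}

// Return the jobs whose elapsed time exceeds average * cost + threshold.
function findOverdueJobs(
  running: RunningJob[],
  averageTimeByTask: Map<string, number>,
  now: number,
  thresholdMs: number,
): RunningJob[] {
  return running.filter((job) => {
    const average = averageTimeByTask.get(job.taskIdentifier);
    // Without history there is nothing to compare against, so never warn.
    if (average === undefined) return false;
    const expectedMs = average * job.cost + thresholdMs;
    return now - job.startedAt > expectedMs;
  });
}

const averages = new Map<string, number>([["resize_image", 1000]]);
const running: RunningJob[] = [
  { id: "a", taskIdentifier: "resize_image", startedAt: 0, cost: 2 },
  { id: "b", taskIdentifier: "resize_image", startedAt: 4000, cost: 1 },
  { id: "c", taskIdentifier: "never_seen_before", startedAt: 0, cost: 1 },
];

const overdue = findOverdueJobs(running, averages, 5000, 500);
for (const job of overdue) {
  console.warn(`Job ${job.id} (${job.taskIdentifier}) may be stuck`);
}
```

Here only job `a` is reported: it has been running for 5000 ms against an expected 2500 ms. Job `b` is still within its 1500 ms budget, and job `c` has no recorded average yet.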

Motivating example

Assume you had a job that exhausted a connection pool to the db and didn't release the connection. The subsequent tasks will be forever waiting for the connection.

Supporting development

[ ] am interested in building this feature myself
[x] am interested in collaborating on building this feature
[ ] am willing to help testing this feature before it's released
[ ] am willing to write a test-driven test suite for this feature (before it exists)
[ ] am a Graphile sponsor ❤️
[ ] have an active support or consultancy contract with Graphile

Extra notes

I am planning to implement a similar feature in the user space. I would consider making a contribution to graphile worker instead but I must say that the development process + quality and style requirements make me a bit worried about the time investment required and the timeline at which the feature would be released. I completely understand where it's coming from and I very much appreciate your effort but I just want to be transparent about why I'm not necessarily considering contributing this to graphile worker. It's not because I don't want to contribute.

Sep 24 '22 14:09 wokalski

Really interesting idea! I think user space is probably the right place for this, at least initially. I'd want to see it grow, be used, and consensus from many users that it's a) useful, and b) the right solution before considering putting it into core. There's deliberately a pretty high bar to getting features into core, and it feels to me that this one is not a slam-dunk (for example, I don't think I would use it, personally).

In addition to the approach that you propose, I'd like to see an approach based on the event system, maybe with a central broker (or tracking table) being updated when things out of the ordinary happen. I'd also consider allowing users to manually set "metadata" on their tasks indicating the expected runtime and max allowable runtime.
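As a rough illustration of the metadata idea, the sketch below tracks job start and completion and classifies a running job against user-declared limits. `RuntimeLimits`, `RuntimeTracker`, and the `send_email` task are hypothetical; wiring `onStart` and `onComplete` to the worker's event system is assumed and not shown.

```typescript
// Hypothetical per-task metadata declared by the user.
interface RuntimeLimits {
  expectedMs: number; // typical runtime
  maxMs: number;      // longest runtime still considered acceptable
}

type Severity = "ok" | "slow" | "overdue";

class RuntimeTracker {
  private readonly limits: Record<string, RuntimeLimits>;
  private readonly started = new Map<string, { task: string; at: number }>();

  constructor(limits: Record<string, RuntimeLimits>) {
    this.limits = limits;
  }

  // Intended to be called from a "job started" event handler.
  onStart(jobId: string, task: string, at: number): void {
    this.started.set(jobId, { task, at });
  }

  // Intended to be called when a job finishes, successfully or not.
  onComplete(jobId: string): void {
    this.started.delete(jobId);
  }

  // Compare a job's elapsed time with the limits declared for its task.
  classify(jobId: string, now: number): Severity {
    const entry = this.started.get(jobId);
    if (entry === undefined) return "ok"; // not running
    const limits = this.limits[entry.task];
    if (limits === undefined) return "ok"; // no metadata declared
    const elapsed = now - entry.at;
    if (elapsed > limits.maxMs) return "overdue";
    if (elapsed > limits.expectedMs) return "slow";
    return "ok";
  }
}

const tracker = new RuntimeTracker({
  send_email: { expectedMs: 1000, maxMs: 5000 },
});
tracker.onStart("1", "send_email", 0);
console.log(tracker.classify("1", 2000)); // "slow"
console.log(tracker.classify("1", 6000)); // "overdue"
```

Keeping this state in memory or in a separate tracking table, rather than in the graphile_worker schema, would stay in line with the compatibility advice in this thread.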

When implementing this in user space, I suggest that you do so in separate tables (outside of the graphile_worker schema) so that you remain compatible with core in future. If you need additional hooks/events to enable certain capabilities I'm certainly happy to consider those as smaller, isolated changes 👍

Sep 24 '22 15:09 benjie

Not sure if you ever open-sourced your implementation of this @wokalski, but if you did I'd love to see it!

Oct 24 '23 11:10 benjie
