[TRI-1861] Runs can permanently get stuck in queued status if the Redis server is unreachable in some circumstances

Open ericallam opened this issue 1 year ago • 0 comments

We use Redis to rate-limit run executions (by org, job, or concurrency group). There is a specific scenario where runs for an org, job, or concurrency group can get stuck: the Redis server becomes unreachable between the calls to the beforeTask and afterTask Lua scripts.

The beforeTask script can add a flag to the forbidden flags Redis set (tr:exec:forbiddenFlags), which prevents any graphile jobs that include a member of the set from running. If the Redis server then becomes unreachable before the afterTask script is invoked, the flag is never removed from tr:exec:forbiddenFlags, and runs associated with that org/job/concurrency group are never executed.
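
To make the failure mode concrete, here is a minimal in-memory sketch. It is not the actual Lua scripts or Redis client: `FakeRedis`, the simplified `beforeTask`/`afterTask` functions, and the flag name `org:123` are hypothetical stand-ins, and the model assumes beforeTask always adds the flag. Only the key name tr:exec:forbiddenFlags comes from the description above.

```typescript
// Hypothetical in-memory stand-in for Redis; not the real client or Lua scripts.
class FakeRedis {
  reachable = true;
  private forbiddenFlags = new Set<string>(); // models tr:exec:forbiddenFlags

  private assertReachable(): void {
    if (!this.reachable) throw new Error("Redis unreachable");
  }

  sadd(flag: string): void {
    this.assertReachable();
    this.forbiddenFlags.add(flag);
  }

  srem(flag: string): void {
    this.assertReachable();
    this.forbiddenFlags.delete(flag);
  }

  // Inspect stored state directly, bypassing the reachability check.
  snapshot(): string[] {
    return Array.from(this.forbiddenFlags);
  }
}

// Simplified assumption: beforeTask always adds the flag, afterTask removes it.
function beforeTask(redis: FakeRedis, flag: string): void {
  redis.sadd(flag);
}

function afterTask(redis: FakeRedis, flag: string): void {
  redis.srem(flag);
}

function simulateOutage(): string[] {
  const redis = new FakeRedis();
  beforeTask(redis, "org:123"); // flag added while Redis is up
  redis.reachable = false;      // brief outage begins
  try {
    afterTask(redis, "org:123"); // throws, so the flag is never removed
  } catch (err) {
    // No retry: the error is dropped, mirroring the behavior described above.
  }
  redis.reachable = true;        // Redis recovers, but nothing cleans up the flag
  return redis.snapshot();
}
```

In this model the flag is still present after Redis recovers, so any job carrying it would stay blocked until something removes it manually.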

We've had this happen once when we changed the instance size on our Redis cluster in the Trigger.dev Cloud and our Redis server had a brief (< 30s) period of downtime.

The fix

This could be fixed in a couple of ways:

Retry the afterTask script if invoking it throws an error, with a backoff up to 1 minute
Use a sorted set for the tr:exec:forbiddenFlags key and give each forbidden flag a "score" that is the timestamp. When determining which graphile jobs should be ignored, only include the forbidden flags that are fresh enough (something like 5 minutes). Then we'd have a "vacuum" job that removes any forbidden flags older than 60 minutes to keep the set size small.
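
Both options can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: all function names and constants are hypothetical, "backoff up to 1 minute" is read as capping the delay between attempts at 60 seconds, and the sorted set is modeled with an in-memory `Map` whose values play the role of scores. The comments name the standard Redis sorted-set commands that each step would roughly correspond to.

```typescript
// Option 1: retry afterTask with exponential backoff, delay capped at 1 minute.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 10, // assumed limit; the issue does not specify one
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

// Option 2: timestamp-scored flags with a freshness window and a vacuum pass.
const FRESH_WINDOW_MS = 5 * 60 * 1000; // "something like 5 minutes"
const VACUUM_AGE_MS = 60 * 60 * 1000;  // "older than 60 minutes"

type ScoredFlags = Map<string, number>; // flag -> score (timestamp in ms)

// Roughly: ZADD tr:exec:forbiddenFlags <nowMs> <flag>
function markForbidden(flags: ScoredFlags, flag: string, nowMs: number): void {
  flags.set(flag, nowMs);
}

// Roughly: ZRANGEBYSCORE tr:exec:forbiddenFlags <nowMs - window> +inf
function freshForbiddenFlags(flags: ScoredFlags, nowMs: number): string[] {
  const fresh: string[] = [];
  flags.forEach((score, flag) => {
    if (score >= nowMs - FRESH_WINDOW_MS) fresh.push(flag);
  });
  return fresh;
}

// Roughly: ZREMRANGEBYSCORE tr:exec:forbiddenFlags -inf (<nowMs - vacuumAge>
function vacuum(flags: ScoredFlags, nowMs: number): number {
  const stale: string[] = [];
  flags.forEach((score, flag) => {
    if (score < nowMs - VACUUM_AGE_MS) stale.push(flag);
  });
  stale.forEach((flag) => flags.delete(flag));
  return stale.length;
}
```

With the second option, a flag orphaned by a Redis outage stops blocking jobs once it falls outside the freshness window, even if afterTask never runs. The trade-off is that a legitimately long-running limit would need its score refreshed to stay in effect.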


Jan 11 '24 15:01 ericallam