fix(core/jdbc): enhance the worker liveness and heartbeat mechanisms (#3055)
Fix: #3055
Changes
- A Worker instance now has a complete state lifecycle, managed mainly by the Executor, which allows better handling of corner cases during Worker shutdown.
- The defined states and their expected transitions are shown below:
        +--------------+
+<----- |   Running    | ----------->+
|       +------+-------+             |
|              |                     |
|              v                     |
|       +------+-------+     +-------+------+
+-----> | Terminating  |<----| Disconnected |
        +------+-------+     +-------+------+
               |                     |
               v                     v
        +------+-------+     +-------+------+
        |  Terminated  |     |  Terminated  |
        |   Graceful   |     |    Forced    |
        +------+-------+     +-------+------+
               |                     |
               +----------+----------+
                          |
                          v
                   +------+-------+
                   |     Not      |
                   |   Running    |
                   +--------------+
- The two main Java classes responsible for managing Worker liveness are:
- In addition, this PR introduces the following new configuration properties:
worker:
  # The expected time for a worker to complete all of its
  # tasks before initiating a graceful shutdown.
  terminationGracePeriod: 5m
  # Worker liveness configuration when using Kestra with JDBC deployment.
  liveness:
    # Enable/Disable heartbeat/liveness check
    enabled: true
    # The expected time between liveness probes for a worker.
    interval: 3s
    # The timeout used to detect worker failures.
    timeout: 15s
    # The time to wait before executing a liveness probe for a worker.
    initialDelay: 30s
    # The expected time between worker heartbeats to the executor.
    heartbeatInterval: 3s
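To connect the lifecycle states with the liveness settings, here is a minimal Java sketch. It is not Kestra's implementation: `WorkerState`, `LivenessCheck`, and their methods are hypothetical names, and the transition table is only one reading of the diagram and of the scenarios in this description. The only setting it uses is `timeout`.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical enum mirroring the states in the diagram (not Kestra's actual class).
enum WorkerState {
    RUNNING, DISCONNECTED, TERMINATING, TERMINATED_GRACEFUL, TERMINATED_FORCED, NOT_RUNNING;

    // Allowed next states; an assumption derived from the diagram and the test scenarios.
    Set<WorkerState> next() {
        switch (this) {
            case RUNNING:
                return EnumSet.of(TERMINATING, DISCONNECTED);
            case DISCONNECTED:
                return EnumSet.of(TERMINATING, TERMINATED_FORCED);
            case TERMINATING:
                return EnumSet.of(TERMINATED_GRACEFUL, TERMINATED_FORCED);
            case TERMINATED_GRACEFUL:
            case TERMINATED_FORCED:
                return EnumSet.of(NOT_RUNNING);
            default:
                // NOT_RUNNING is final in this sketch.
                return EnumSet.noneOf(WorkerState.class);
        }
    }

    boolean canTransitionTo(WorkerState target) {
        return next().contains(target);
    }
}

// Hypothetical helper: decides whether a RUNNING worker should be marked DISCONNECTED.
final class LivenessCheck {
    private final Duration timeout;

    LivenessCheck(Duration timeout) {
        this.timeout = timeout;
    }

    // True when the last heartbeat is older than the configured timeout.
    boolean isTimedOut(Instant lastHeartbeat, Instant now) {
        return Duration.between(lastHeartbeat, now).compareTo(timeout) > 0;
    }

    // Returns the state the worker should be in after this probe.
    WorkerState evaluate(WorkerState current, Instant lastHeartbeat, Instant now) {
        if (current == WorkerState.RUNNING && isTimedOut(lastHeartbeat, now)) {
            return WorkerState.DISCONNECTED;
        }
        return current;
    }
}
```

With `timeout: 15s`, a worker whose last heartbeat is 10 seconds old stays `RUNNING`, while one that has been silent for 16 seconds is moved to `DISCONNECTED`.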
SQL Schema
This PR introduces the following schema change, because the status column was previously limited to varchar(10), which is too short for the new state names:
ALTER TABLE worker_instance ALTER COLUMN status TYPE varchar(32);
Testing strategies
- Worker is hard killed.
  - The Executor detects the disconnected worker after `timeout` and waits for `terminationGracePeriod` before re-emitting the corresponding trigger/job tasks.
- Worker is disconnected for more than `timeout`.
  - The Executor detects the disconnected worker after `timeout` and updates the worker's state to `DISCONNECTED`.
  - The worker re-connects before `terminationGracePeriod`:
    - The worker detects the `DISCONNECTED` state and transitions to `TERMINATING`.
    - If the worker transitions to `TERMINATED_GRACEFUL`, then the Executor updates the state to `NOT_RUNNING` and discards it.
    - If the worker transitions to `TERMINATED_FORCED`, then the Executor:
      - Re-emits the corresponding trigger/job tasks.
      - Updates the state to `NOT_RUNNING`.
  - The worker re-connects after `terminationGracePeriod`:
    - The worker detects the `NOT_RUNNING` or `EMPTY` state and transitions to `TERMINATING`, then `TERMINATED_FORCED`.
- Worker is hard killed while in `TERMINATING`.
  - The Executor re-emits the corresponding trigger/job tasks after `terminationGracePeriod`.
  - The Executor updates the worker state to `NOT_RUNNING` and discards it.
- Worker is shut down and transitions to `TERMINATED_GRACEFUL` before `terminationGracePeriod`.
  - The Executor updates the worker state to `NOT_RUNNING` and discards it.
- Worker is shut down and transitions to `TERMINATED_FORCED` before `terminationGracePeriod`.
  - The Executor re-emits the corresponding trigger/job tasks after `terminationGracePeriod`.
  - The Executor updates the worker state to `NOT_RUNNING` and discards it.
- Worker fails to send a heartbeat for more than `timeout`.
  - The worker proactively transitions to `DISCONNECTED`, `TERMINATING`, and so on.
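The Executor side of these scenarios can be summarized as a small decision function. The sketch below is illustrative only: `ExecutorReactionSketch`, `State`, `Action`, and `decide` are hypothetical names, and measuring the grace period from the moment the worker entered its current state is an assumption, since the description does not say where the clock starts.

```java
import java.time.Duration;

// All names are hypothetical; they only illustrate the scenarios listed above.
final class ExecutorReactionSketch {

    enum State { RUNNING, DISCONNECTED, TERMINATING, TERMINATED_GRACEFUL, TERMINATED_FORCED, NOT_RUNNING }

    enum Action { NONE, MARK_DISCONNECTED, MARK_NOT_RUNNING, REEMIT_TASKS_AND_MARK_NOT_RUNNING }

    static Action decide(State state,
                         Duration sinceLastHeartbeat,
                         Duration sinceStateChange,
                         Duration timeout,
                         Duration terminationGracePeriod) {
        switch (state) {
            case RUNNING:
                // No heartbeat for longer than `timeout`: the worker is considered disconnected.
                return sinceLastHeartbeat.compareTo(timeout) > 0
                    ? Action.MARK_DISCONNECTED
                    : Action.NONE;
            case DISCONNECTED:
            case TERMINATING:
                // Assumption: the grace period is counted from the last state change.
                // Once it has elapsed without a terminal state, the tasks are re-emitted.
                return sinceStateChange.compareTo(terminationGracePeriod) > 0
                    ? Action.REEMIT_TASKS_AND_MARK_NOT_RUNNING
                    : Action.NONE;
            case TERMINATED_GRACEFUL:
                // All tasks completed: nothing to re-emit.
                return Action.MARK_NOT_RUNNING;
            case TERMINATED_FORCED:
                // Simplification: the scenarios also delay this until the grace period has elapsed.
                return Action.REEMIT_TASKS_AND_MARK_NOT_RUNNING;
            default:
                return Action.NONE;
        }
    }
}
```

For example, with `timeout` set to 15 seconds and `terminationGracePeriod` set to 5 minutes, a worker that has been `DISCONNECTED` for 6 minutes yields `REEMIT_TASKS_AND_MARK_NOT_RUNNING`, whereas a `TERMINATED_GRACEFUL` worker only yields `MARK_NOT_RUNNING`.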
Compatibility, Deprecation, and Migration Plan
Worker states `UP` and `DEAD` are deprecated but still supported by the Executor.
The behavior during migration depends on which component is upgraded first. We should expect the following:
Executor is upgraded first:
- Workers in previous versions will continue to heartbeat with `UP`.
- The Executor will send a legacy `DEAD` status to those workers if they time out.
- If a timed-out worker respawns after the termination grace period, it may experience a deserialization exception if the worker is still in `NOT_RUNNING` (which should probably not happen). In any case, this last situation is acceptable, as the worker will stop.
Worker is upgraded first:
- Workers in the new version will heartbeat with `RUNNING`.
- The previous version's executor will never manage these workers (as their state is unknown to it) until it is updated too.
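The "unknown state" problem in both upgrade orders can be illustrated with plain Java enums. This is only an analogue: the description does not show how Kestra deserializes worker states, and `LegacyWorkerState` and `LegacyStateParser` are hypothetical names. What is certain is that `Enum.valueOf` throws `IllegalArgumentException` for a name the enum does not declare.

```java
import java.util.Optional;

// Hypothetical stand-in for a version that only knows the deprecated states.
enum LegacyWorkerState { UP, DEAD }

final class LegacyStateParser {

    // Returns an empty Optional when the state name is unknown to this version,
    // instead of letting the IllegalArgumentException propagate.
    static Optional<LegacyWorkerState> parse(String name) {
        try {
            return Optional.of(LegacyWorkerState.valueOf(name));
        } catch (IllegalArgumentException unknownState) {
            return Optional.empty();
        }
    }
}
```

Here `LegacyStateParser.parse("UP")` resolves normally, while `parse("RUNNING")` and `parse("NOT_RUNNING")` come back empty, which mirrors a component on the old version meeting a state introduced by this PR.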
A worker will re-register itself when a heartbeat fails because the current state is either DEAD or EMPTY (i.e., the worker was removed by the executor).
I'm not convinced this is the right choice here, as a new worker may have taken over its jobs; that's why we exit it.
As discussed in the meeting with @fhussonnois, the following states will be added, initially only for the JDBC architecture:
- RUNNING
- TERMINATING
- TERMINATED_GRACEFULLY
- TERMINATED_FORCED
- DISCONNECTED
- NOT_RUNNING
They replace the plain UP/DEAD health states.
A good addition for later would be to make the re-submission of tasks optional; it may be worth opening a follow-up issue about it.