fix(core/jdbc): enhance the worker liveness and heartbeat mechanisms (#3055)

Open fhussonnois opened this issue 2 years ago • 3 comments

Fix: #3055

Changes

  • A Worker instance now has a complete state lifecycle, handled mainly by the Executor, which allows better handling of some corner cases during Worker shutdown.
  • The defined states and their expected transitions are:
                 +--------------+
         +<----- | Running      | -------->+
         |       +------+-------+          |
         |              |                  |
         |              v                  |
         |       +------+-------+   +------+-------+
         +-----> | Terminating  |<--| Disconnected |
                 +------+-------+   +--------------+
                    |          |
                    v          v
       +------+-------+       +------+-------+
       | Terminated   |       | Terminated   |
       | Graceful     |       | Forced       |
       +--------------+       +--------------+
                    |          |
                    v          v
                  +------+-------+
                  | Not          |
                  | Running      |
                  +--------------+
  • The worker liveness and heartbeat mechanism is configured as follows:
  worker:
    # The expected time for a worker to complete all of its
    # tasks before initiating a graceful shutdown.
    terminationGracePeriod: 5m
    # Worker liveness configuration when using Kestra with JDBC deployment.
    liveness:
      # Enable/Disable heartbeat/liveness check
      enabled: true
      # The expected time between liveness probes for a worker.
      interval: 3s
      # The timeout used to detect worker failures.
      timeout: 15s
      # The time to wait before executing a liveness probe for a worker.
      initialDelay: 30s
      # The expected time between worker heartbeats to the executor.
      heartbeatInterval: 3s
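
How these timing properties interact can be sketched in Java. This is a minimal illustration, not Kestra's implementation: the class `LivenessCheck` and its methods are hypothetical, and the sketch assumes that `initialDelay` is counted from the worker's start and that a worker counts as timed out once its last heartbeat is older than `timeout`.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper; the names are illustrative and not taken from the Kestra code base. */
final class LivenessCheck {
    private final Duration timeout;
    private final Duration initialDelay;
    private final Duration terminationGracePeriod;

    LivenessCheck(Duration timeout, Duration initialDelay, Duration terminationGracePeriod) {
        this.timeout = timeout;
        this.initialDelay = initialDelay;
        this.terminationGracePeriod = terminationGracePeriod;
    }

    /**
     * Returns true once the last heartbeat is older than the timeout.
     * Assumption: no probe is evaluated before startedAt + initialDelay.
     */
    boolean isTimedOut(Instant startedAt, Instant lastHeartbeat, Instant now) {
        if (now.isBefore(startedAt.plus(initialDelay))) {
            return false;
        }
        return Duration.between(lastHeartbeat, now).compareTo(timeout) > 0;
    }

    /** Returns true once more than terminationGracePeriod has passed since the state change. */
    boolean isGracePeriodElapsed(Instant stateChangedAt, Instant now) {
        return Duration.between(stateChangedAt, now).compareTo(terminationGracePeriod) > 0;
    }
}
```

With the values shown above (heartbeatInterval: 3s, timeout: 15s), a worker has to miss about five consecutive heartbeats before such a check reports a timeout.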

SQL Schema

This PR introduces the following schema change, because the status column was limited to varchar(10), which is too short for the new state names:

ALTER TABLE worker_instance ALTER COLUMN status TYPE varchar(32);

Testing strategies

  • Worker is hard killed.

    • The Executor detects the disconnected worker after timeout and waits for terminationGracePeriod before re-emitting the corresponding trigger/job tasks.
  • Worker is disconnected for more than timeout.

    • The Executor detects the disconnected worker after timeout and updates the worker's state to DISCONNECTED.
    • The worker re-connects before terminationGracePeriod:
      • The worker detects the DISCONNECTED state and transitions to TERMINATING.
      • If the worker transitions to TERMINATED_GRACEFUL, then the Executor updates the state to NOT_RUNNING and discards it.
      • If the worker transitions to TERMINATED_FORCED, then the Executor:
        • Re-emits the corresponding trigger/job tasks.
        • Updates the state to NOT_RUNNING.
    • The worker re-connects after terminationGracePeriod:
      • The worker detects the NOT_RUNNING or EMPTY state and transitions to TERMINATING, then to TERMINATED_FORCED.
  • Worker is hard killed while being in TERMINATING.

    • The Executor re-emits the corresponding trigger/job tasks after terminationGracePeriod.
    • The Executor updates the worker state to NOT_RUNNING and discards it.
  • Worker is shut down and transitions to TERMINATED_GRACEFUL before terminationGracePeriod.

    • The Executor updates the worker state to NOT_RUNNING and discards it.
  • Worker is shut down and transitions to TERMINATED_FORCED before terminationGracePeriod.

    • The Executor re-emits the corresponding trigger/job tasks after terminationGracePeriod.
    • The Executor updates the worker state to NOT_RUNNING and discards it.
  • Worker fails to send a heartbeat for more than timeout.

    • The worker proactively transitions to DISCONNECTED, then TERMINATING, and so on.
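
The transitions exercised by these scenarios can be sketched as a small state machine. This is an illustrative sketch only: `WorkerState`, `next`, `canTransitionTo`, and `ExecutorReaction` are hypothetical names, the transition table encodes only the edges drawn in the state diagram (executor-forced moves, such as TERMINATING to NOT_RUNNING after a hard kill, are left out), and the timing of re-emission relative to terminationGracePeriod is outside the sketch.

```java
import java.util.EnumSet;
import java.util.Set;

/** Worker states named in this PR; transitions mirror the state diagram only. */
enum WorkerState {
    RUNNING, DISCONNECTED, TERMINATING, TERMINATED_GRACEFUL, TERMINATED_FORCED, NOT_RUNNING;

    /** States reachable in one step, as drawn in the diagram. */
    Set<WorkerState> next() {
        switch (this) {
            case RUNNING:
                return EnumSet.of(TERMINATING, DISCONNECTED);
            case DISCONNECTED:
                return EnumSet.of(TERMINATING);
            case TERMINATING:
                return EnumSet.of(TERMINATED_GRACEFUL, TERMINATED_FORCED);
            case TERMINATED_GRACEFUL:
            case TERMINATED_FORCED:
                return EnumSet.of(NOT_RUNNING);
            default:
                // NOT_RUNNING is terminal.
                return EnumSet.noneOf(WorkerState.class);
        }
    }

    boolean canTransitionTo(WorkerState target) {
        return next().contains(target);
    }
}

/** Hypothetical executor-side decision derived from the scenarios above. */
final class ExecutorReaction {
    private ExecutorReaction() {
    }

    /**
     * Whether the worker's trigger/job tasks need to be re-emitted before it is
     * marked NOT_RUNNING. Only a graceful termination avoids re-emission.
     */
    static boolean requiresReEmission(WorkerState lastObserved) {
        switch (lastObserved) {
            case DISCONNECTED:      // hard killed or unreachable past the timeout
            case TERMINATING:       // killed before reaching a terminal state
            case TERMINATED_FORCED: // shut down without completing its tasks
                return true;
            default:
                return false;
        }
    }
}
```

For example, `WorkerState.DISCONNECTED.canTransitionTo(WorkerState.TERMINATING)` holds, while `ExecutorReaction.requiresReEmission(WorkerState.TERMINATED_GRACEFUL)` does not.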

Compatibility, Deprecation, and Migration Plan

Worker states UP and DEAD are deprecated but still supported by the Executor.

The migration path depends on which component is upgraded first. In either case, we should expect the following behavior:

Executor is upgraded first:

  • Workers in previous versions will continue to heartbeat with UP.
  • The Executor will send a legacy DEAD status to those workers if they time out.
  • If a timed-out worker respawns after the termination grace period, it may hit a deserialization exception if its state is still NOT_RUNNING (which should probably not happen). In any case, this last situation is acceptable, as the worker will stop.

Worker is upgraded first:

  • Workers in the new version will heartbeat with RUNNING.
  • The previous version's executor will never manage these workers (as their state is unknown) until it is updated too.
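
Both upgrade orders come down to which state names each side can read. The sketch below illustrates this with plain enum parsing as a stand-in for the real deserialization, which it does not reproduce. `LegacyState`, `CompatState`, and `StateParsing` are hypothetical names introduced here for illustration.

```java
import java.util.Optional;

/** What a component from a previous version knows: only the two legacy states. */
enum LegacyState { UP, DEAD }

/** What an upgraded component knows: the new states plus the deprecated legacy ones. */
enum CompatState {
    RUNNING, DISCONNECTED, TERMINATING, TERMINATED_GRACEFUL, TERMINATED_FORCED, NOT_RUNNING,
    UP, DEAD; // deprecated, kept so that older workers can still be understood

    boolean isLegacy() {
        return this == UP || this == DEAD;
    }
}

final class StateParsing {
    private StateParsing() {
    }

    /** Returns the matching constant, or empty when the name is unknown to this version. */
    static <E extends Enum<E>> Optional<E> parse(Class<E> type, String raw) {
        try {
            return Optional.of(Enum.valueOf(type, raw));
        } catch (IllegalArgumentException e) {
            // Unknown state name: the analogue of the deserialization failure described above.
            return Optional.empty();
        }
    }
}
```

Here `StateParsing.parse(CompatState.class, "UP")` yields a legacy state that an upgraded component can still interpret, whereas `StateParsing.parse(LegacyState.class, "RUNNING")` is empty, which mirrors an older executor not recognizing upgraded workers.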

fhussonnois avatar Feb 15 '24 14:02 fhussonnois

A worker will re-register itself when a heartbeat fails because the current state is either DEAD or EMPTY (i.e., the worker was removed by the executor).

I'm not convinced this is the right choice here, as a new worker may have taken over its jobs; that's why we exit it.

loicmathieu avatar Feb 15 '24 16:02 loicmathieu

Discussed in the meeting with @fhussonnois - those will be the states added only for JDBC architecture at first:

  • RUNNING
  • TERMINATING
  • TERMINATED_GRACEFULLY
  • TERMINATED_FORCED
  • DISCONNECTED
  • NOT RUNNING

as a replacement for just the UP/DEAD health states: [image]

anna-geller avatar Feb 19 '24 10:02 anna-geller

A good addition for later would be to make the re-submission of tasks optional; it may be worth opening a follow-up issue about it.

loicmathieu avatar Mar 18 '24 11:03 loicmathieu