Update/Reload without downtime

Open daipom opened this issue 1 year ago • 1 comments

Which issue(s) this PR fixes:

  • Fixes #4622

What this PR does / why we need it: See #4622.

Specification:

  1. The supervisor receives SIGUSR2.
  2. Spawn a new supervisor.
  3. Take over shared sockets.
  4. Launch new workers, and stop old processes in parallel.
    • Launch new workers with source-only mode
      • Limited to input plugins that report restart_without_downtime_ready?
    • Send SIGTERM to the old supervisor after a 10s delay from step 3.
  5. The old supervisor stops and sends SIGRTMIN(34) to the new one.
  6. The new workers run fully.
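
As a rough illustration of the ordering in the steps above, here is a toy model in plain Ruby. It sends no real signals and is not Fluentd's implementation: `ToySupervisor` and all of its methods are hypothetical names, and the 10s delay is only noted in a comment.

```ruby
# A toy model of the handoff steps listed above. All class and method names
# here are hypothetical; this is not Fluentd's implementation.
class ToySupervisor
  attr_reader :name, :mode, :events
  attr_accessor :peer

  def initialize(name, mode: :full, events: [])
    @name = name
    @mode = mode        # :full, :source_only, or :stopped
    @events = events    # log shared by both supervisors, in order
  end

  # Steps 1-4: after SIGUSR2, a new supervisor is spawned, takes over the
  # shared sockets, and starts its workers in source-only mode.
  def receive_sigusr2
    events << "#{name}: received SIGUSR2"
    successor = ToySupervisor.new("new", mode: :source_only, events: events)
    successor.peer = self
    self.peer = successor
    events << "new: spawned, took over shared sockets, workers in source-only mode"
    successor
  end

  # Step 4 (after the delay) and step 5: the old supervisor stops, then
  # notifies the new one.
  def receive_sigterm
    events << "#{name}: received SIGTERM, stopping"
    @mode = :stopped
    peer&.receive_sigrtmin
  end

  # Step 6: the new workers leave source-only mode and run fully.
  def receive_sigrtmin
    events << "#{name}: received SIGRTMIN, workers run fully"
    @mode = :full
  end
end

old_sup = ToySupervisor.new("old")
new_sup = old_sup.receive_sigusr2   # steps 1-4
old_sup.receive_sigterm             # in the real flow, after the 10s delay
puts old_sup.events
```

The point of the model is that the new supervisor stays in source-only mode until the old one has actually stopped, so the two never run fully at the same time.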

[Image: Screenshot from 2024-10-11 09-38-28]

Supported input plugins:

  • in_tcp
  • in_udp
  • in_syslog

Needs the following:

  • #4661
    • (Included in this branch temporarily)
  • https://github.com/treasure-data/serverengine/pull/146

Docs Changes: TODO

Release Note: TODO

TODO:

  • [ ] Some implementation TODOs referred to in code comments.
  • [ ] Tests
  • [ ] Document

daipom avatar Aug 30 '24 07:08 daipom

The basic implementation is done. Some concepts from #4654 are reflected. Thanks @Watson1978!

daipom avatar Oct 11 '24 01:10 daipom

Thanks for your review!

daipom avatar Nov 27 '24 02:11 daipom

During zeroDowntimeRestart, other HTTP endpoints are left in a non-guarded state. Is this intentional?

kenhys avatar Nov 27 '24 02:11 kenhys

During zeroDowntimeRestart, other HTTP endpoints are left in a non-guarded state. Is this intentional?

Yes. The old Fluentd should continue to work as is until it receives SIGTERM at step 4 (even if the new Fluentd does not work as expected).

The new Fluentd RPC starts at step 5, so there is no conflict.

If the old Fluentd receives /api/processes.killWorkers, this just causes a quick transition to step 5.
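
For reference, a call to that endpoint can be sketched as follows. The address `127.0.0.1:24444` is an assumption and must match whatever `rpc_endpoint` is configured in the `<system>` section; `rpc_uri` is a hypothetical helper, and the request is only sent when the script is run with `--send`.

```ruby
require "net/http"
require "uri"

# Build the URI for a Fluentd RPC call. The default host and port are
# assumptions; they must match the configured rpc_endpoint.
def rpc_uri(path, endpoint: "127.0.0.1:24444")
  URI("http://#{endpoint}#{path}")
end

# Send the request only when explicitly asked to, against a live Fluentd.
if ARGV.first == "--send"
  puts Net::HTTP.get(rpc_uri("/api/processes.killWorkers"))
end
```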

daipom avatar Nov 27 '24 03:11 daipom

Thanks for your review!

daipom avatar Nov 28 '24 04:11 daipom