Update/Reload without downtime
Which issue(s) this PR fixes:
- Fixes #4622
What this PR does / why we need it: See #4622.
Specification:
1. The supervisor receives SIGUSR2.
2. Spawn a new supervisor.
3. Take over shared sockets.
4. Launch new workers, and stop old processes in parallel.
   - Launch new workers in source-only mode.
     - Limited to input plugins that support `restart_without_downtime_ready?`.
   - Send SIGTERM to the old supervisor after a 10s delay from step 3.
5. The old supervisor stops and sends SIGRTMIN(34) to the new one.
6. The new workers run fully.
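
The ordering of the steps above can be sketched as a small in-process model. This is only an illustration, assuming six top-level steps: it uses no real processes, sockets, or signals, and `ModelSupervisor`, `model_restart`, and the socket label are hypothetical names that do not exist in Fluentd or ServerEngine.

```ruby
# Simplified model of the handoff order. Nothing here is Fluentd's actual API;
# every class, method, and label is hypothetical and exists only for illustration.
class ModelSupervisor
  attr_reader :name, :mode, :sockets

  def initialize(name, mode: :full, sockets: [])
    @name = name
    @mode = mode        # :full, :source_only, or :stopped
    @sockets = sockets  # stands in for the shared listening sockets
  end

  def stop!
    @mode = :stopped
  end

  def run_fully!
    @mode = :full
  end
end

# Walks through the steps in order and records one event per step.
def model_restart(old_sup)
  events = []
  events << "1: #{old_sup.name} supervisor receives SIGUSR2"

  # The new supervisor starts in source-only mode and reuses the same sockets.
  new_sup = ModelSupervisor.new("new", mode: :source_only, sockets: old_sup.sockets)
  events << "2: #{new_sup.name} supervisor is spawned"
  events << "3: #{new_sup.name} supervisor takes over #{new_sup.sockets.size} shared socket(s)"
  events << "4: #{new_sup.name} workers start in source-only mode; SIGTERM goes to #{old_sup.name} after the delay"

  old_sup.stop!
  events << "5: #{old_sup.name} supervisor stops and signals #{new_sup.name}"

  new_sup.run_fully!
  events << "6: #{new_sup.name} workers run fully"

  [new_sup, events]
end

old_sup = ModelSupervisor.new("old", sockets: ["shared-socket-0"])
new_sup, events = model_restart(old_sup)
puts events
```

The point of the model is that the old supervisor stays in its normal mode until step 5, while the new one only leaves source-only mode in step 6.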
Supported input plugins:
- in_tcp
- in_udp
- in_syslog
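
As a rough sketch of how the `restart_without_downtime_ready?` limit might work, the example below filters a list of inputs by that predicate. Only the predicate name comes from this PR; the classes, the `false` default, and the `source_only_candidates` helper are assumptions made for illustration, not the actual implementation.

```ruby
# Hypothetical stand-ins for input plugins; not Fluentd's real classes.
class ModelInput
  # Assumed default: a plugin does not opt in unless it overrides this.
  def restart_without_downtime_ready?
    false
  end
end

class ModelSocketInput < ModelInput
  # Assumed opt-in for an input that listens on a shared socket.
  def restart_without_downtime_ready?
    true
  end
end

class ModelOtherInput < ModelInput
end

# Hypothetical helper: keep only the inputs that opted in.
def source_only_candidates(inputs)
  inputs.select(&:restart_without_downtime_ready?)
end

inputs = [ModelSocketInput.new, ModelOtherInput.new]
candidates = source_only_candidates(inputs)
puts candidates.map { |input| input.class.name }
```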
Requires the following:
- #4661 (included in this branch temporarily)
- https://github.com/treasure-data/serverengine/pull/146
Docs Changes: TODO
Release Note: TODO
TODO:
- [ ] Remaining implementation TODOs noted in code comments.
- [ ] Tests
- [ ] Document
The basic implementation is done. Some concepts from #4654 are reflected. Thanks @Watson1978!
Thanks for your review!
During zeroDowntimeRestart, other HTTP endpoints are left in a non-guarded state. Is it intentional?
> During zeroDowntimeRestart, other HTTP endpoints are left in a non-guarded state. Is it intentional?
Yes.
The old Fluentd should continue to work as is until it receives SIGTERM at step 4
(even if the new Fluentd does not work as expected).
The new Fluentd RPC starts at step 5, so there is no conflict.
If the old Fluentd receives /api/processes.killWorkers, it just causes a quick transition to step 5.
Thanks for your review!