kubernetes-mesos
scheduler HTTP API endpoint bind() fails upon failover
HA failover is triggered by:
- new scheduler master detected
- scheduler receives SIGUSR1
When a scheduler fails over, a new scheduler process is spawned with the same parameters as the original and takes over the original's stdout and stderr streams. The old scheduler then exits and the new one keeps running.
The problem is a race between the old scheduler process dying and the new one starting to listen on the scheduler HTTP API endpoint: if the new process calls bind() before the old one has released the socket, bind() fails because the address is already in use (EADDRINUSE).
It would be ideal for the old scheduler to either:
- release the socket before spawning the new process (easier)
- stop the listener and send the old socket's FD to the new process (harder)
#754