kubernetes-mesos

scheduler HTTP API endpoint bind() fails upon failover

Open · jdef opened this issue 10 years ago · 1 comment

HA failover is triggered by:

  • new scheduler master detected
  • scheduler receives SIGUSR1

When a scheduler fails over, it spawns a new scheduler process with the same parameters as the original, and the new process takes over the original's stdout and stderr streams. The old scheduler then exits and the new one keeps running.

The problem is a race between the old scheduler process exiting and the new one starting to listen on the scheduler HTTP API endpoint: if the new process calls bind() before the old one has released the socket, bind() fails because the address is already in use (EADDRINUSE).

It would be ideal for the old scheduler to either:

  • release the socket before spawning the new process (easier)
  • stop the listener and send the old socket's FD to the new process (harder)

jdef, Mar 30 '15 17:03

#754

jdef, Feb 21 '16 15:02