scheduler HTTP API endpoint bind() fails upon failover

Open jdef opened this issue 10 years ago • 1 comments

HA failover is triggered by:

- new scheduler master detected
- scheduler receives SIGUSR1

When a scheduler fails over, a new scheduler process is spawned with the same parameters as the original and takes over the original scheduler's stdout and stderr streams. The old scheduler then exits and the new one keeps running.

The problem is that there's a race between the old scheduler process dying and the new one starting to listen on the scheduler HTTP API endpoint: bind() fails because the address is already in use.

It would be ideal for the old scheduler to either:

- release the socket before spawning the new process (easier)
- stop the listener and send the old socket's FD to the new process (harder)
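Both options can be sketched with the Go standard library. Everything below is an assumption made for illustration rather than the scheduler's actual code: the function names (`releaseThenSpawn`, `handOff`, `adopt`, `roundTrip`) are hypothetical, and the hand-off relies on `exec.Cmd.ExtraFiles`, which is not supported on Windows. Entry `i` of `ExtraFiles` shows up in the child as file descriptor `3+i`, so the new process would wrap it with `os.NewFile(3, "listener")` before calling `adopt`.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"os/exec"
)

// releaseThenSpawn sketches the easier option: close the listening socket
// first, then start the successor, so the address is free when it binds.
func releaseThenSpawn(l net.Listener, cmd *exec.Cmd) error {
	if err := l.Close(); err != nil {
		return err
	}
	return cmd.Start()
}

// handOff sketches the harder option: duplicate the listening socket, let the
// child inherit the duplicate as fd 3, then stop the parent's listener.
func handOff(l *net.TCPListener, cmd *exec.Cmd) error {
	f, err := l.File() // a copy of the socket; closing l does not close f
	if err != nil {
		return err
	}
	defer f.Close() // the parent's copy is no longer needed after Start
	cmd.ExtraFiles = []*os.File{f}
	if err := cmd.Start(); err != nil {
		return err
	}
	return l.Close()
}

// adopt is what the new process would call on the inherited descriptor.
func adopt(f *os.File) (net.Listener, error) {
	return net.FileListener(f)
}

// roundTrip exercises the duplicate-and-adopt steps inside one process and
// returns the listening address before and after the hand-off.
func roundTrip() (string, string, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", "", err
	}
	before := l.Addr().String()
	f, err := l.(*net.TCPListener).File()
	if err != nil {
		l.Close()
		return "", "", err
	}
	defer f.Close()
	if err := l.Close(); err != nil {
		return "", "", err
	}
	adopted, err := adopt(f)
	if err != nil {
		return "", "", err
	}
	defer adopted.Close()
	return before, adopted.Addr().String(), nil
}

func main() {
	before, after, err := roundTrip()
	fmt.Println(before, after, err)
}
```

With the hand-off the socket is never unbound, so the new process has nothing to race against. The easier option still leaves a short window in which the endpoint is not listening at all.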

Mar 30 '15 17:03 jdef

#754

Feb 21 '16 15:02 jdef