
Constd startup / shutdown with load shedding

Open javorszky opened this issue 2 years ago • 2 comments

This came up as part of trying to solve #71. The test itself passes, but when it's time to tear down constd, it's not happening. Thus began my deep dive into how the individual services handle startup and shutdown.

What is constd doing?

Constd starts up, then does two things periodically (every second):

  • reconciles atmo
  • reconciles sats

In super simple terms, it starts up a single atmo-proxy server, and a bunch of sat instances that then talk to the atmo. They all run in an infinite loop until we terminate them somehow. Usually that termination is a ctrl+c, a SIGINT.

What's the issue?

During the test I have constd running as an exec.Cmd. Ideally you can send a termination signal to it with cmd.Process.Kill(), except for whatever reason, this does not happen in this case.

Why does this (not) happen?

At the very base of the issue is the fact that all services (constd (in my branch), sat, and atmo) listen for the SIGINT and SIGTERM signals with a channel of type os.Signal, and a signal.Notify call that relays those two signals into that channel.

When constd receives the signal, it should capture that, and then decide what to do with it. Because it has dependent / child processes (all the sats, and the atmo), it should orchestrate how to tear them down before finally quitting itself.

Because all services listen to the same signals, the moment the host sends that signal, every running service is going to receive it, and they will all start tearing themselves down in whatever fashion they want. This usually leads to constd not really understanding why there aren't sats / atmo running any more, or just not realising that something has happened.

What should be happening instead?

The outermost service should be the only one that captures the signals, and then it should signal all the child instances that they should be tucking into bed right now. An example minimal implementation is here: https://github.com/javorszky/gofork
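As a rough illustration of that orchestration (separate from the linked repository, and not taken from it), the sketch below starts a few child processes, asks each one to terminate, and waits for all of them before the parent moves on. `startChildren` and `stopChildren` are hypothetical names, and `sleep 30` is only a stand-in for a sat or atmo binary, so this assumes a Unix-like system. It uses `Process.Signal(syscall.SIGTERM)` rather than `Process.Kill()`, because on Unix `Kill` sends SIGKILL, which the child cannot handle.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// startChildren launches n placeholder child processes.
func startChildren(n int) ([]*exec.Cmd, error) {
	children := make([]*exec.Cmd, 0, n)
	for i := 0; i < n; i++ {
		cmd := exec.Command("sleep", "30") // stand-in for a sat / atmo binary
		if err := cmd.Start(); err != nil {
			return children, err
		}
		children = append(children, cmd)
	}
	return children, nil
}

// stopChildren asks every child to terminate, then waits for each one to
// exit. It returns the number of children that were reaped.
func stopChildren(children []*exec.Cmd) int {
	for _, c := range children {
		// Best effort: the child may already have exited.
		_ = c.Process.Signal(syscall.SIGTERM)
	}

	stopped := 0
	for _, c := range children {
		// Wait returns an *exec.ExitError for a signal-terminated child;
		// here we only care that the process is gone.
		_ = c.Wait()
		if c.ProcessState != nil {
			stopped++
		}
	}
	return stopped
}

func main() {
	children, err := startChildren(3)
	if err != nil {
		fmt.Println("start failed:", err)
	}

	// In the parent service this would run after SIGINT / SIGTERM is captured.
	fmt.Println("children stopped:", stopChildren(children))
}
```

The parent only exits after every child has been waited on, so it always knows why its children are gone.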

The conflict there is whether we want to keep atmo / the sats running even after constd dies in such a way that a new instance of constd can pick up where it left off.

javorszky, May 17 '22 13:05

the moment the host sends that signal, every running service is going to receive it

I'm fairly certain if you invoke kill -15 <pid> only the provided pid will receive the signal.

If you use Ctrl+C, however, it will send the signal to the entire foreground process group.

To make matters worse I'm pretty sure the behavior of Ctrl+C is shell dependent 😢

If we want sat/atmo instance lifecycles bound to constd's lifecycle then I think running the instances in the same process as goroutines makes sense.

If we want to be able to restart constd without disrupting the sat/atmo instances it manages we may want to consider something different.

For instance, if constd maintains a list of the pids it's managing, it could (golang permitting) use the wait system call to halt execution until the child process has exited.

rnpridgeon, May 17 '22 14:05

Turns out that the SIGINT / SIGTERM is received (we've confirmed that much), except the profile deletion / the service picking up that it should stop working is borked. Further sleuthing coming up.

javorszky, May 17 '22 15:05