[rfc] redesign the process management tree
supervisor3 was a bad choice
`supervisor3` is stale, lagging behind the changes in OTP's `supervisor.erl`. The desired features from `supervisor3` can be addressed with the standard `supervisor` plus some additional retry mechanism:
- delayed restart (we need our own delay-retry loop; see the sketch after this list)
- children restarts do not cause supervisor restart (this is not really necessary as the children are long-lived in brod)
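A minimal sketch of the "own delay-retry loop" idea, assuming a plain OTP supervisor above it: instead of crashing and relying on `supervisor3`'s delayed restart, the child defers its fallible setup out of `init/1` and retries it on a timer, so the supervisor's restart intensity is never exhausted. Module, function and constant names here are illustrative, not brod's actual API.

```erlang
-module(delayed_retry_worker).
-behaviour(gen_server).

-export([start_link/1, init/1, handle_info/2, handle_call/3, handle_cast/2]).

-define(RETRY_DELAY_MS, 5000).

start_link(Args) ->
    gen_server:start_link(?MODULE, Args, []).

init(Args) ->
    %% Do not block the supervisor in init/1; kick off the first attempt
    %% asynchronously.
    self() ! try_connect,
    {ok, #{args => Args, conn => undefined}}.

handle_info(try_connect, #{args := Args} = State) ->
    case try_setup(Args) of
        {ok, Conn} ->
            {noreply, State#{conn := Conn}};
        {error, _Reason} ->
            %% Delay-retry loop instead of letting the supervisor restart us.
            erlang:send_after(?RETRY_DELAY_MS, self(), try_connect),
            {noreply, State}
    end.

handle_call(_Req, _From, State) -> {reply, ignored, State}.
handle_cast(_Msg, State) -> {noreply, State}.

%% Placeholder for whatever setup may fail transiently (e.g. a TCP connect).
try_setup(_Args) -> {error, not_implemented}.
```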
not following OTP design principles
Having `brod_client` directly `spawn_link` the consumer and producer supervisors makes those processes unreachable from the application root supervisor, i.e. it is not a single tree.
This makes the release handler unable to traverse all processes in the supervision tree when performing hot-upgrades.
design proposal
- Use a supervision tree for all processes.
- `brod_client` should run in a delayed-retry loop to start `brod_producers_sup` and/or `brod_consumers_sup` per configuration.
- Once a `brod_producers|consumers_sup` is started, it should not die at connection restarts (in fact this is the current behavior; it runs in a loop to poll a new leader connection from `brod_client`).
```
brod_sup
 |
 +-- brod_clients_sup
 |     |
 |     +-- brod_client (worker)
 |
 +-- brod_producers_sup
 |     |......
 |
 +-- brod_consumers_sup
       |......
```
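For illustration, the top-level supervisor for the tree above could look roughly like the sketch below. This is an assumption-laden sketch, not brod's implementation: it treats `brod_clients_sup`, `brod_producers_sup` and `brod_consumers_sup` as ordinary OTP supervisors and omits the per-configuration, delayed-retry startup described in the proposal.

```erlang
-module(brod_sup_sketch).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 10, period => 10},
    Children =
        [ #{id => brod_clients_sup,
            start => {brod_clients_sup, start_link, []},
            type => supervisor,
            shutdown => infinity}
        , #{id => brod_producers_sup,
            start => {brod_producers_sup, start_link, []},
            type => supervisor,
            shutdown => infinity}
        , #{id => brod_consumers_sup,
            start => {brod_consumers_sup, start_link, []},
            type => supervisor,
            shutdown => infinity}
        ],
    {ok, {SupFlags, Children}}.
```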
> Once a `brod_producers|consumers_sup` is started, it should not die at connection restarts (in fact this is the current behavior; it runs in a loop to poll a new leader connection from `brod_client`).
How are producers and consumers expected to behave when a client dies?
For consumers, it is just a matter of waiting for a new connection to be established to the partition leader(s). For producers, they will need a backlog of pending produce requests; in that backlog (see the sketch after this list):
- a request may expire (if it was sent via an API call with a timeout)
- a request might be dropped after the backlog reaches its capacity limit
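A rough sketch of the kind of bounded backlog described above: pending produce requests are queued with a deadline, expired entries are dropped, and the oldest entry is dropped when the capacity limit is reached. Module, function names and the eviction policy are illustrative assumptions, not brod's API.

```erlang
-module(produce_backlog_sketch).
-export([new/1, enqueue/3]).

%% Backlog = {Capacity, queue of {DeadlineMs, Request}}
new(Capacity) -> {Capacity, queue:new()}.

enqueue(Request, TimeoutMs, {Capacity, Q0}) ->
    Now = erlang:monotonic_time(millisecond),
    %% Drop requests whose deadline has already passed.
    Q1 = queue:filter(fun({Deadline, _Req}) -> Deadline > Now end, Q0),
    %% Drop the oldest entry if the backlog is at its capacity limit.
    Q2 = case queue:len(Q1) >= Capacity of
             true  -> queue:drop(Q1);
             false -> Q1
         end,
    {Capacity, queue:in({Now + TimeoutMs, Request}, Q2)}.
```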
This is an "RFC" so here's a "C", heh!
I found myself on this issue after tracking down some error logs my app occasionally emits when shutting down as part of new deploys. The logs look like this:
```
12/20/2023 7:17:21.434 PM +0000 Last message: {:EXIT, #PID<0.28018.0>, :killed}
19:17:21.434 [][error] GenServer #PID<0.28046.0> terminating ** (stop) killed
```
From dumping the process IDs prior to the deploy, I cross-referenced the PIDs and saw that 0.28018.0 is a supervisor3 and 0.28046.0 is a brod_producer. There are typically a bunch of them logged like this, all belonging to the same supervisor. The Kafka topic has 64 partitions, and we might get ~10 producers logging like this.
This is in response to the SIGTERM the app receives, since we don't do hot upgrades; we just bring up the new version on a new instance and bring down the old one. NB: this means there might be some rebalancing happening at the time the app is trying to shut down.
It's not the worst thing, but I would like for the app to shut down cleanly.
I know the general way Erlang apps shut down is recursively through the supervision tree in reverse start order. And I know an OTP supervisor will try to gently stop a child with a timeout as configured by its child_spec, and then kill it if that fails. And I know the official guidelines are to set the shutdown timeout for supervisors to infinity, to give them time to shut down their subtree, in contrast to workers, which default to 5 seconds.
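To illustrate the shutdown values discussed here (this is not brod's code; the module and child names are made up): supervisors are usually given `infinity` so they can drain their subtree, while workers get a finite grace period (5000 ms by default) before the parent falls back to a brutal kill.

```erlang
-module(shutdown_spec_example).
-export([child_specs/0]).

child_specs() ->
    [ #{id => some_sub_sup,
        start => {some_sub_sup, start_link, []},
        type => supervisor,
        shutdown => infinity}      % wait for the whole subtree to stop
    , #{id => some_worker,
        start => {some_worker, start_link, []},
        type => worker,
        shutdown => 5000}          % killed if terminate/2 takes longer than 5 s
    ].
```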
I don't know exactly how supervisor3 compares to the official OTP supervisors and if they follow a similar practice.
And I also don't know how best to handle brod_client, which, according to the (excellently diagrammed!) process tree in the brod README, is a worker that has supervisors as children. I suppose it shouldn't have the default 5-second shutdown timeout, since it has children with infinite timeouts.
I haven't quite worked out the timing and exit signals, but I'm currently thinking that my error logs have to do with brod_client taking too long to shut down when the app receives a SIGTERM, and then being killed by the BEAM. But as I think about how it should be configured, I get a little confused. Should its exit timeout be infinity like a supervisor? Or finite (but maybe longer than the default 5 seconds) since it's also a worker with TCP connections and such?
So I support the process architecture in the RFC here, if for no other reason than it's a bit easier for me to reason about when shutting down the application.