
[rfc] redesign the process management tree

Open zmstone opened this issue 2 years ago • 3 comments

supervisor3 was a bad choice

  • supervisor3 is stale, lagging behind the changes in OTP's supervisor.erl
  • the desired features from supervisor3 can be addressed with OTP's supervisor plus some additional retry mechanism:
    • child restarts do not cause a supervisor restart (not really necessary anyway, as the children are long-lived in brod)
    • delayed restart (we need our own delay-retry loop; sketched below)
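
A minimal sketch of such a delay-retry loop, assuming a plain gen_server that owns the retry timer; the module name and the my_sup:start_child/0 call are placeholders for illustration, not existing brod APIs:

  %% Illustrative only: keep trying to start a child under a normal OTP
  %% supervisor, waiting a fixed delay between failed attempts.
  -module(delayed_starter).
  -behaviour(gen_server).

  -export([start_link/0, init/1, handle_info/2, handle_call/3, handle_cast/2]).

  -define(RETRY_DELAY_MS, 5000).

  start_link() ->
      gen_server:start_link(?MODULE, [], []).

  init([]) ->
      self() ! try_start,            %% first attempt right after init
      {ok, #{}}.

  handle_info(try_start, State) ->
      case my_sup:start_child() of   %% placeholder supervisor API
          {ok, _Pid} ->
              {noreply, State};
          {error, _Reason} ->
              %% delayed restart: schedule another attempt later
              erlang:send_after(?RETRY_DELAY_MS, self(), try_start),
              {noreply, State}
      end.

  handle_call(_Req, _From, State) -> {reply, ok, State}.
  handle_cast(_Msg, State) -> {noreply, State}.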

not following OTP design principles

brod_client directly spawn_links the consumer and producer supervisors, which makes those processes unreachable from the application root supervisor, i.e. it's not a tree. This prevents the release handler from traversing all processes in the supervision tree to perform hot upgrades.

design proposal

  • Use a supervision tree for all processes.
  • brod_client should run a delayed-retry loop to start brod_producers_sup and/or brod_consumers_sup, per configuration.
  • Once a brod_producers_sup or brod_consumers_sup is started, it should not die on connection restarts (this is in fact the current behavior: it runs in a loop to poll a new leader connection from brod_client).
brod_sup
  |
  +-- brod_clients_sup
  |          |
  |          +-- brod_client (worker)
  |
  +-- brod_producers_sup
  |           |......
  |
  +-- brod_consumers_sup
              |......
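
A minimal sketch of what brod_sup's init/1 could look like under this proposal; the child modules and their start arguments are placeholders taken from the diagram above, not a final design:

  %% Illustrative top-level supervisor for the proposed tree.
  -module(brod_sup).
  -behaviour(supervisor).

  -export([start_link/0, init/1]).

  start_link() ->
      supervisor:start_link({local, ?MODULE}, ?MODULE, []).

  init([]) ->
      SupFlags = #{strategy => one_for_one, intensity => 10, period => 5},
      Children =
          [#{id => brod_clients_sup,
             start => {brod_clients_sup, start_link, []},
             type => supervisor, shutdown => infinity},
           #{id => brod_producers_sup,
             start => {brod_producers_sup, start_link, []},
             type => supervisor, shutdown => infinity},
           #{id => brod_consumers_sup,
             start => {brod_consumers_sup, start_link, []},
             type => supervisor, shutdown => infinity}],
      {ok, {SupFlags, Children}}.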

zmstone avatar Jan 01 '23 14:01 zmstone

Once a brod_producers_sup or brod_consumers_sup is started, it should not die on connection restarts (this is in fact the current behavior: it runs in a loop to poll a new leader connection from brod_client).

How are producers and consumers expected to behave when a client dies?

v0idpwn avatar Jan 02 '23 06:01 v0idpwn

Once a brod_producers_sup or brod_consumers_sup is started, it should not die on connection restarts (this is in fact the current behavior: it runs in a loop to poll a new leader connection from brod_client).

How are producers and consumers expected to behave when a client dies?

For consumers, it's just a matter of waiting for a new connection to the partition leader(s) to be established. For producers, they will need to keep a backlog of pending produce requests (sketched below); in the backlog:

  • a request may expire (if it was sent via an API call with a timeout)
  • a request may be dropped once the backlog reaches its capacity limit.
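
A rough sketch of such a backlog, assuming a plain queue with a hard capacity limit and a per-request deadline; the module, the function names and the drop policy are assumptions for illustration, not brod's actual implementation:

  %% Illustrative backlog for produce requests pending while the
  %% partition-leader connection is down.
  -module(produce_backlog).

  -export([new/1, in/3, drop_expired/2]).

  %% A backlog is a queue of {DeadlineMs, Request} plus a max size.
  new(MaxSize) ->
      #{queue => queue:new(), size => 0, max => MaxSize}.

  %% Enqueue a request, or reject it when the backlog is at capacity.
  in(Req, DeadlineMs, #{queue := Q, size := N, max := Max} = B) when N < Max ->
      {ok, B#{queue := queue:in({DeadlineMs, Req}, Q), size := N + 1}};
  in(_Req, _DeadlineMs, B) ->
      {error, backlog_full, B}.

  %% Drop requests whose deadline has passed (e.g. produce API timeout);
  %% return the pruned backlog and the dropped requests.
  drop_expired(NowMs, #{queue := Q} = B) ->
      {Kept, Dropped} =
          lists:partition(fun({Deadline, _Req}) -> Deadline > NowMs end,
                          queue:to_list(Q)),
      {B#{queue := queue:from_list(Kept), size := length(Kept)},
       [Req || {_Deadline, Req} <- Dropped]}.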

zmstone avatar Feb 17 '23 12:02 zmstone

This is an "RFC" so here's a "C", heh!

I found myself on this issue after tracking down some error logs my app occasionally emits when shutting down as part of new deploys. The logs look like this:

12/20/2023 7:17:21.434 PM +0000
19:17:21.434 [][error] GenServer #PID<0.28046.0> terminating
    ** (stop) killed
    Last message: {:EXIT, #PID<0.28018.0>, :killed}

By dumping the process IDs prior to the deploy, I cross-referenced the PIDs and saw that 0.28018.0 is a supervisor3 process and 0.28046.0 is a brod_producer. There's typically a bunch of them logged like this, all belonging to the same supervisor. The Kafka topic has 64 partitions, and we might get ~10 producers logging like this.

This is in response to a SIGTERM the app receives, since we don't do hot upgrades; we just bring up the new version in a new instance and bring down the old one. NB: this means there might be some rebalancing happening at the time the app is trying to shut down.

It's not the worst thing, but I would like for the app to shut down cleanly.

I know the general way Erlang apps shut down is recursively through the supervision tree, in reverse start order. And I know an OTP supervisor will first try to stop a child gracefully, waiting for the timeout configured in its child_spec, and then kill it if that fails. And I know the official guideline is to set the shutdown timeout for supervisors to infinity, to give them time to shut down their subtree, in contrast to workers, which default to 5 seconds.
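
For reference, this is roughly how those settings appear in OTP child specs; the module names here are placeholders:

  %% A worker child: the shutdown timeout defaults to 5000 ms; the
  %% supervisor sends exit(Pid, shutdown), waits, then exit(Pid, kill).
  WorkerSpec = #{id       => some_worker,
                 start    => {some_worker, start_link, []},
                 type     => worker,
                 shutdown => 5000},

  %% A supervisor child: conventionally 'infinity', so it gets time to
  %% stop its own subtree first.
  SupSpec = #{id       => some_sup,
              start    => {some_sup, start_link, []},
              type     => supervisor,
              shutdown => infinity}.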

I don't know exactly how supervisor3 compares to the official OTP supervisors and if they follow a similar practice.

And I also don't know how best to handle brod_client, which, according to the (excellently diagrammed!) process tree in the brod readme, is a worker that has supervisors as children. I suppose it shouldn't have the default 5-second shutdown timeout, since it has children with infinite timeouts.

I haven't quite worked out the timing and exit signals, but I'm currently thinking that my error logs have to do with brod_client taking too long to shut down when the app receives a SIGTERM, and then being killed by the BEAM. But as I think about how it should be configured, I get a little confused. Should its exit timeout be infinity like a supervisor? Or finite (but maybe longer than the default 5 seconds) since it's also a worker with TCP connections and such?
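
To make the trade-off concrete, a hypothetical child spec for brod_client under the current tree could look like either of the following; the start arguments and the 30000 value are placeholders, not a recommendation:

  %% Finite, but longer than the 5 s worker default, hoping its
  %% producer/consumer supervisors get a chance to stop first:
  #{id       => my_client,
    start    => {brod_client, start_link, [Endpoints, ClientId, Config]},
    type     => worker,
    shutdown => 30000}.

  %% Or treat it like a supervisor (infinity), because it effectively
  %% owns supervisors, and accept that a stuck client blocks shutdown:
  #{id       => my_client,
    start    => {brod_client, start_link, [Endpoints, ClientId, Config]},
    type     => worker,
    shutdown => infinity}.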

So I support the process architecture in the RFC here, if for no other reason than it's a bit easier for me to reason about when shutting down the application.

losvedir avatar Dec 26 '23 11:12 losvedir