
Improve secure channels to clean up any spawned workers

adrianbenavides opened this issue on Aug 07 '22 · 19 comments

When creating a secure channel, multiple workers are spawned, but only the address of the top-level worker is returned, which makes it impossible to release all of the resources the channel uses.

We use "temporary secure channels" in ockam_api/src/cloud, and right now, every time we call one of these functions, the worker count on the node goes up. We need a different approach to creating secure channels so that all acquired resources can be released.

adrianbenavides · Aug 07 '22 16:08

Hi, a noob question: is this about delete_secure_channel(...)? The Context that sends a graceful shutdown request to the router? And the spawned workers are supposed to follow the same shutdown procedure?

Retamogordo · Aug 09 '22 18:08

@Retamogordo yes, create_secure_channel spawns a few workers. There is delete_secure_channel here, which naively just stops the worker at the address being used to send messages to the channel.

We need a control protocol that tells the secure channel to do a graceful shutdown of all the workers it spawned and then shut itself down.
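As a sketch of what that could look like once the channel keeps a list of the addresses it spawned (all names here are illustrative, not existing API):

use ockam::{Address, Context, Result};

// On a "shutdown" control message, the channel's top-level worker stops every
// worker it spawned (encryptor, decryptor, ...) and then stops itself.
async fn handle_channel_shutdown(
    ctx: &Context,
    spawned: Vec<Address>,
    own_address: Address,
) -> Result<()> {
    // Stop every worker this channel created ...
    for addr in spawned {
        ctx.stop_worker(addr).await?;
    }
    // ... and then stop the channel's own top-level worker.
    ctx.stop_worker(own_address).await
}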

mrinalwadhwa · Aug 09 '22 19:08

Hi @mrinalwadhwa, it looks like the "parent" context needs a mechanism for keeping track of its "children" entities? Does this refer to secure_channel only, or to the general case of any kind of context? It also looks like the way the router works now assumes that only one graceful shutdown with an ack is possible, while all the others that share the same cloned sender will get an error because of the dropped receiver?

Retamogordo · Aug 09 '22 20:08

it looks like the "parent" context needs a mechanism for keeping track of its "children" entities

@Retamogordo exactly, the long term plan is to model something similar to Erlang's Supervision Trees:
https://www.erlang.org/doc/design_principles/des_princ.html#supervision-trees
https://www.erlang.org/doc/man/supervisor.html
https://elixir-lang.org/getting-started/mix-otp/supervisor-and-application.html
https://hexdocs.pm/elixir/Supervisor.html

@spacekookie had a prototype of this some time ago but couldn't take it further at the time. Here's the prototype from back then, https://github.com/build-trust/ockam/pull/968/files, if you would like to explore the idea further.
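For reference, a very rough sketch of the shape this could take (the types below are invented for illustration and are not the API from the linked prototype):

use ockam::Address;

/// What to do when a supervised child stops unexpectedly.
enum RestartStrategy {
    /// Restart only the child that failed.
    OneForOne,
    /// Restart every child under this supervisor.
    OneForAll,
}

/// A supervisor tracks the workers it started; because a supervisor can itself
/// be supervised, the result is a tree whose root belongs to the node.
struct Supervisor {
    strategy: RestartStrategy,
    children: Vec<Address>,
}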

mrinalwadhwa · Aug 18 '22 00:08

Hi @mrinalwadhwa! Thank you for the links. I took a brief glance at them, and soon I'm going to explore them more thoroughly. Is it an option to use third party code for this pattern implementation? I came across bastion, which looks promising: they support both tokio and async_std and even claim that it could be compatible with no_std.

Retamogordo · Aug 19 '22 00:08

@Retamogordo Thank you for spending time on this.

Is it an option to use third party code for this pattern implementation?

It's an option, but depending on bastion would be a major change/discussion, since we have specific thoughts on how we want to evolve workers and routing for message flow authorization, etc. For now, it may be better to explore small incremental improvements to the current worker design.

mrinalwadhwa · Aug 19 '22 08:08

Hi @mrinalwadhwa! I'm trying to wrap my head around the workers/router machinery (still far from being there). However, a code snippet caught my attention:

pub async fn start(self, context: &Context) -> Result<Address> {
        info!(
            "Initializing ockam worker with access control: {:?}",
            self.mailboxes.main_mailbox().access_control(),
        );

        let mailboxes = self.mailboxes;
        let addresses = mailboxes.addresses();
        let main_address = mailboxes.main_address().clone();

        // Pass it to the context
        let (ctx, sender, ctrl_rx) = Context::new(
            context.runtime().clone(),
            context.sender().clone(),
            mailboxes,
            None,
        );

        // Then initialise the worker message relay
        WorkerRelay::<W, M>::init(context.runtime(), self.worker, ctx, ctrl_rx);

        // Send start request to router
        let (msg, mut rx) =
            NodeMessage::start_worker(addresses, sender, false, context.mailbox_count());
        context
            .sender()
            .send(msg)
            .await
            .map_err(|e| Error::new(Origin::Node, Kind::Invalid, e))?;

        // Wait for the actual return code
        rx.recv()
            .await
            .ok_or_else(|| NodeError::NodeState(NodeReason::Unknown).internal())??;

        Ok(main_address)
    }

Apparently there is a potential condition where the worker is rejected by the router, or the router is even shutting down. At that point the function returns an error, but the worker relay has already launched its loop. As far as I understand from the code (it's likely I'm wrong), the break condition of that loop implies that a ctrl signal is explicitly sent by the router (or whatever entity holds a clone of ctrl_tx), but I think this signal is not sent upon rejection. In case the router is shutting down, it looks like the signal is not sent either. So it could lead to a situation where the loop keeps running even though the worker was never started. I wrote a couple of lines to fix it (a rough sketch of the kind of change I mean is below); I ran all the tests and they pass, so at least it seems I didn't break anything with this change.
My apologies if it's a false alarm, still learning.
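For concreteness, one shape such a guard could take over the tail of start() above. Here ctrl_tx and CtrlSignal::InterruptStop are assumed names for the sending half of the relay's control channel and its stop signal; the real code may expose them differently:

// Sketch only: if the start request can't be delivered, or the router answers
// with an error or hangs up, tell the relay loop spawned above to stop instead
// of leaving it running for a worker that never started.
if let Err(e) = context.sender().send(msg).await {
    let _ = ctrl_tx.send(CtrlSignal::InterruptStop).await;
    return Err(Error::new(Origin::Node, Kind::Invalid, e));
}
match rx.recv().await {
    Some(Ok(_)) => Ok(main_address),
    Some(Err(e)) => {
        let _ = ctrl_tx.send(CtrlSignal::InterruptStop).await;
        Err(e)
    }
    None => {
        let _ = ctrl_tx.send(CtrlSignal::InterruptStop).await;
        Err(NodeError::NodeState(NodeReason::Unknown).internal())
    }
}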

Retamogordo · Aug 28 '22 00:08

@Retamogordo Thank you for digging into this. Interesting 🤔. We're working toward a release this coming week, so I'll think about this a bit over the next day or two. Could you try to come up with a failing test case along with your fix?

@SanjoDeundiak @antoinevg @twittner may have thoughts.

mrinalwadhwa · Aug 28 '22 00:08

Oh, maybe I was unclear. I didn't find this by running anything; I was just looking at how the system is built. I mean, I ran all the tests after I introduced the code modification just to check that I didn't break anything, but I still don't have a failing test written for this case. But I can think of how to test something relevant about it, if this is what you mean.

Retamogordo · Aug 28 '22 00:08

But I can think of how to test something relevant about it, if this is what you mean.

That was exactly what I was trying to convey. A simple test that we know will fail would be very helpful.

mrinalwadhwa · Aug 28 '22 00:08

Ok

Retamogordo · Aug 28 '22 00:08

Ah, if you get a chance to look at this, please also check:

       loop {
            crate::tokio::select! {
                result = self.recv_message() => {
                    match result {
                        // Successful message handling -- keep running
                        Ok(true) => {},
                        // Successful message handling -- stop now
                        Ok(false) => {
                            break;
                        },
                        // An error occurred -- log and continue
                        Err(e) => error!("Error encountered during '{}' message handling: {}", address, e),
                    }
                },
                result = ctrl_rx.recv() => {
                    if result.is_some() {
                        debug!("Relay received shutdown signal, terminating!");
                        break;
                    }

                    // We are stopping
                }
            };
        }

In this snippet:

               result = ctrl_rx.recv() => {
                    if result.is_some() {
                        debug!("Relay received shutdown signal, terminating!");
                        break;
                    }

when the channel is gone, recv() returns None, and this case is not covered. If I'm not wrong, this could mean that the router is down? So, the same thing: no return from the loop.
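A small sketch of what covering that case could look like, as a change to the branch above rather than what the code currently does:

// Sketch: treat a closed control channel the same as an explicit shutdown
// signal, so the loop cannot be left running if the sender side disappears.
result = ctrl_rx.recv() => {
    match result {
        Some(_) => {
            debug!("Relay received shutdown signal, terminating!");
            break;
        }
        None => {
            debug!("Relay control channel closed, terminating!");
            break;
        }
    }
}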

Retamogordo · Aug 28 '22 01:08

Hi @mrinalwadhwa, I think at least one scenario with a stopped router is a false alarm; it is covered by the break statement in the worker_relay loop. Sorry for the trouble. I'll still check what happens on rejection, though.

Retamogordo · Aug 28 '22 14:08

Oh no trouble at all, this kind of thinking and exploration is incredibly valuable feedback!! Thank you.

mrinalwadhwa · Aug 28 '22 14:08

Hi @mrinalwadhwa, I must retract my statements regarding possible issues in that code; I didn't notice that there is a significant timeout before the node shutdown that should suffice for cleanup. The only potential race condition could emerge from using stop_now with a multi-threaded executor. I don't know how to emulate such an environment, so I can't produce a comprehensive test for my conjecture. Anyway, this must be a rather rare corner case. Sorry for messing things up.
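For what it's worth, a plain-tokio sketch of the kind of environment meant here; whether the ockam node executor can actually be driven from a multi-threaded runtime like this is an assumption, not something verified in this thread:

// Requires tokio with the "macros" and "rt-multi-thread" features.
#[tokio::test(flavor = "multi_thread", worker_threads = 4)]
async fn stop_now_under_parallel_executor() {
    // Start a node, spawn a few workers, then race message handling against an
    // immediate stop_now() from another task and assert that no relay loop is
    // left running afterwards.
}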

Retamogordo · Aug 29 '22 00:08

significant timeout before the node shutdown

This may have been a hack to hide some errors while we focused on other pieces. So revisiting it, fixing race conditions, and aiming for a more graceful node shutdown is, I think, a worthy cause to pursue. I think this will also help us as we explore implementing supervision trees and recovery strategies.

mrinalwadhwa · Aug 29 '22 01:08

As for supervision trees, do you have an ongoing discussion on the topic? This was why I started to explore your code: I was wondering how supervision trees could be applied in practice. It looks like the "router" is a good candidate for a root supervisor? Maybe it would be possible to leverage Rust's scopes and oneshot channels to simplify the shutdown protocol, for example: dropping the channel on one side guarantees the event loop breaks on the other. This could avoid complex back-and-forth messaging and timeouts. And workers could run exactly the same loop as the router to become child supervisors, and so forth, isolating subtrees if needed. I don't know, just some thoughts; I haven't explored supervision trees deeply enough yet.
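A minimal, plain-tokio illustration of the "drop to shut down" idea (not ockam code): once one side drops the sender, awaiting the receiver completes with an error, which doubles as the stop signal, so no explicit shutdown message or timeout is needed.

use std::time::Duration;
use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    let (shutdown_tx, mut shutdown_rx) = oneshot::channel::<()>();

    let child = tokio::spawn(async move {
        loop {
            tokio::select! {
                // Resolves (with Err) as soon as shutdown_tx is dropped.
                _ = &mut shutdown_rx => break,
                _ = do_some_work() => {}
            }
        }
    });

    // Dropping the sender is the whole shutdown protocol.
    drop(shutdown_tx);
    child.await.unwrap();
}

async fn do_some_work() {
    tokio::time::sleep(Duration::from_millis(10)).await;
}

Whether this maps cleanly onto the existing Router/WorkerRelay control channels is an open question.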

Retamogordo · Aug 29 '22 01:08

I don't think there's been much discussion yet around the implementation of supervision trees in the runtime apart from @spacekookie 's PR.

I agree with you that using worker channel lifetimes to manage shutdown is the correct approach, though the implementation may be non-trivial with the way the code is currently organized.

As I understand it, Router is a linear map between an Address and the tokio::Sender channel for its worker.

The relationship between Router and a worker's WorkerRelay is purely "forward", i.e. it has a control channel for propagating events to WorkerRelay, but no message loop or life-cycle of its own to process events coming back.

I think this was the main reason we're using the manual Context::set_cluster as an interim mechanism to manage life-cycle dependencies between workers.
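Put roughly, a mental model of the pieces described above might look like this (illustrative stand-in types only; in particular, treating set_cluster as a label-to-addresses grouping is an assumption):

use std::collections::BTreeMap;
use tokio::sync::mpsc::Sender;

// Stand-in types; the real code uses Address and SmallSender<RelayMessage>.
type Address = String;
type RelayMessage = Vec<u8>;

struct RouterModel {
    // "Linear map" from a worker's address to the sender half of its channel.
    workers: BTreeMap<Address, Sender<RelayMessage>>,
    // Interim grouping: workers sharing a cluster label are stopped together.
    clusters: BTreeMap<String, Vec<Address>>,
}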

Before diving too much deeper into this, it would probably be a good exercise to come up with some sequence diagrams that document the interactions between:

0#app | (SmallSender, SmallReceiver) | Context | WorkerRelay | Runtime | Router

…for each of the three life-cycle stages: 1. Starting Worker, 2. Running Worker, 3. Shutting down worker

antoinevg · Aug 29 '22 10:08

@antoinevg, thank you for the reply. Yes, I think I only understand a tiny tip of the system design so far; I had imagined that the router was the main loop in the node's lifecycle, receiving commands from the outer world to control workers. I also realized from the code that my picture was possibly just upside down :) And I noticed that, as you mention, there is no closed circuit of message interchange between the Router and the WorkerRelay. I will continue to play around with the code so I have a chance to learn better how things work.

Retamogordo · Aug 29 '22 17:08

This one was fixed a while ago

SanjoDeundiak · Aug 20 '23 15:08