
[New Scheduler] Brief etcd unavailability can result in a specific action queue getting stuck if unlucky

bdoyle0182 opened this issue · 5 comments

I had an action queue get into a stuck state after about two seconds of etcd downtime, while other actions recovered gracefully. What appears to happen is that the queue endpoint key times out in etcd and no longer exists, but the controller never hears about this and continues to believe the queue is on the same scheduler endpoint (I believe a WatchEndpointRemoved event should be sent to the controllers in this case, but that doesn't seem to have happened). In the QueueManager of the scheduler the activation was sent to, it then hits the code path below, because the queue doesn't exist on that host and the attempt to remotely resolve it through etcd fails since the queue endpoint no longer exists in etcd:

      // QueueManager fallback: the queue endpoint could not be resolved from etcd,
      // so the activation is dropped with an error.
      case t =>
        logging.warn(this, s"[${msg.activationId}] activation has been dropped (${t.getMessage})")
        completeErrorActivation(msg, "The activation has not been processed: failed to get the queue endpoint.")
    }}

All requests for this action will then be dropped in the QueueManager unless the schedulers are restarted. Is there any way to make this more resilient so that if something gets stuck in this edge case, we can recover somehow without requiring a restart? @style95

Additional logs for the timeline:

How I know connectivity to etcd failed is that this log was emitted from the controller for two seconds before all activations for that one action began to fail and all other actions went back to normal.

[WARN] [#tid_T8Gk2BdDdf4PIjq8W8Ta12kD0ZAOMgwE] [ActionsApi] No scheduler endpoint available [marker:controller_loadbalancer_error:2:0]

Then this log is emitted from the invoker, for the containers that exist for the action, until the schedulers are restarted:

[ActivationClientProxy] The queue of action [REDACTED] does not exist. Check for queues in other schedulers.

bdoyle0182 · Jul 18 '22 20:07

Let me look into it. But I see some issues and questions at first glance.

I found the default lease timeout is 1 second: https://github.com/apache/openwhisk/blob/master/ansible/group_vars/all#L467. Since an intermittent network rupture can happen at any time, this should be bigger than 1s; such a short timeout can easily break the system during brief network unavailability, and we are using 10s in our downstream. I think it is better to update the default.
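
For illustration, here is a minimal standalone jetcd sketch, not our actual etcd client code (the endpoint URL and object name are made up), of what this lease TTL governs: any key attached to the lease, such as a queue endpoint, survives only while keep-alives keep arriving within the TTL.

    import io.etcd.jetcd.Client
    import io.etcd.jetcd.lease.LeaseKeepAliveResponse
    import io.grpc.stub.StreamObserver

    object LeaseTtlSketch extends App {
      val client = Client.builder().endpoints("http://127.0.0.1:2379").build()
      val lease  = client.getLeaseClient

      // Grant a lease with a 10-second TTL instead of the 1-second default discussed above.
      val leaseId = lease.grant(10L).get().getID

      // Keys written under this lease (e.g. a queue endpoint) stay alive only as long
      // as keep-alives arrive within the TTL; a network blip longer than the TTL
      // expires the lease and etcd deletes every key attached to it.
      lease.keepAlive(leaseId, new StreamObserver[LeaseKeepAliveResponse] {
        override def onNext(resp: LeaseKeepAliveResponse): Unit = ()
        override def onError(t: Throwable): Unit = println(s"keep-alive failed: ${t.getMessage}")
        override def onCompleted(): Unit = ()
      })
    }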

Regarding the error "The activation has not been processed", the queue manager is supposed to retry until it fetches the endpoint: up to 13 times with exponential backoff starting at 1ms, so the total wait time would be around 8 seconds (1ms + 2ms + 4ms + ... + 4096ms). Was there such a log?
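
For reference, the retry pattern looks roughly like the following generic sketch, not the actual QueueManager code (retryWithBackoff is a made-up name): 13 delayed retries doubling from 1ms give a worst-case total wait of 1 + 2 + 4 + ... + 4096 = 8191 ms, i.e. roughly 8 seconds.

    import akka.actor.ActorSystem
    import akka.pattern.after
    import scala.concurrent.duration._
    import scala.concurrent.{ExecutionContext, Future}

    object RetrySketch {
      // Retry a failed Future up to `retries` times, doubling the delay each attempt.
      def retryWithBackoff[T](attempt: () => Future[T],
                              retries: Int = 13,
                              delay: FiniteDuration = 1.millisecond)(
        implicit system: ActorSystem, ec: ExecutionContext): Future[T] =
        attempt().recoverWith {
          case _ if retries > 0 =>
            after(delay, system.scheduler)(retryWithBackoff(attempt, retries - 1, delay * 2))
        }
    }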

Also, when an endpoint is removed while a queue actually still exists, the system is supposed to keep restoring the etcd data until it is explicitly asked to delete it.

Regarding the error "No scheduler endpoint available", it conforms to the existing behavior of the ShardingContainerPoolBalancer: no retry is performed when there is any issue with Kafka. https://github.com/apache/openwhisk/blob/master/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/CommonLoadBalancer.scala#L210 But I feel it would be better to add a retry mechanism here too.

In general, there was a bug in the code and we recently fixed it with the following: https://github.com/apache/openwhisk/pull/5251. We need to see whether it could cause any regression.

@ningyougang @jiangpengcheng Do you have any idea?

style95 · Jul 19 '22 01:07

  • Regarding "But I feel it would be better to add a retry mechanism here too": I agree.
  • I will do more testing for this case (make etcd unavailable for several seconds while invoking 10 actions). @bdoyle0182, can you share your local etcd lease timeout value, and which upstream commit you are deploying? I will test based on your code and current configuration.

ningyougang · Jul 19 '22 01:07

My local etcd lease timeout is 1 second. I'm updating it to 10 seconds today, matching what your downstream uses.

Regarding the queue manager retrying 13 times with exponential backoff to get the queue lease from etcd: the issue is not failing to fetch the lease, it's that the lease no longer exists at all, so there is no longer a queue for this action.

Order of events:

  1. A couple-second network blip with etcd results in leases timing out.
  2. Etcd connectivity recovers.
  3. However, the lease for the queue has already expired by this point, so the queue is removed.
  4. The controller is supposed to be notified via WatchEndpointRemoved for the respective lease, so that it knows it needs to send a new create-queue request to a scheduler endpoint. For whatever reason this doesn't happen, and the controller still thinks the queue is on the old scheduler host.
  5. The controller keeps sending activations to that scheduler, which no longer has a queue for this action. This falls back to remote resolution, so the scheduler tries to look up the lease in etcd to find the correct scheduler to forward the activation to.
  6. It then doesn't know what to do and fails the activation with "The activation has not been processed", since there is no lease available for the queue.
  7. The controller keeps sending activations to this scheduler forever, hitting this case and failing every activation for this action unless the hosts are restarted.

Hopefully increasing the lease timeout helps so that these short network issues do not result in losing leases. However, that doesn't fix the underlying problem that in very specific edge cases the system cannot gracefully recover when a lease is inevitably lost and needs to be recreated.

At least for this specific case I have two suggestions:

  1. Could the controller cache the queue lease and periodically check that it still exists, and if not, trigger creation of a new queue? That way recovery wouldn't be immediate, but the system could at least recover after some time without a restart. Though I don't know whether there's a case where that could leave you with duplicate queues for an action.
  2. To recover immediately: when this case is hit and the lease no longer exists for the action queue, instead of failing the activation, could we recreate the queue locally on the scheduler the activation was sent to, so that it recovers rather than failing that activation and every subsequent one (see the sketch below)? Or is this also a dangerous suggestion?
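
For the second suggestion, I'm imagining something roughly like this; every type and function name here is hypothetical and only stands in for the real QueueManager pieces:

    import scala.concurrent.{ExecutionContext, Future}

    object RecoverQueueSketch {
      case class ActivationMessage(activationId: String, action: String)
      case class QueueEndpoint(host: String, port: Int)

      def handleMissingQueue(msg: ActivationMessage,
                             lookupEndpoint: String => Future[Option[QueueEndpoint]],
                             forward: (ActivationMessage, QueueEndpoint) => Future[Unit],
                             recreateLocalQueue: ActivationMessage => Future[Unit],
                             drop: (ActivationMessage, String) => Future[Unit])(
        implicit ec: ExecutionContext): Future[Unit] =
        lookupEndpoint(msg.action).flatMap {
          case Some(endpoint) =>
            // Another scheduler still owns the queue: forward the activation as today.
            forward(msg, endpoint)
          case None =>
            // No endpoint anywhere: the lease was lost. Recreate the queue on this
            // scheduler (re-registering its endpoint in etcd) instead of dropping
            // this activation and every one that follows it.
            recreateLocalQueue(msg)
        }.recoverWith {
          case t => drop(msg, s"failed to recover the queue: ${t.getMessage}")
        }
    }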

bdoyle0182 · Jul 19 '22 19:07

One thing I don't get is that when a lease is removed for some reason, the system is supposed to recover (reissue) the lease and all data. Since network ruptures can happen at any time, the system should be resilient to them; when I designed the new scheduler, that was the main requirement and we tested it many times. But it seems the data was not properly recovered in your case, and that's my question.

Regarding caching the scheduler endpoint, I think that would be a good improvement. Currently, if no scheduler endpoint is found, the controller just fails activations. But we could make controllers send activations and queue-creation requests to the scheduler side anyway (using cached endpoints as long as they are reachable), and let the schedulers handle the rest.
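
Roughly along these lines, as a sketch only; the names are hypothetical, not the actual controller API:

    import scala.collection.concurrent.TrieMap
    import scala.concurrent.{ExecutionContext, Future}

    object CachedEndpointSketch {
      case class SchedulerEndpoint(host: String, rpcPort: Int)

      // Remember the last scheduler endpoint seen for each action and fall back to it
      // when etcd briefly reports no endpoint at all.
      class CachedSchedulerResolver(lookup: String => Future[Option[SchedulerEndpoint]])(
        implicit ec: ExecutionContext) {

        private val lastKnown = TrieMap.empty[String, SchedulerEndpoint]

        def resolve(action: String): Future[Option[SchedulerEndpoint]] =
          lookup(action).map {
            case Some(endpoint) =>
              lastKnown.update(action, endpoint) // refresh the cache on every success
              Some(endpoint)
            case None =>
              // etcd has no endpoint (e.g. the lease expired during a blip): fall back
              // to the cached one so the scheduler side can create or recover the queue,
              // instead of failing the activation in the controller.
              lastKnown.get(action)
          }
      }
    }

The cache would of course need to be invalidated when the scheduler itself goes away (e.g. on WatchEndpointRemoved), so controllers don't keep forwarding to a dead host.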

One thing I can do for now is to update the default configurations.

style95 · Aug 01 '22 22:08

For what it's worth, I've yet to see this issue again since I raised the etcd lease timeout from 1 second to 10 seconds, and it's been a couple of weeks. My theory is that 1 second was too short and the system hit a cyclical race condition: it would ask for a new lease and write it, but by the time it tried to write again the lease had already expired, so it got stuck in that loop. That's just a theory, I don't have proof in the code; but if you want to look at the code with this in mind, raising the timeout does seem to have fixed it.

bdoyle0182 · Aug 02 '22 00:08