milo icon indicating copy to clipboard operation
milo copied to clipboard

Execution of SessionFsm transition actions can create a deadlock

Open persicsb opened this issue 1 year ago • 2 comments

Describe the bug SessionFsm transition actions are not properly executed in order, and it can cause a deadlock.

The main use case is resubscribing subscriptions, that were failed to transfer to a new session. There are 2 main points on the client side, when we are notified to do it:

  1. SessionActivityListener.onSessionActive - it is assumed that we are safe to work with anything session related, lik subscribing to nodes
  2. ManagedSubscription.StatusListener.onSubscriptionTransferFailed - it is assumed here, that everything can be done with the SubscriptionManager here.

However, this is not the case, we can be in a deadlock easily.

The main problem is, that onSessionActive and onSubscriptionTransferFailed callbacks can run before the session's SessionFuture (with the key KEY_SESSION_FUTURE) is properly resolved.

The reason is, that the SessionFsm is set up in SessionFsmFactory so that the session's future is only resolved when the InitializeSuccess event is called between the Intializing and Active state transition, see https://github.com/eclipse/milo/blob/master/opc-ua-sdk/sdk-client/src/main/java/org/eclipse/milo/opcua/sdk/client/session/SessionFsmFactory.java#L513

It resolves the session's future via the executorservice, and not directly, see line 565.

However, the onSessionActive callbacks are called directly, when the state changes to Active from any other state, in https://github.com/eclipse/milo/blob/master/opc-ua-sdk/sdk-client/src/main/java/org/eclipse/milo/opcua/sdk/client/session/SessionFsmFactory.java#L587

Since these callse are not made through the client's executorservice, they can happen-before the sessionFuture is resolved.

So there is no guarantee, that even after onSessionActive is called, anything can be done, high-level APIs like ManagedSubscription.create() will hang indefinitely, as they are in a deadlock : they are waiting for the future's resolution, but that is scheduled later.

I think, the issue is the same as https://github.com/eclipse/milo/issues/614

In that issue, you mentioned that:

I'll start looking at whether I can make these callbacks safer. In the meantime, when you're executing inside a callback, whether from the subscription manager or just a CompletableFuture for a response, just make sure not to block waiting for another response on the same thread.

What is the preferred way to solve this issue? When and how can ManagedSubscription.create() (and any reentrant calls to getSession()) safely called?

Also, it needs to be documented properly in the Javadocs, and perhaps in the milo-examples code.

Expected behavior SessionFsm transition actions have a well-defined order.

Additional context Milo version 0.6.14, but affects all others.

persicsb avatar Aug 26 '24 07:08 persicsb

This is still a problem that I need to figure out how to solve nicely.

I don't remember if I ever looked into moving the onSessionActive and onSessionInactive callbacks to the client executor, that would certainly be an easy fix for that specific scenario if it's safe.

In our client implementation using Milo we don't use any callbacks to determine when to start creating subscriptions. You "know" it's safe to create a Subscription if your call to OpcUaClient::connect or OpcUaClient::getSession succeeds.

kevinherron avatar Aug 26 '24 13:08 kevinherron

But getSession() may behave as it never returns (that is the deadlock).

We are using Siemens S7-1500 PLCs, that do not implement the subscription transfer service, so onSubscriptionTransferFailed is called for every subscription to those PLCs.

However, sometimes (as it really depends on the scheduling of the transition actions, and not deterministic) getSession() remains stuck, and its future will never be resolved.

It can happen, that the connection and reconnection/resubscriptions works for months, and after that, it deadlocks every day. It is really not deterministic.

The real problem is, that a possible workaround might be to create a new client to the PLCs with a fresh connection. But it would be nice to close the old connection, since the PLCs have a limited amount of resource to serve the OPC UA clients. But if getSession() deadlock, disconnect() deadlocks as well, because internally it calls sessionFsm.closeSession(), that won't execute properly.

This is a really annoying issue, which makes Milo quite unusable in environments, where PLCs without subscription transfer restart often.

persicsb avatar Sep 03 '24 06:09 persicsb

Thank you, Kevin, this is wonderful. Can Milo users expect a new 0.6.15 release or shall we wait until 1.0?

persicsb avatar Nov 29 '24 14:11 persicsb

@persicsb I will cut a 0.6.15 release within a week, there are a few other small fixes that have gone into 0.6.x in addition to this one.

kevinherron avatar Nov 29 '24 14:11 kevinherron

@persicsb 0.6.15 should be available in Maven Central since yesterday.

kevinherron avatar Dec 04 '24 14:12 kevinherron