opcua-asyncio

How to properly handle connection issues and reconnecting? (request for comments)

Open cerlestes opened this issue 8 months ago • 9 comments

Hello everyone!

First off, thanks a lot to every contributor of this repository; it's a great library that has helped us out tremendously in multiple projects. Secondly, I hope it's okay that I'm using an issue to open a discussion. I'd like to gather some insights from people who are more knowledgeable than I am about OPC-UA and this library, hoping that I'll be able to contribute a well-rounded feature out of this discussion in the future.

The topic is handling connection issues and reconnecting properly. Right now, whenever our application loses the connection to the OPC-UA server, for example because the PLC config changed and it's reloading the server, we reconnect the client from our application the next time an interaction with a node fails (we catch the UaError and simply retry connecting a few times). This was fine until subscriptions came into play. With subscriptions, I'm having a hard time finding the proper way to detect issues, reconnect, and restart the subscriptions.
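As a rough sketch of that retry-on-error pattern (not the actual application code): the `UaError` class below is a stand-in so the snippet runs without the library, and `call_with_reconnect`, `max_attempts`, and `delay` are hypothetical names. It assumes the client exposes an awaitable `connect()`.

```python
import asyncio


class UaError(Exception):
    """Stand-in for the library's UaError so this sketch runs on its own."""


async def call_with_reconnect(client, operation, max_attempts=3, delay=0.0):
    """Run `operation(client)`; on UaError, reconnect and retry.

    Gives up and re-raises after `max_attempts` failed attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation(client)
        except UaError:
            if attempt == max_attempts:
                raise  # out of attempts, let the caller handle it
            await asyncio.sleep(delay)
            await client.connect()  # re-establish the connection before retrying
```

This works for request/response style calls, but nothing in it notices a dead connection while the application is only waiting for subscription notifications, which is the gap described above.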

I've found the Client._monitor_server_loop() method, which is started as a task and stored in Client._monitor_server_task. Once the connection dies, it informs the subscriptions of the BadShutdown. This seems to be about the only way to be informed about a connection issue, other than emulating that behaviour outside the client by polling and catching errors when they are raised. Another way of detecting connection issues is the Client.check_connection() method. But again, this method must be polled by the application, from outside the client.
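Polling from the application could look roughly like the sketch below. It assumes `check_connection()` is awaitable and raises when the connection is broken; `watch_connection`, `on_lost`, and `interval` are hypothetical names, not part of the library.

```python
import asyncio


async def watch_connection(client, on_lost, interval=1.0):
    """Poll client.check_connection() until it raises, then report the error.

    `on_lost` is an async callback that receives the exception.
    """
    while True:
        try:
            await client.check_connection()
        except Exception as exc:  # kept broad here; real code would catch specific errors
            await on_lost(exc)
            return
        await asyncio.sleep(interval)
```

An application would start this with `asyncio.create_task(watch_connection(client, handler))` after connecting, which is exactly the kind of external bookkeeping the client could take over.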

I think ideally the client itself should provide a mechanism that lets applications react to connection issues and connection states in general, e.g. a callback when the client loses the connection. On top of that, it should implement an optional reconnect mechanism that, when enabled, automatically attempts to reconnect upon losing the connection, including restoring any subscriptions.

My current proposal would be the following:

  • Add three asyncio.Event instances Client.connected, Client.disconnected, Client.failed. These events are set() when the respective connection state is reached and clear()-ed when the respective state is left. This would allow application code to simply await client.connected.wait() before each interaction with the client. It would also allow running error handler tasks once the connection fails, with await client.failed.wait().
  • Maybe add a set of methods Client.add_connected_callback(), Client.add_disconnected_callback(), Client.add_failed_callback() to register callback functions which are called once the respective state is reached.
  • Add a new optional parameter to Client() which could be as simple as auto_reconnect: bool = False.
  • Whenever auto_reconnect is enabled, an additional task Client._auto_reconnect_task will be created by the client upon connecting, which continuously calls Client.check_connection(), similar to how Client._monitor_server_loop() works, and in case of an error automatically tries connecting the client again.
  • Probably a bit more configuration is required for that feature, so maybe add a dataclass AutoReconnectSettings. The following settings come to mind:
    • How often to try reconnecting before giving up
    • How long to wait in between connection attempts (maybe with exponential backoff?)
    • Whether or not to restart the subscriptions after reestablishing the connection
  • Maybe even allow deeper customization by pulling the reconnection logic into its own class ClientReconnectHandler, which would implement a simple strategy pattern to allow interchangeable reconnection mechanisms, providing an ExponentialBackoffReconnectHandler by default. The parameter could then have the signature auto_reconnect: bool | ClientReconnectHandler = False, applying a default handler with default values when simply set to True.
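To make the proposal more concrete, here is a minimal standalone sketch of the state events and the backoff settings. None of this exists in the library: the field names of `AutoReconnectSettings`, the `ConnectionState` class, and the `reconnect()` helper are all hypothetical, and treating the three states as mutually exclusive is an assumption.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AutoReconnectSettings:
    """Hypothetical settings; field names and defaults are illustrative only."""
    max_attempts: int = 5
    initial_delay: float = 0.5
    backoff_factor: float = 2.0
    max_delay: float = 30.0
    restore_subscriptions: bool = True  # not used in this sketch

    def delays(self):
        """Yield the wait before each attempt, growing exponentially up to max_delay."""
        delay = self.initial_delay
        for _ in range(self.max_attempts):
            yield min(delay, self.max_delay)
            delay *= self.backoff_factor


class ConnectionState:
    """Three asyncio.Event flags; exactly one is set at a time (an assumption)."""

    def __init__(self):
        self.connected = asyncio.Event()
        self.disconnected = asyncio.Event()
        self.failed = asyncio.Event()
        self.disconnected.set()

    def enter(self, event):
        """Clear all flags, then set the one for the state being entered."""
        for flag in (self.connected, self.disconnected, self.failed):
            flag.clear()
        event.set()


async def reconnect(connect, state, settings, sleep=asyncio.sleep):
    """Call `connect()` with backoff; end in `connected` on success, `failed` otherwise."""
    state.enter(state.disconnected)
    for delay in settings.delays():
        await sleep(delay)
        try:
            await connect()
        except Exception:  # kept broad here; real code would catch specific errors
            continue
        state.enter(state.connected)
        return True
    state.enter(state.failed)
    return False
```

Application code could then gate each interaction on `await state.connected.wait()` and run an error handler task on `await state.failed.wait()`. Restoring subscriptions and the pluggable handler classes are left out here, since how subscriptions should be recreated is the open question.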

I'd love to hear what you guys think about this and how you would approach it. Maybe someone has already implemented a similar reconnect mechanism and would like to share their thoughts; I'd greatly appreciate that.

Thanks a lot!

cerlestes, Jun 18 '24 18:06