Industrial-IoT

Publisher stops respecting reconnect period after initial successful connection

Open NoTuxNoBux opened this issue 1 year ago • 1 comments

Describe the bug

If OPC Publisher 2.9.4 cannot connect to a certain endpoint, it tries to reconnect every second. reconnectperiod is set to 1000 ms, so this makes sense. I raised it to 10 seconds using --reconnectperiod=10000, and that works for the initial connection attempts. However, after the server comes online and later disappears again, the publisher gets stuck on errors sending publish requests and prints them every second rather than every 10 seconds. It does not seem to treat the connection as fully 'dropped' and never goes back to retrying every 10 seconds, not even after 15 minutes or more of these errors.

To Reproduce

Steps to reproduce the behavior:

  1. Pass --reconnectperiod=10000.
  2. Configure an OPC UA server that isn't available.
  3. Notice connections being attempted every 10 seconds as expected.
  4. Bring the server online on the specified address.
  5. Wait for the publisher to connect properly.
  6. Bring the server offline or make it unreachable.
  7. Notice errors being printed every second about not being able to publish.

Expected behavior

The publisher disconnects properly, possibly after a grace period, and goes back to retrying every 10 seconds.

Additional context

The error being spammed every second after this happens is the following:

[24-02-12 14:00:48.4247] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaStack[0]
      Unexpected error sending publish request.
      Opc.Ua.ServiceResultException: BadConnectionClosed
         at Opc.Ua.Bindings.UaSCUaBinaryClientChannel.BeginSendRequest(IServiceRequest request, Int32 timeout, AsyncCallback callback, Object state)
         at Opc.Ua.Client.Session.BeginPublish(Int32 timeout)
[24-02-12 14:00:48.9734] info: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaSubscription[0]
      Subscription <<UnknownDataSet>>_($e1822db470e60d090affd0956d743cb0e7cdf113):439048616 STOPPED!
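
To quantify how often the error repeats, the gaps between consecutive error lines can be measured from the log. Below is a minimal sketch in Python, assuming the timestamp layout shown above (`[yy-MM-dd HH:mm:ss.ffff] level:`). The function name `error_intervals` and the sample lines are illustrative only and are not part of OPC Publisher.

```python
import re
from datetime import datetime

# Matches the leading "[yy-MM-dd HH:mm:ss.ffff] level:" prefix seen in the log above.
LINE_RE = re.compile(r"^\[(\d{2}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] (\w+):")


def error_intervals(lines, level="fail"):
    """Return the gaps in seconds between consecutive log lines of the given level."""
    stamps = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == level:
            # %f accepts one to six fractional digits, so ".4247" parses fine.
            stamps.append(datetime.strptime(match.group(1), "%y-%m-%d %H:%M:%S.%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Hypothetical sample: two error lines; continuation lines without a timestamp are skipped.
sample = [
    "[24-02-12 14:00:48.4247] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaStack[0]",
    "      Unexpected error sending publish request.",
    "[24-02-12 14:00:49.4247] fail: Azure.IIoT.OpcUa.Publisher.Stack.Services.OpcUaStack[0]",
]
print(error_intervals(sample))
```

With --reconnectperiod=10000, gaps that stay near 1 second rather than 10 seconds would confirm the behavior described in this issue.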

NoTuxNoBux Feb 12 '24 14:02

@mregen

marcschier Feb 14 '24 07:02

We are looking at streamlining the OPC UA stack errors so that they do not look like exceptions; the issue is benign. Otherwise, the reconnect timing should work now. I tested this using the docker-compose deployment with MQTT, and there is now also a log message that shows the timer that tracks the reconnect delays. The numbers are in milliseconds and contain a jitter component.
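
For illustration only, a reconnect delay with a jitter component could be computed as sketched below. This is a generic Python example, not the publisher's actual implementation. The name `reconnect_delay_ms` and the 10 % jitter fraction are assumptions chosen for the example.

```python
import random


def reconnect_delay_ms(period_ms, jitter_fraction=0.1, rng=random):
    """Return the base period plus a random jitter of up to +/- jitter_fraction.

    The jitter fraction is a hypothetical value, not taken from OPC Publisher.
    """
    jitter = rng.uniform(-jitter_fraction, jitter_fraction) * period_ms
    return max(0, round(period_ms + jitter))


# With a 10000 ms period, each delay falls between 9000 and 11000 ms.
print(reconnect_delay_ms(10_000))
```

Under this assumption, logged delays would rarely equal the configured period exactly, which is worth keeping in mind when comparing the log message against --reconnectperiod.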

marcschier Jun 18 '24 16:06