node-red-contrib-opcua

OPC UA node not re-establishing connection after error

Open ErosBaixauli opened this issue 1 year ago • 13 comments

Hello there,

We have a Node-RED deployment that operates and monitors production on several machines via OPC UA. image

Recently, after a power and network outage on site, we realized that the OPC UA connection nodes were stuck with the following error:

image

A quick flow reset solved the issue and everything started working again, but we need this to happen automatically. These machines can be shut down from time to time, so I don't want to restart the Node-RED Docker container whenever the PLC doesn't answer requests.

So, I have some questions:

  • Is there a way to manually force the reconnection without restarting the whole flow?
  • Can we somehow catch the error message from the node to tell the causes apart? I saw it in the Node-RED logs but I wasn't able to catch it in code. Maybe with the event node?
  • Is there documentation on how the client connect/reconnect action works? I couldn't find any.

Thanks in advance for the help!

ErosBaixauli avatar Jul 06 '23 09:07 ErosBaixauli

  1. There is an example in the examples folder, in the file OPCUA-TEST-NODES.json, that shows how to inject disconnect / reconnect
  2. Catch would be optimal, but I cannot remember whether the client node reports the error in a way that can be caught => TODO/FIX
  3. See the first answer

mikakaraila avatar Jul 06 '23 13:07 mikakaraila

Thanks for the fast reply.

Yes, after I posted the issue I found the examples and learned about msg.action="reconnect".
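For readers who have not opened the examples yet, here is a minimal TypeScript sketch of such a control message. Only the `action` property and its `"reconnect"` value come from this thread; the `ControlMsg` type and the `makeReconnectMsg` helper are hypothetical names used for illustration. In a function node the equivalent would simply be setting `msg.action` and returning `msg`.

```typescript
// Hypothetical shape of a control message sent to the OPC UA client node.
// Only the "action" property is taken from the thread.
interface ControlMsg {
  action: string;
  [key: string]: unknown;
}

// Copy an incoming message and mark it as a reconnect request.
function makeReconnectMsg(base: Record<string, unknown> = {}): ControlMsg {
  return { ...base, action: "reconnect" };
}

// Example: keep the original topic and add the action.
const reconnectMsg = makeReconnectMsg({ topic: "" });
```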

I just set up a test to simulate the customer's issue and the reconnection option, in which the connection is down for a long time and then the PLC recovers. For a while, the node itself tries to connect, throwing this error:

image

Meanwhile, I was sending a reconnect attempt by message every 30s, simulating an automatic system trying to recover.

It worked fine and the nodes connected successfully when we reconnected the PLC, but they appear like this:

image

It stays in a loop, changing from "connected re-established" to "reconnecting..." and back to "connected re-established".

I know that asking to connect in a loop isn't a good solution, so I'll try to solve it and report back later with the solution. Any tips would be appreciated.

Thanks!

ErosBaixauli avatar Jul 06 '23 15:07 ErosBaixauli

Has @erossignon commented this one?

mikakaraila avatar Aug 01 '23 11:08 mikakaraila

I just came across this issue myself. This behaviour appears to have changed between versions: in older versions it would reconnect by itself.

RedShift1 avatar Aug 09 '23 14:08 RedShift1

I'm in the same boat where the equipment gets turned off at night. I created a separate node with "Action" set to "RE-CONNECT" with the same endpoint I'm using to read the data from:

image

And then connected it to a cron job node which emits an empty message every 5 seconds:

image

Not sure if it's the right way to do it but it has worked 2 days in a row now.

RedShift1 avatar Aug 11 '23 12:08 RedShift1

NOTE: the reconnect action should be used only after you have first disconnected your client. It can break node-opcua's normal built-in reconnect functionality. The root cause of the connection break should be investigated. It is not normal for communication to break. There must be something in the environment (server or client).

mikakaraila avatar Aug 11 '23 13:08 mikakaraila

Hello everyone,

Thanks for getting involved in the issue, it's nice to see that the creators are still improving their baby haha

I managed to work around the issue with a connection control flow: when the system stops getting data from the production connection nodes, a timeout branch disconnects all of them and triggers a separate connection node (which receives a "disconnect" order at startup) that starts trying to connect periodically (every 15 to 30 minutes or so). When that node connects, all the production nodes receive the order to connect and resume operations. If that reconnection works, the connection control node disconnects until the next time it's needed.

In some cases, I found the production nodes stuck with a dead connection, unable to reconnect no matter what until the flow is reset. I handle that case as a false positive: if the connection hasn't recovered after 3 attempts but the control connection node is correctly connected, the flow executes a command to restart its own Docker container. As the flow doesn't depend on local variables to operate, that has worked fine so far.
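The decision logic described above could be sketched roughly as follows. This is not the actual flow (which was not shared); the `ControlState` fields, the `Decision` labels and the `decide` function are hypothetical names, and only the threshold of 3 attempts comes from the description.

```typescript
// Hypothetical labels for what the control flow does next.
type Decision =
  | "resume"               // production nodes deliver data again
  | "keep-probing"         // PLC still unreachable, control node keeps trying
  | "reconnect-production" // PLC reachable, order production nodes to connect
  | "restart-container";   // PLC reachable but production nodes stay dead

interface ControlState {
  controlConnected: boolean;    // did the separate control node reach the PLC?
  productionRecovered: boolean; // are the production nodes delivering data?
  failedAttempts: number;       // reconnect orders sent without success
}

// Pick the next step; maxAttempts defaults to the 3 attempts mentioned above.
function decide(state: ControlState, maxAttempts = 3): Decision {
  if (state.productionRecovered) return "resume";
  if (!state.controlConnected) return "keep-probing";
  if (state.failedAttempts >= maxAttempts) return "restart-container";
  return "reconnect-production";
}
```

Keeping the decision in one pure function makes the escalation rule easy to test outside Node-RED before wiring it to an exec or HTTP node.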

I hope this idea helps anyone struggling with this issue. I'd be glad to hear any suggestions or opinions on how to improve it.

If someone wants the flow JSON, I can prepare a sample of the connection flow. I can't share the whole JSON, as it contains sensitive data from the customer.

ErosBaixauli avatar Aug 11 '23 15:08 ErosBaixauli

> In some cases, I found the production nodes stuck with a dead connection, unable to reconnect no matter what until the flow is reset. I handle that case as a false positive: if the connection hasn't recovered after 3 attempts but the control connection node is correctly connected, the flow executes a command to restart its own Docker container. As the flow doesn't depend on local variables to operate, that has worked fine so far.

That seems like a very sledgehammer approach... Which we can't use because we have numerous other flows happening which would be interrupted by having to restart node-red...

RedShift1 avatar Aug 12 '23 11:08 RedShift1

FYI: I have been refactoring the code to TypeScript, and Etienne is building a client2 node that will use just one client and session. Work is in progress...

mikakaraila avatar Aug 12 '23 11:08 mikakaraila

> NOTE: the reconnect action should be used only after you have first disconnected your client. It can break node-opcua's normal built-in reconnect functionality. The root cause of the connection break should be investigated. It is not normal for communication to break. There must be something in the environment (server or client).

On the client side, this issue started occurring after updating to version 0.2.310 (sorry, I did not keep track of the version that was installed before). On the server side (the PLC we're reading OPC UA data from), no changes were made.

The problem here is that the disconnect is "abnormal" to begin with, but we don't know in advance when it will happen. The machine can get turned off in the evening, or during the day for maintenance, etc. So you can't schedule a reconnect action in advance; the client has to be able to reconnect on its own after failures so that it remains reliable.

RedShift1 avatar Aug 12 '23 11:08 RedShift1

I hope I can help with some insights we gathered over the last few months of intensive OPC UA testing (different machines with daily downtimes, e.g. after production):

  • First off, I want to say that a disconnect between a client and a server, whether caused by downtime (power off) or by a network problem, isn't abnormal. It's common during, or rather after, production, especially at bigger sites where the network is not only managed centrally.
  • Secondly, regarding your workaround, @ErosBaixauli: I honestly didn't fully follow your disconnect/reconnect method, but it seems quite heavy, as already stated here. As inspiration, here is our workaround for a machine whose software/hardware is too old to update to a newer OPC UA server version, so the old server does not support reconnecting. For this machine we needed to reinitialize the connection/subscription via the disconnect/reconnect mentioned earlier. As the trigger condition, we subscribe to the standardized server time tag (ns=0;i=2258) and check whether we get new values. If the time doesn't change for a while (we save the time in dedicated variables), we conclude that the connection was lost and trigger the reconnect workflow in the following order: disconnect, connect, deletesubscription, subscription (inject the needed topics again). Special remark: deletesubscription needs the specific OPC UA tags as topics to work. I don't know if this is intended, because I thought that unsubscribe needs specific topics and deletesubscription works as a general OPC UA action that unsubscribes everything. Is that correct, @mikakaraila, or what exactly is the difference between unsubscribe and deletesubscription? I also want to mention that with disconnect/connect, disconnect/reconnect, or just a reconnect, no new data gets retrieved. If I send a reconnect, it's stuck in "reconnect" (as already described). I've just retested the workaround with the new version, which was prompted by https://github.com/mikakaraila/node-red-contrib-opcua/issues/599, and I could just use reconnect + topic.inject to retrieve data again, with no need for disconnect + connect + unsubscribe + topic.inject. Maybe you can slim down your way of handling the problem like that.
  • A question regarding reconnect plus re-injection of the topics: based on the logs (as seen in the picture), the old session was terminated and a new one was created. Is it intended that we need to re-inject the topics that we initially subscribed to? image
  • @ErosBaixauli, you also have to check the version of the OPC UA server in use, because we have some machines that are just too old to handle the reconnect. We tested this whole topic with different OPC UA clients (direct implementations with a self-built client in Node.js and Go). With a newer server everything worked except the node-red-opcua solution, hence my previously mentioned GitHub issue, which is now solved by the new update. The old OPC UA server couldn't be reconnected to automatically, hence the reconnect workaround.
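The server-time watchdog from the second point could look roughly like the TypeScript sketch below. The node id `ns=0;i=2258` and the order of the actions come from the description above; everything else is an assumption, including the `WatchdogState` shape, the function names, and the `"subscribe"` action string (the thread only says "subscription"). Whether each step is sent as `msg.action` or configured on the node also depends on your flow.

```typescript
// Standard server time node id mentioned in the workaround.
const SERVER_TIME_NODE_ID = "ns=0;i=2258";

interface WatchdogState {
  lastValue: string | null; // last server time value received
  lastChangeMs: number;     // local clock (ms) when the value last changed
}

// Record a new server time value; the timestamp only moves when the value changes.
function observe(state: WatchdogState, value: string, nowMs: number): WatchdogState {
  return value === state.lastValue ? state : { lastValue: value, lastChangeMs: nowMs };
}

// The connection is considered lost if the value has not changed for timeoutMs.
function isStale(state: WatchdogState, nowMs: number, timeoutMs: number): boolean {
  return nowMs - state.lastChangeMs >= timeoutMs;
}

interface ActionMsg {
  action: string;
  topic?: string;
}

// Recovery order from the thread: disconnect, connect, deletesubscription, then
// subscribe again. "subscribe" is an assumed action name.
function recoverySequence(topics: string[]): ActionMsg[] {
  return [
    { action: "disconnect" },
    { action: "connect" },
    ...topics.map((topic) => ({ action: "deletesubscription", topic })),
    ...topics.map((topic) => ({ action: "subscribe", topic })),
  ];
}
```

With the newer version, the same post reports that a reconnect followed by re-injecting the topics was enough, so the sequence could be shortened accordingly.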

I hope I described everything understandably (more or less). Feel free to ask questions if anything is unclear.

fanbel avatar Sep 08 '23 14:09 fanbel

The other bug report with this issue was closed but the problem's not fixed yet... I downgraded to 0.2.292 and it still fails to reconnect after any kind of failure... I'm sure this worked perfectly in the past but now it doesn't anymore even when downgrading...

RedShift1 avatar Oct 18 '23 16:10 RedShift1

OK, I went full sledgehammer on this one... I started a separate Node-RED instance and used the Node-RED HTTP API to reload the flows once it detects that it hasn't been able to read data for 1 minute. Using the HTTP node you can reload the flows like this:

image

Change URL to your Node-RED's address (mine's running in Docker so localhost:1880 works). If your Node-RED uses authentication you'll need to add another HTTP header, see the HTTP API docs.
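Since the screenshot is not reproduced here, the following TypeScript sketch shows what such a reload request might look like. It assumes the Admin API's `POST /flows` endpoint with the `Node-RED-Deployment-Type: reload` header and, for secured instances, a bearer token in the `Authorization` header; the `buildReloadRequest` helper and `ReloadRequest` type are hypothetical names, and the exact settings in the original HTTP node may differ.

```typescript
// Hypothetical description of the HTTP call that reloads the flows.
interface ReloadRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the request; accessToken is only needed if the admin API is secured.
function buildReloadRequest(baseUrl: string, accessToken?: string): ReloadRequest {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
    // Assumed Admin API header asking Node-RED to reload flows from storage.
    "Node-RED-Deployment-Type": "reload",
  };
  if (accessToken) {
    headers["Authorization"] = `Bearer ${accessToken}`;
  }
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/flows`, // strip trailing slashes
    method: "POST",
    headers,
    body: "{}", // placeholder body; a reload reads the flows from storage
  };
}

// Example for a local Docker instance, as in the post above.
const reloadRequest = buildReloadRequest("http://localhost:1880");
```

The resulting object can be passed to any HTTP client, for example `fetch(reloadRequest.url, reloadRequest)`, or used as a reference when filling in the HTTP request node.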

RedShift1 avatar Oct 20 '23 06:10 RedShift1