TF6000_ADS_DOTNET_V5_Samples icon indicating copy to clipboard operation
TF6000_ADS_DOTNET_V5_Samples copied to clipboard

ADS server instance stops listening / receiving

Open vvoldsund opened this issue 8 months ago • 1 comments

Environment:

  • TwinCAT System Service: v3.1.0.2437
  • TwinCAT: v3.1.4024.32
  • Beckhoff.TwinCAT.Ads: 6.1.260

Summary: Our .NET application creates multiple ADS server instances using the AdsServer base class. Each instance overrides:

  • OnConnected → logs the connection

  • OnDisconnect → logs and calls base.OnDisconnect()

  • OnWriteAsync → enqueues received data in a thread-safe queue within a try-catch, and always returns AdsErrorCode.NoError.

These ADS servers use the built-in TwinCAT router (installed and running on the same machine).

Problem: We have observed that one specific ADS server instance (always the same one) stops receiving any incoming data after running for a period of time (e.g., about a week at one site). There's no clear pattern, and no related logs or error events (either in our application or in the Windows Event Log) that point to a failure.

Troubleshooting steps taken when the issue occurs:

  • Running "Test-AdsRoute -Port 27001" (targeting the failed server) gives no response.

  • Running "Test-AdsRoute" against the other ports (e.g., 27004) returns INVALID, which is expected (those servers don’t handle that operation).

  • After stopping our application and re-running "Test-AdsRoute -Port 27001", the command responds with TARGET PORT NOT FOUND, confirming the server was not listening before shutdown.

The responses described are the activity observed in the ADS monitor.

Questions:

Are there known scenarios or limitations in the TwinCAT router or AdsServer class that could cause a specific server instance to silently stop listening?

Have there been any bug fixes related to this behaviour after 6.1.260?

Any insight would be appreciated.

vvoldsund avatar May 05 '25 12:05 vvoldsund

I had a short look in the release history ... Unfortunately I cannot find a specific fix for your issue. But we have put a lot of efforts in the versions since 6.1.260 in recovering from error conditions and error logging (mainly in the 6.2.X series) during the last year. This was part of the reworking of the internal communication channels (usage of UnixSockets and AdsOverMqtt as additional features). So it could be worth to try out the 6.2 version. Even if it doesn't fix the specific problem, the deep support of Logging via the ILoggerFactory in the AdsServer constructor will help to analyse the situation. The version 6.1.260 doesn't support the logging.

RalfHeitmann avatar May 09 '25 09:05 RalfHeitmann

I have now reproduced something that looks a lot like the same issue using 6.2.313 with logging set to "Information" for the ADS servers.

Monitoring the server in ADS monitor shows that the server on port 27002 does not reply when something is sent to it. so the same behaviour as the initial report, just another of the server instances we have. This server instance is a under a lot less stress than the previous one that failed. This server instance implements OnReadAsync and based on a heartbeat (a boolean indicating if configuration is needed) every ten seconds it either returns the configuration or just empty data.

Image

The information I have from the logs are (this is all that is logged from the servers after they are started):

2025-05-12 19:59:21 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: TcpHeader invalid. Dump: TcpHeader Flags: 127, DataSize: 16843008"
2025-05-12 19:59:21 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: TcpHeader indicates more than '16843008' bytes of data! Fatal Error."

What can cause this error? Will this error explain a server instance just stopping? How to recover from this error?

I can also note that we never had this issue for years when using the 5.0.x (latest we used was 5.0.528) packages. This is something that started happening after 6.x (6.1.260) upgrade. And the installed TwinCAT version has been 3.1.4024.32 for both of them, so no change there.

This is currently causing a high severity issue for us. Thanks in advance for any help or insight.

vvoldsund avatar May 13 '25 07:05 vvoldsund

I have reproduced the issue again with debug as log level this time. Now it is the more expected port 27001 that fails again as you can see in the attached logs.

When reading the logs, note that I in this version have added monitoring code to detect if a server instance is down. The monitoring will every ten seconds create a client for each of the server instances and call "ReadStateAsync" to get the response "DeviceServiceNotSupported" if servers instance OK. Unfortunately this creates some noise in the logs.

The new information that at least I can extract from the logs is that the fatal error is followed by:

2025-05-14 00:05:13 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "AmsServer disconnected. Port 'LogsAdsServer' (27001) received 134729852 bytes!"

I will also note that we have in our code logic to handle the instance being disconnected, but this callback is not called in this case (it is called on shutdown), so the disconnect is not propagated through the program.

        server.ServerConnectionStateChanged += (sender, args) =>
        {
            this.logger.LogInformation(
                $"Connection state changed: {args.State}, port number: {server.AdsServerConfiguration.PortNumber}, port name: {server.AdsServerConfiguration.PortName}");
            if (args.State == ServerConnectionState.Disconnected && !cancellationToken.IsCancellationRequested)
            {
                _ = Task.Run(async () => await this.ConnectWithRetryAsync(default).ConfigureAwait(false));
            }
        };

log20250514_ads.txt

vvoldsund avatar May 14 '25 06:05 vvoldsund

Reproduced the issue again. This time without the clients checking the state of the server instances creating noise in the log. The issue this time was again on the 27002 heartbeat server instance.

As you can see from the attached log heartbeat is received and responded to every ten seconds until suddently:

2025-05-14 08:53:41 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: TcpHeader invalid. Dump: TcpHeader Flags: 127, DataSize: 16843008"
2025-05-14 08:53:41 +02:00, TwinCAT.Ads.Server.AdsServer            , "onIndicationWriteAsync: Addr:10.80.113.134.1.1:33060, InvokeId:631493, IG:0, IO:8, Len:1d9, Data:5B-7B-22-4C-6F-67-54-79-70-65-22-3A-34-2C-22-44"
2025-05-14 08:53:41 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: TcpHeader indicates more than '16843008' bytes of data! Fatal Error."
2025-05-14 08:53:41 +02:00, TwinCAT.Ads.Server.AdsServer            , "SendResponseAsync (Before): HeaderDump:WriteResHeader(Result:Succeeded),InvokeId:631493"
2025-05-14 08:53:41 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "Finishing AmsCommands receiving scheduler"
2025-05-14 08:53:41 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnSendAsync '42' bytes 10.80.113.134.1.1:27001 --> 10.80.113.134.1.1:33060 (127.0.0.1:53597 --> 127.0.0.1:48898)"
2025-05-14 08:53:41 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "AmsServer disconnected. Port 'ControllerAdsServer' (27002) received 6644 bytes!"

The following heartbeats does not make it into the logs at all.

Analysing the traffic in wireshark does not show any indication of anything wrong. The package causing the failure does not look anything different than the others. As you can see the read request with response is seen every ten seconds, until suddently we have only the requests...

Image

log20250514_ads_heartbeat.txt

27002_fail_beckhoff.zip

vvoldsund avatar May 14 '25 12:05 vvoldsund

An update on the environment.

As we are using an older version of TwinCAT I tried to update to the latest. Unfortunately it made no difference.

Issue is also reproducable in:

TwinCAT System Service: v3.1.0.2441 TwinCAT: v3.1.4024.66 Beckhoff.TwinCAT.Ads: 6.2.313

vvoldsund avatar May 16 '25 05:05 vvoldsund

What can cause this error? Will this error explain a server instance just stopping? How to recover from this error?

Detected by the log is, that unexpected data is on the network stack. The receiver neither can detect valid data on it, nor can it synchronize to correct frames again. So the receiver is broken. I have released a new Package Version 6.2.345 with two relevant changes:

  1. It logs out (as Warning/Error) the last 5 TcpFrames before the problem occurred (probably this could help us to detect what triggered the problem)
  2. When the problem arises, the receiver closes the Tcp connection and establishes a new one, in the hope that the next TcpFrames are valid again.
  3. Several checks in the sending routines of the C# AdsServer implemention to double check the TcpFrames before they are sent out

Hope that will help to narrowing the problem.

I wasn't able yet to reprocuce the issue, so I am limited to the information in this thread. Are you able to create a (simple) bug demo project that I can run on my side? That would help a lot.

RalfHeitmann avatar May 16 '25 10:05 RalfHeitmann

I will update with the new package to see if I still get the issue / more information from the logs.

Regarding the first point: I also have logs from when this failes with "Verbose" setting. Maybe these can provide new information. In these logs it is the heartbeat that fails again. As mentioned earlier the heartbeat is a call to 27002 OnReadAsync that is executed every ten seconds.

Successful heartbeat:

2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceive 6 bytes (AmsConnectResponse_ControllerAdsServer_27002)"
2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: TcpHeader with 6 bytes received"
2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: Receiving AmsCommand header"
2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceive 32 bytes (AmsConnectResponse_ControllerAdsServer_27002)"
 2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceive 12 bytes (AmsConnectResponse_ControllerAdsServer_27002)"
 2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: AmsCommand 10.80.113.180.1.1:26000 --> 10.80.113.180.1.1:27002, Cmd:2 REQ, State:MaskAdsRequest, Err:0, User:10, Bytes:12 received"
2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "Receiving Frame InvokeId:10,Hint:AmsConnectResponse_ControllerAdsServer_27002"
 2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AdsServer            , "AdsServer::OnReceivedAsync(ServerAddress:10.80.113.180.1.1:27002) : Frame: 10.80.113.180.1.1:26000 --> 10.80.113.180.1.1:27002, Cmd:2 REQ, State:MaskAdsRequest, Err:0, User:10, Bytes:12"
 2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AdsServer            , "onIndicationReadAsync: Addr:10.80.113.180.1.1:26000, InvokeId:10, IG:0, IO:1, Len:400"
 2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AdsServer            , "SendResponseAsync (Before): HeaderDump:ReadResHeader(Result:Succeeded,Len:0),InvokeId:10"
2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnSendAsync '46' bytes 10.80.113.180.1.1:27002 --> 10.80.113.180.1.1:26000 (127.0.0.1:61291 --> 127.0.0.1:48898)"
2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceive 6 bytes (AmsConnectResponse_LogsAdsServer_27001)"
2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AdsServer            , "SendResponseAsync (After): HeaderDump:ReadResHeader(Result:Succeeded,Len:0), InvokeId:10, ErrorCode:"Succeeded""
 2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: TcpHeader with 6 bytes received"
 2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: Receiving AmsCommand header"
2025-05-16 11:10:33 +02:00, TwinCAT.Ads.Server.AdsServer            , "Leaving AdsServer::OnReceivedAsync() : Frame: 10.80.113.180.1.1:26000 --> 10.80.113.180.1.1:27002, Cmd:2 REQ, State:MaskAdsRequest, Err:0, User:10, Bytes:12"

Failed heartbeat:

2025-05-16 11:10:43 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceive 6 bytes (AmsConnectResponse_ControllerAdsServer_27002)"
 2025-05-16 11:10:43 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: TcpHeader with 6 bytes received"
2025-05-16 11:10:43 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: TcpHeader invalid. Dump: TcpHeader Flags: 20490, DataSize: 16888945"
2025-05-16 11:10:43 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "OnReceiveAmsCommands: TcpHeader indicates more than '16888945' bytes of data! Fatal Error."
 2025-05-16 11:10:43 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "AmsServerImplBase(ServerAddress:10.80.113.180.1.1:27002,Hint:AmsConnectResponse_ControllerAdsServer_27002: Observer Finished"
2025-05-16 11:10:43 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "Finishing AmsCommands receiving scheduler"
2025-05-16 11:10:43 +02:00, TwinCAT.Ads.Server.AmsServerImplBase    , "AmsServer disconnected. Port 'ControllerAdsServer' (27002) received 46200 bytes!"

The OnReceive will after the first call start to return empty byte array, could it be that it interpret the empty array wrong?

        byte[] data = [];
        return ResultReadBytes.CreateSuccess(data, invokeId);

log20250516_verbose.txt

Regarding the second point: I am looking into detecting if the server instance is broken and then reconnecting. Preferably using the "ServerConnectionStateChanged " but that does not seem to be called. Is this something similar to what you are doing? If not could a server instance restart be considered instead of just silently failing?

vvoldsund avatar May 16 '25 11:05 vvoldsund

The OnReceive will after the first call start to return empty byte array, could it be that it interpret the empty array wrong?

        byte[] data = [];
        return ResultReadBytes.CreateSuccess(data, invokeId);

Where do you see this? Actually I cannot follow you ...

RalfHeitmann avatar May 16 '25 13:05 RalfHeitmann

I am looking into detecting if the server instance is broken and then reconnecting. Preferably using the "ServerConnectionStateChanged " but that does not seem to be called. Is this something similar to what you are doing? If not could a server instance restart be considered instead of just silently failing?

Yes that is exactly where I am working on. In the latest version there should the new value ServerConnectionStateChanged.Reconnected triggered in that situation, what means that the Tcp Connection was reestablished.

RalfHeitmann avatar May 16 '25 13:05 RalfHeitmann

That was a reference to my own code. From the logs it looked like the read command was logged, but the response was not. So just to mention that the return is an empty byte array.

On another note I am struggeling with the 6.2.345 version. My application refuses to stop when using it. I have also tried with 6.2.335 and that seems to have the same issue. 6.2.313 stops just fine. Can you see any reason for this?

Edit: 6.2.319 and 6.2.331 stops as expected.

Edit: 6.2.345 an 6.2.335 actually stops in the end, but now it takes about four minutes instead of seconds. When pausing debugging it is doing something with threadpool and semaphores.

vvoldsund avatar May 16 '25 13:05 vvoldsund

I have now throughout the weekend with 16 virtual machines running 6.2.345 and three running 6.2.313. None of the 6.2.345 failed, so no log message recorded. One of the 6.2.313 failed within a day.

I am not sure if this was the expected outcome, as I assume there will be logging if the recovery actions described above are put in place? Will no logging indicate that no packages are lost / discarded?

Other than the fact that it seems like something has happened after 6.2.313 so that the issue does not happen again. There seems to be issues in 6.2.345 for both startup and shutdown. Shutdown now takes minutes instead of seconds and startup straight after shutdown now can take more than the 30s it get before timing out (previously it used a few seconds).

vvoldsund avatar May 19 '25 05:05 vvoldsund

On another note I am struggeling with the 6.2.345 version. My application refuses to stop when using it. I have also tried with 6.2.335 and that seems to have the same issue. 6.2.313 stops just fine. Can you see any reason for this?

Yes, unfortunately we seem to have introduced an issue with shutting down the AdsServer (the receiving loop remains). How do you close your AdsServer/AdsClient instance and connections? I will address this with a next Update.

I am not sure if this was the expected outcome, as I assume there will be logging if the recovery actions described above are put in place? Will no logging indicate that no packages are lost / discarded?

Yes, indications about defect packages will be written to the warning/error log. The recovery then should be seen on LogLevel.Information. Sucessfully received packages should only be seen on LogLevel.Debug or LogLevel.Trace.

RalfHeitmann avatar May 19 '25 07:05 RalfHeitmann

Yes, unfortunately we seem to have introduced an issue with shutting down the AdsServer (the receiving loop remains). How do you close your AdsServer/AdsClient instance and connections? I will address this with a next Update.

We use server instances in two different ways:

  1. Adding using DI and calling ConnectServer and Disconnect from StartAsync and StopAsync of a background service holding the server. This works fine also in 6.2.345.
  2. Creating the server in ExecuteAsync of a background service and calling ConnectServerAndWaitAsync with a cancellation token. This does not seem to work, it is not cancelled correctly.

Yes, indications about defect packages will be written to the warning/error log. The recovery then should be seen on LogLevel.Information. Sucessfully received packages should only be seen on LogLevel.Debug or LogLevel.Trace.

Do you think the underlying issue is fixed or is it just "luck" that I have not seen it happen after the upgrade?

vvoldsund avatar May 19 '25 10:05 vvoldsund

2. Creating the server in ExecuteAsync of a background service and calling ConnectServerAndWaitAsync with a cancellation token. This does not seem to work, it is not cancelled correctly.

That one is identified and is fixed for the next version.

Do you think the underlying issue is fixed or is it just "luck" that I have not seen it happen after the upgrade?

Thats hard to tell. There is no explicit fix from builds 313 to 345 that directly address such an issue. But instead the AdsServer/AdsClient receiver loop was heavily changed for additional problem logging and problem detection. The last version where you reproduced the issue was the 313?

RalfHeitmann avatar May 20 '25 06:05 RalfHeitmann

We have now deployed 6.2.345 to production. A bit early to say if we see the same issue, but we have a new one (possibly related).

At one specific installation I have checked an errror terminating the application occured.

As seen by the Windows event log:

CoreCLR Version: 8.0.1625.21506
.NET Version: 8.0.16
Description: The process was terminated due to an unhandled exception.
Exception Info: System.InvalidOperationException: The stream does not support reading.
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at TwinCAT.Ads.Internal.NetworkStreamExtension.ReadExactBytes(Stream stream, Byte[] buffer, Int32 offset, Int32 length, String debugHint, CancellationToken cancel)
   at TwinCAT.Ads.Server.AmsServerNet.OnReceiveAmsCommands(IObserver`1 ob, String debugHint)
   at TwinCAT.Ads.Server.AmsServerImplBase.<>c__DisplayClass65_0.<createReceiverObservable>b__0(IObserver`1 ob)
--- End of stack trace from previous location ---
   at System.Reactive.PlatformServices.ExceptionServicesImpl.Rethrow(Exception exception)
   at System.Reactive.ExceptionHelpers.Throw(Exception exception)
   at System.Reactive.Stubs.<>c.<.cctor>b__2_1(Exception ex)
   at System.Reactive.AnonymousSafeObserver`1.OnError(Exception error)
   at System.Reactive.Sink`1.ForwardOnError(Exception error)
   at System.Reactive.ObserveOnObserverNew`1.DrainShortRunning(IScheduler recursiveScheduler)
   at System.Reactive.Concurrency.EventLoopScheduler.Run()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)

This specific installation has been running 6.1.260 for about two months without this terminating error, and now in four days with 6.2.345 this error has happened twice. Nothing is logged by the application (leg level information on ADS) as it just terminate.

vvoldsund avatar May 26 '25 08:05 vvoldsund

Haven't got any reports about this the recent weeks. I assume the issue is fixed. Please reopen or create a new issue if it happens again.

RalfHeitmann avatar Jul 21 '25 07:07 RalfHeitmann