SOEM timout in ecx

Hi,

I am running on a Linux Debian 10 with a RT-PREEMPT kernel and using an I219 NIC to communicate with several Ethercat slaves. I encounter some difficulties when calling “ec_receive_process_data(int timeout)”: the timeout does not seem to be properly used. Setting 1µs or 1ms does not make any difference to the time spent in ecx_waitinframe. It seems that ecx_waitinframe is a blocking function but should not be.

I can reproduce the issue on a different hardware, running the same OS.

I have noticed that the driver is working in a blocking mode. Should I configured it to be non-blocking and do you know how to do so ?

Sep 28 '20 09:09 grandphil

SOEM configures a non-blocking RAW socket in Linux. You should check the return values of the setsocketopt and check if your kernel driver supports all the flags that SOEM sets.

From nicdrv.c:

/* we use RAW packet socket, with packet type ETH_P_ECAT */

*psock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ECAT));

timeout.tv_sec = 0; timeout.tv_usec = 1; r = setsockopt(*psock, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout)); r = setsockopt(*psock, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout)); i = 1; r = setsockopt(psock, SOL_SOCKET, SO_DONTROUTE, &i, sizeof(i)); / connect socket to NIC by name */ strcpy(ifr.ifr_name, ifname); r = ioctl(psock, SIOCGIFINDEX, &ifr); ifindex = ifr.ifr_ifindex; strcpy(ifr.ifr_name, ifname); ifr.ifr_flags = 0; / reset flags of NIC interface */ r = ioctl(psock, SIOCGIFFLAGS, &ifr); / set flags of NIC interface, here promiscuous and broadcast */ ifr.ifr_flags = ifr.ifr_flags | IFF_PROMISC | IFF_BROADCAST; r = ioctl(psock, SIOCSIFFLAGS, &ifr); / bind socket to protocol, in this case RAW EtherCAT */ sll.sll_family = AF_PACKET; sll.sll_ifindex = ifindex; sll.sll_protocol = htons(ETH_P_ECAT); r = bind(*psock, (struct sockaddr *)&sll, sizeof(sll));

Sep 28 '20 20:09 ArthurKetels

This issue may be related to this open issue https://github.com/OpenEtherCATsociety/SOEM/issues/430.

I checked the return values by modifying the code snippet from nicdrv.c:

   /* we use RAW packet socket, with packet type ETH_P_ECAT */
   *psock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ECAT));

   timeout.tv_sec =  0;
   timeout.tv_usec = 1;
   r = setsockopt(*psock, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout));
   printf("setsockopt %d \n",r);

   r = setsockopt(*psock, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout));
   printf("setsockopt %d \n",r);

   i = 1;
   r = setsockopt(*psock, SOL_SOCKET, SO_DONTROUTE, &i, sizeof(i));
   printf("setsockopt %d \n",r);

   /* connect socket to NIC by name */
   strcpy(ifr.ifr_name, ifname);
   r = ioctl(*psock, SIOCGIFINDEX, &ifr);
   printf("ioctl %d \n",r);

   ifindex = ifr.ifr_ifindex;
   strcpy(ifr.ifr_name, ifname);
   ifr.ifr_flags = 0;
   /* reset flags of NIC interface */
   r = ioctl(*psock, SIOCGIFFLAGS, &ifr);
   printf("ioctl %d \n",r);

   /* set flags of NIC interface, here promiscuous and broadcast */
   ifr.ifr_flags = ifr.ifr_flags | IFF_PROMISC | IFF_BROADCAST;
   r = ioctl(*psock, SIOCSIFFLAGS, &ifr);
   printf("ioctl %d \n",r);

   /* bind socket to protocol, in this case RAW EtherCAT */
   sll.sll_family = AF_PACKET;
   sll.sll_ifindex = ifindex;
   sll.sll_protocol = htons(ETH_P_ECAT);
   r = bind(*psock, (struct sockaddr *)&sll, sizeof(sll));
   printf("bind %d \n",r);

   /* setup ethernet headers in tx buffers so we don't have to repeat it */
   for (i = 0; i < EC_MAXBUF; i++)
   {
      ec_setupheader(&(port->txbuf[i]));
      port->rxbufstat[i] = EC_BUF_EMPTY;
   }
   ec_setupheader(&(port->txbuf2));
   if (r == 0) rval = 1;

   return rval;
}

setsocketopt() returns 0 - successful, the ioctl() return all 0 - which indicates no error. bind() returns also 0. I assume that due to this result the driver supports the flags correctly?

The timeout parameter of ec_receive_process_data(timeout) does not get respected. If i set timeout to 1us, 10us, 50us, 60us, the function does never return after the requested timeout, but just when the answer arrived. In my case as also shown in the other issue approximately after 60us.

Oct 02 '20 06:10 fmros

Could you read back the SO_RCVTIMEO and SO_SNDTIMEO values with getsocketopt()? I remember something vaguely similar a year or 10 ago. The driver accepted the setsocketopt() but rounded down the values to zero. And zero means infinite timeout. I had to set something like 20us before the driver stopped rounding down.

An other thing to try is to add SOCK_NONBLOCK to socket(). *psock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK, htons(ETH_P_ECAT));

If it still does not work with above you could comment out the setsocketopt() lines. Perhaps they interfere with SOCK_NONBLOCK.

These things are really trial and error to work around bug introduced in drivers.

Oct 02 '20 07:10 ArthurKetels

Thank you!

I called the getsocketopt() like this:

   socklen_t len;
   r = getsockopt(*psock, SOL_SOCKET,SO_RCVTIMEO, &timeout, &len);
   printf("getsocketopt() SO_RCVTIMEO %d \n", r);
   printf("timeval.tvsec %jd \n", (intmax_t)timeout.tv_sec);
   printf("timeval.tvusec %jd \n", (intmax_t)timeout.tv_usec);

   r = getsockopt(*psock, SOL_SOCKET,SO_SNDTIMEO, &timeout, &len);
   printf("getsocketopt() SO_SNDTIMEO %d \n", r);
   printf("timeval.tvsec %jd \n", (intmax_t)timeout.tv_sec);
   printf("timeval.tvusec %jd \n", (intmax_t)timeout.tv_usec);

For both the two timeout cases below, the return is unstable, beeing sometimes 0 and sometimes -1, Case 1: timeout.tv_sec = 0; timeout.tv_usec = 1; ->if 0 is the return of getsockopt(), it sets tv_usec to 4000. Case 2: timeout.tv_sec = 0; timeout.tv_usec = 20; ->if 0 is the return of getsockopt(), it sets tv_usec to 4000.

Allthough the timeout seems still not beeing respected.

The

*psock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK, htons(ETH_P_ECAT));

results in

terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided

with and without the setsocket() calls.

Oct 02 '20 08:10 fmros

I just tested the code on my linux box running preempt-rt. I could replicate the timeout problem. Diving into to sock.c code from linux I found the reason. The timeouts are always rounded up to the nearest system tick. The default for Debian is 250Hz, ergo the minimum timeout value is 4ms.

I also verified SOCK_NONBLOCK, and it works as expected. No terminates or whatever. The timeouts from different SOEM functions is now adhered to within a few us.

I advise to try with default simple_test to see if the problem is not somewhere in your code.

What is very important, when using small timeout values, is to set the proper parameters of the NIC with ethtool (read drvcomment.txt). Otherwise coalescing times could be larger than the SOEM timeouts leading to strange behaviour.

Oct 02 '20 11:10 ArthurKetels

I was able to get

 *psock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK, htons(ETH_P_ECAT));

working. I can confirm that the timeout is working with this tweak on my Ubuntu 18.04 5.4.54-rt32 and Realtek NIC. After a missing response from the slaves, the master sends a BRD message after the specified timeout. Have to test this on another platform still. Can you confirm @grandphil too?

Oct 15 '20 10:10 fmros

@grandphil , can this be closed

Feb 09 '21 06:02 nakarlsson

Related?

A customer running Debian reported that their recv was blocking up to 10ms. They solved their issue by using, se below, would it make sense to add that to the nic_drv.c default?

bytesrx = recv(*stack->sock, (*stack->tempbuf), lp, MSG_DONTWAIT);

Feb 26 '21 17:02 nakarlsson

@grandphil , can this issue be closed?

May 28 '21 06:05 nakarlsson

@grandphil , can this issue be closed? did you test this on a platform yourself too?

May 28 '21 06:05 fmros

Hi all, I also have this problem. I'm running Debian 10 with 4.19.191-rt80 and when the connection is lost ec_receive_processdata() blocks for at least 4ms (probably related to system timer of 250Hz) before returning -1. I've tried the above solutions but none of them work. The controller is I210, with igb driver. I'm now compiling 5.4.115-rt57, (this time with 1000Hz system timer). I will update when finished.

Edit: Sorry for the noise, I thought that 4ms was somewhat related to the system timer, but actually is just the double of EC_TIMEOUTRET that I was using in ec_receive_processdata(). Reducing the timeout to 500 has reduced that to 1ms (I don't know why is the double).

I confirm that with my configuration (4.19.191-rt80 and I210) the proposed fix (add SOCK_NONBLOCK) has reduced the actual timeout from some multiple of 4ms (I see 20ms and 8ms) to ~1ms.

Thank you.

Jun 04 '21 07:06 matperc

Hi,

Seems I am a bit too late on the topic, but I thought I'd share my view on this.

I too observed the large timeout values and as Arthur states the issue is that the timeout interval of recv(), send() and poll() (unsure of select(), haven't checked) will be rounded up to the system clock granularity/system tick. I think this also relates to #171

It has been suggested to open the socket (or alter the socket) to be non-blocking. There are however two issues with this:

As the ecx_waitinframe_red() contains no sleep in the loop (which was previously designed to be within the recv() and send() system calls) you end up having a busy polling solution. With this proposal, running high priority tasks/threads (higher than kernel irq's, one could argue that would be a design flaw), on single core systems, it's easy to stall the kernel. As the kernel is never scheduled to serve your package, you'll be busy polling until you timeout and never receive your package. Unless you retry long enough for Linux's thread throtteling mechanism to throw you out, then the kernel gets time, but if we reach this, the system is broke. On multi core systems the kernel is not blocked (unless your running on the same core as the kernel or some other process is delaying the kernel), however missed packages can occur (observed this, can't explain the reason. Must be that the rx/tx irq was one the same core as the process, but later was migrated to another core due to load balancing).
This affects the complete socket, including send(). Looking through the stack it looks like it is assumed that send() is always successful. The return value of ecx_outframe/ecx_outframe_red is neglected in all cases and transmission has no timeout. Adding non-blocking behavior increases the chances that the message wasn't even sent, thus introducing a lot of busy polling for nothing.

For complition, nakarlsson, bytesrx = recv(*stack->sock, (*stack->tempbuf), lp, MSG_DONTWAIT) makes that specific call behave as NON_BLOCKING. I'd prefer this as it does not alter the behavior of send() as well.

So just adding NON_BLOCKING behavior is not, in my opinion, a universal fix. A combination of MSG_DONTWAIT + a sleeping mechanism would be preferable.

Instead of adding a usleep() to avoid busy polling, I added a ppol() in the receive case, as ppol() allows "precise timeouts" (not bound to system ticks), This was I am not bound to the usleep() granularity. This gives a bit of overhead, but at least timeouts are now respected and I get an early exit in case a package is ready. This option also allows redundant mode to be polled in the same system call.

Regarding send(), I'll start another issue.

Best regards Mathias

Feb 24 '22 09:02 mwh-newtec

Thx, Mathias for looking into this long-standing issue. I have tried multiple versions o f non-blocking tx/rx in SOEM. All of them had issues. Most of those issues you have encountered too. That is why we have the current compromise in nicdrv.c. For most users this work satisfactory.

I was not aware of the finer granularity of ppoll() over poll(). This could be a way out of our problems with accurate timing on one side and not overloading the system with busy-poll.

SOEM measures the propagation time for a packet to traverse the NIC and slaves. So it is able to predict exactly when a packet should arrive after it has been send. In an ideal world we would ask the kernel to send a packet and look for a receive at exactly x nanoseconds. When it did not receive anything by then the packet is lost (guaranteed). No need for polling at all. Unfortunately such a service is not available in the Linux kernel.

I am looking forward to your findings with regard to send()

Feb 24 '22 15:02 ArthurKetels

I encounter a similiar problem at the moment with Linux 5.4.124 and xenomai. @mwh-newtec would it be possible to provide your modification to the nicdrv with ppoll? I hope that this will fix my issues.

Jun 01 '22 16:06 unamehere

By the way is there any Idea why the receive timeout missing is slave count / frame size dependent? if I only use one slave everything works fine and the frame always gets received and no timeout ever happens. But with 2 or more slaves the timeouts start. Every Slave has 136b out and 36b input data and a calculated delay of around 800ns

Jun 01 '22 16:06 unamehere

@unamehere: Sorry for the delayed response. Attached you'll find my applied patch with ppoll. This was the easiest rewrite, however ppoll allows for a set of file descriptors, so in the case of redunadant mode, both sockets can be checked in the same call. This requires a bit more rewrite, but is certainly doable. Just wanted to get some experience with this approach before making more changes towards this.

Best regards Mathias

0001-Implemented-ppolling-on-receive.zip

Jun 09 '22 07:06 mwh-newtec

timout in ecx_waitinframe