timout in ecx_waitinframe
Hi,
I am running on a Linux Debian 10 with a RT-PREEMPT kernel and using an I219 NIC to communicate with several Ethercat slaves. I encounter some difficulties when calling “ec_receive_process_data(int timeout)”: the timeout does not seem to be properly used. Setting 1µs or 1ms does not make any difference to the time spent in ecx_waitinframe. It seems that ecx_waitinframe is a blocking function but should not be.
I can reproduce the issue on a different hardware, running the same OS.
I have noticed that the driver is working in a blocking mode. Should I configured it to be non-blocking and do you know how to do so ?
SOEM configures a non-blocking RAW socket in Linux. You should check the return values of the setsocketopt and check if your kernel driver supports all the flags that SOEM sets.
From nicdrv.c:
/* we use RAW packet socket, with packet type ETH_P_ECAT */
*psock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ECAT));
timeout.tv_sec = 0; timeout.tv_usec = 1; r = setsockopt(*psock, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout)); r = setsockopt(*psock, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout)); i = 1; r = setsockopt(psock, SOL_SOCKET, SO_DONTROUTE, &i, sizeof(i)); / connect socket to NIC by name */ strcpy(ifr.ifr_name, ifname); r = ioctl(psock, SIOCGIFINDEX, &ifr); ifindex = ifr.ifr_ifindex; strcpy(ifr.ifr_name, ifname); ifr.ifr_flags = 0; / reset flags of NIC interface */ r = ioctl(psock, SIOCGIFFLAGS, &ifr); / set flags of NIC interface, here promiscuous and broadcast */ ifr.ifr_flags = ifr.ifr_flags | IFF_PROMISC | IFF_BROADCAST; r = ioctl(psock, SIOCSIFFLAGS, &ifr); / bind socket to protocol, in this case RAW EtherCAT */ sll.sll_family = AF_PACKET; sll.sll_ifindex = ifindex; sll.sll_protocol = htons(ETH_P_ECAT); r = bind(*psock, (struct sockaddr *)&sll, sizeof(sll));
This issue may be related to this open issue https://github.com/OpenEtherCATsociety/SOEM/issues/430.
I checked the return values by modifying the code snippet from nicdrv.c:
/* we use RAW packet socket, with packet type ETH_P_ECAT */
*psock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ECAT));
timeout.tv_sec = 0;
timeout.tv_usec = 1;
r = setsockopt(*psock, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout));
printf("setsockopt %d \n",r);
r = setsockopt(*psock, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout));
printf("setsockopt %d \n",r);
i = 1;
r = setsockopt(*psock, SOL_SOCKET, SO_DONTROUTE, &i, sizeof(i));
printf("setsockopt %d \n",r);
/* connect socket to NIC by name */
strcpy(ifr.ifr_name, ifname);
r = ioctl(*psock, SIOCGIFINDEX, &ifr);
printf("ioctl %d \n",r);
ifindex = ifr.ifr_ifindex;
strcpy(ifr.ifr_name, ifname);
ifr.ifr_flags = 0;
/* reset flags of NIC interface */
r = ioctl(*psock, SIOCGIFFLAGS, &ifr);
printf("ioctl %d \n",r);
/* set flags of NIC interface, here promiscuous and broadcast */
ifr.ifr_flags = ifr.ifr_flags | IFF_PROMISC | IFF_BROADCAST;
r = ioctl(*psock, SIOCSIFFLAGS, &ifr);
printf("ioctl %d \n",r);
/* bind socket to protocol, in this case RAW EtherCAT */
sll.sll_family = AF_PACKET;
sll.sll_ifindex = ifindex;
sll.sll_protocol = htons(ETH_P_ECAT);
r = bind(*psock, (struct sockaddr *)&sll, sizeof(sll));
printf("bind %d \n",r);
/* setup ethernet headers in tx buffers so we don't have to repeat it */
for (i = 0; i < EC_MAXBUF; i++)
{
ec_setupheader(&(port->txbuf[i]));
port->rxbufstat[i] = EC_BUF_EMPTY;
}
ec_setupheader(&(port->txbuf2));
if (r == 0) rval = 1;
return rval;
}
setsocketopt() returns 0 - successful, the ioctl() return all 0 - which indicates no error. bind() returns also 0. I assume that due to this result the driver supports the flags correctly?
The timeout parameter of ec_receive_process_data(timeout) does not get respected. If i set timeout to 1us, 10us, 50us, 60us, the function does never return after the requested timeout, but just when the answer arrived. In my case as also shown in the other issue approximately after 60us.
Could you read back the SO_RCVTIMEO and SO_SNDTIMEO values with getsocketopt()? I remember something vaguely similar a year or 10 ago. The driver accepted the setsocketopt() but rounded down the values to zero. And zero means infinite timeout. I had to set something like 20us before the driver stopped rounding down.
An other thing to try is to add SOCK_NONBLOCK to socket(). *psock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK, htons(ETH_P_ECAT));
If it still does not work with above you could comment out the setsocketopt() lines. Perhaps they interfere with SOCK_NONBLOCK.
These things are really trial and error to work around bug introduced in drivers.
Thank you!
I called the getsocketopt() like this:
socklen_t len;
r = getsockopt(*psock, SOL_SOCKET,SO_RCVTIMEO, &timeout, &len);
printf("getsocketopt() SO_RCVTIMEO %d \n", r);
printf("timeval.tvsec %jd \n", (intmax_t)timeout.tv_sec);
printf("timeval.tvusec %jd \n", (intmax_t)timeout.tv_usec);
r = getsockopt(*psock, SOL_SOCKET,SO_SNDTIMEO, &timeout, &len);
printf("getsocketopt() SO_SNDTIMEO %d \n", r);
printf("timeval.tvsec %jd \n", (intmax_t)timeout.tv_sec);
printf("timeval.tvusec %jd \n", (intmax_t)timeout.tv_usec);
For both the two timeout cases below, the return is unstable, beeing sometimes 0 and sometimes -1, Case 1: timeout.tv_sec = 0; timeout.tv_usec = 1; ->if 0 is the return of getsockopt(), it sets tv_usec to 4000. Case 2: timeout.tv_sec = 0; timeout.tv_usec = 20; ->if 0 is the return of getsockopt(), it sets tv_usec to 4000.
Allthough the timeout seems still not beeing respected.
The
*psock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK, htons(ETH_P_ECAT));
results in
terminate called after throwing an instance of 'std::system_error'
what(): Resource deadlock avoided
with and without the setsocket() calls.
I just tested the code on my linux box running preempt-rt. I could replicate the timeout problem. Diving into to sock.c code from linux I found the reason. The timeouts are always rounded up to the nearest system tick. The default for Debian is 250Hz, ergo the minimum timeout value is 4ms.
I also verified SOCK_NONBLOCK, and it works as expected. No terminates or whatever. The timeouts from different SOEM functions is now adhered to within a few us.
I advise to try with default simple_test to see if the problem is not somewhere in your code.
What is very important, when using small timeout values, is to set the proper parameters of the NIC with ethtool (read drvcomment.txt). Otherwise coalescing times could be larger than the SOEM timeouts leading to strange behaviour.
I was able to get
*psock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK, htons(ETH_P_ECAT));
working.
I can confirm that the timeout is working with this tweak on my Ubuntu 18.04 5.4.54-rt32 and Realtek NIC. After a missing response from the slaves, the master sends a BRD message after the specified timeout.
Have to test this on another platform still. Can you confirm @grandphil too?
@grandphil , can this be closed
Related?
A customer running Debian reported that their recv was blocking up to 10ms. They solved their issue by using, se below, would it make sense to add that to the nic_drv.c default?
bytesrx = recv(*stack->sock, (*stack->tempbuf), lp, MSG_DONTWAIT);
@grandphil , can this issue be closed?
@grandphil , can this issue be closed? did you test this on a platform yourself too?
Hi all, I also have this problem. I'm running Debian 10 with 4.19.191-rt80 and when the connection is lost ec_receive_processdata() blocks for at least 4ms (probably related to system timer of 250Hz) before returning -1. I've tried the above solutions but none of them work. The controller is I210, with igb driver. I'm now compiling 5.4.115-rt57, (this time with 1000Hz system timer). I will update when finished.
Edit: Sorry for the noise, I thought that 4ms was somewhat related to the system timer, but actually is just the double of EC_TIMEOUTRET that I was using in ec_receive_processdata(). Reducing the timeout to 500 has reduced that to 1ms (I don't know why is the double).
I confirm that with my configuration (4.19.191-rt80 and I210) the proposed fix (add SOCK_NONBLOCK) has reduced the actual timeout from some multiple of 4ms (I see 20ms and 8ms) to ~1ms.
Thank you.
Hi,
Seems I am a bit too late on the topic, but I thought I'd share my view on this.
I too observed the large timeout values and as Arthur states the issue is that the timeout interval of recv(), send() and poll() (unsure of select(), haven't checked) will be rounded up to the system clock granularity/system tick. I think this also relates to #171
It has been suggested to open the socket (or alter the socket) to be non-blocking. There are however two issues with this:
-
As the ecx_waitinframe_red() contains no sleep in the loop (which was previously designed to be within the recv() and send() system calls) you end up having a busy polling solution. With this proposal, running high priority tasks/threads (higher than kernel irq's, one could argue that would be a design flaw), on single core systems, it's easy to stall the kernel. As the kernel is never scheduled to serve your package, you'll be busy polling until you timeout and never receive your package. Unless you retry long enough for Linux's thread throtteling mechanism to throw you out, then the kernel gets time, but if we reach this, the system is broke. On multi core systems the kernel is not blocked (unless your running on the same core as the kernel or some other process is delaying the kernel), however missed packages can occur (observed this, can't explain the reason. Must be that the rx/tx irq was one the same core as the process, but later was migrated to another core due to load balancing).
-
This affects the complete socket, including send(). Looking through the stack it looks like it is assumed that send() is always successful. The return value of ecx_outframe/ecx_outframe_red is neglected in all cases and transmission has no timeout. Adding non-blocking behavior increases the chances that the message wasn't even sent, thus introducing a lot of busy polling for nothing.
For complition, nakarlsson, bytesrx = recv(*stack->sock, (*stack->tempbuf), lp, MSG_DONTWAIT) makes that specific call behave as NON_BLOCKING. I'd prefer this as it does not alter the behavior of send() as well.
So just adding NON_BLOCKING behavior is not, in my opinion, a universal fix. A combination of MSG_DONTWAIT + a sleeping mechanism would be preferable.
Instead of adding a usleep() to avoid busy polling, I added a ppol() in the receive case, as ppol() allows "precise timeouts" (not bound to system ticks), This was I am not bound to the usleep() granularity. This gives a bit of overhead, but at least timeouts are now respected and I get an early exit in case a package is ready. This option also allows redundant mode to be polled in the same system call.
Regarding send(), I'll start another issue.
Best regards Mathias
Thx, Mathias for looking into this long-standing issue. I have tried multiple versions o f non-blocking tx/rx in SOEM. All of them had issues. Most of those issues you have encountered too. That is why we have the current compromise in nicdrv.c. For most users this work satisfactory.
I was not aware of the finer granularity of ppoll() over poll(). This could be a way out of our problems with accurate timing on one side and not overloading the system with busy-poll.
SOEM measures the propagation time for a packet to traverse the NIC and slaves. So it is able to predict exactly when a packet should arrive after it has been send. In an ideal world we would ask the kernel to send a packet and look for a receive at exactly x nanoseconds. When it did not receive anything by then the packet is lost (guaranteed). No need for polling at all. Unfortunately such a service is not available in the Linux kernel.
I am looking forward to your findings with regard to send()
I encounter a similiar problem at the moment with Linux 5.4.124 and xenomai. @mwh-newtec would it be possible to provide your modification to the nicdrv with ppoll? I hope that this will fix my issues.
By the way is there any Idea why the receive timeout missing is slave count / frame size dependent? if I only use one slave everything works fine and the frame always gets received and no timeout ever happens. But with 2 or more slaves the timeouts start. Every Slave has 136b out and 36b input data and a calculated delay of around 800ns
@unamehere: Sorry for the delayed response. Attached you'll find my applied patch with ppoll. This was the easiest rewrite, however ppoll allows for a set of file descriptors, so in the case of redunadant mode, both sockets can be checked in the same call. This requires a bit more rewrite, but is certainly doable. Just wanted to get some experience with this approach before making more changes towards this.
Best regards Mathias