asynOctetSetOutputEos and asynOctetSetInputEos block IOC init and exit if no device connected
Hi,
Setting the OEOS (or IEOS) in the st.cmd file of the testConnect App using an IP that will never connect ends in the IOC blocked a considerable amount of time during init process and at exit. I am running debian/buster and next configurations: a) base 3.15.7 + asyn R4-37 b) base 3.15.7 + asyn master (epics-modules/asyn@0679e9f3f757...) c) base 7.0.3.1(epics-base/epics-base@6feb1c788d...) + asyn master (epics-modules/asyn@0679e9f...)
The modified st.cmd file of the testConnect App looks as follows:
dbLoadDatabase("../../dbd/testConnect.dbd")
testConnect_registerRecordDeviceDriver(pdbbase)
# This is a bogus IP address that will never connect
drvAsynIPPortConfigure("IPPort", "164.54.160.220:20", 0, 0, 0);
asynSetTraceIOMask("IPPort",0,0xFF)
asynOctetSetOutputEos("IPPort",0,"\r")
iocInit()
No matter what config (a, b or c) i use, the output is very similar, the init takes 2min+10 seconds and the exit between 2 min and 16 min.
Next output belongs to test case c).
vagrant@buster:~/development/asyn/iocBoot/ioctestConnect$ ../../bin/linux-x86_64/testConnect st.cmd
dbLoadDatabase("../../dbd/testConnect.dbd")
testConnect_registerRecordDeviceDriver(pdbbase)
# This is a bogus IP address that will never connect
drvAsynIPPortConfigure("IPPort", "164.54.160.220:20", 0, 0, 0);
asynSetTraceMask("IPPort",0,0xFF)
asynOctetSetOutputEos("IPPort",0,"\r")
2020/02/13 16:52:04.161 IPPort addr -1 queueRequest priority 3 not lockHolder
2020/02/13 16:54:14.301 asynManager connect queueCallback port:IPPort
2020/02/13 16:54:14.301 IPPort set Eos 1
iocInit()
Starting iocInit
############################################################################
## EPICS R7.0.3.2-DEV
## Rev. R7.0.3.1-78-g6feb1c788d261376d3b1
############################################################################
2020/02/13 16:54:14.302 IPPort -1 autoConnect
2020/02/13 16:54:14.302 Attempting to connect to 164.54.160.220:20 reason:0 fd:-1
iocRun: All initialization complete
epics> exit
2020/02/13 16:54:20.529 IPPort lockPort
2020/02/13 16:54:34.296 IPPort addr -1 queueRequest priority 3 not lockHolder
2020/02/13 16:56:25.374 IPPort -1 autoConnect could not connect: Can't connect to 164.54.160.220:20: Connection timed out
2020/02/13 16:56:25.374 asynManager connect queueCallback port:IPPort
2020/02/13 16:56:25.374 Attempting to connect to 164.54.160.220:20 reason:0 fd:-1
2020/02/13 16:58:36.438 IPPort -1 autoConnect
2020/02/13 16:58:36.438 Attempting to connect to 164.54.160.220:20 reason:0 fd:-1
2020/02/13 16:58:56.434 IPPort addr -1 queueRequest priority 3 not lockHolder
2020/02/13 17:00:47.510 asynManager connect queueCallback port:IPPort
2020/02/13 17:00:47.510 Attempting to connect to 164.54.160.220:20 reason:0 fd:-1
2020/02/13 17:02:58.584 IPPort -1 autoConnect
2020/02/13 17:02:58.584 IPPort unlockPort
2020/02/13 17:02:58.584 Attempting to connect to 164.54.160.220:20 reason:0 fd:-1
vagrant@buster:~/development/asyn/iocBoot/ioctestConnect$
I ran your test on Centos 7 with base 7.0.3.1 and asyn master, the same as your case c).
This is what I get:
corvette:asyn/iocBoot/ioctestConnect>../../bin/linux-x86_64/testConnect st1.cmd
dbLoadDatabase("../../dbd/testConnect.dbd")
testConnect_registerRecordDeviceDriver(pdbbase)
# This is a bogus IP address that will never connect
drvAsynIPPortConfigure("IPPort", "164.54.160.220:20", 0, 0, 0);
asynSetTraceMask("IPPort",0,0xFF)
asynOctetSetOutputEos("IPPort",0,"\r")
2020/02/13 16:20:50.919 IPPort addr -1 queueRequest priority 3 not lockHolder
2020/02/13 16:20:53.435 asynManager connect queueCallback port:IPPort
2020/02/13 16:20:53.435 IPPort set Eos 1
2020/02/13 16:20:53.435 IPPort -1 autoConnect
2020/02/13 16:20:53.435 Attempting to connect to 164.54.160.220:20 reason:0 fd:-1
iocInit()
Starting iocInit
############################################################################
## EPICS R7.0.3.1
## EPICS Base built Nov 2 2019
############################################################################
cas warning: Configured TCP port was unavailable.
cas warning: Using dynamically assigned TCP port 44792,
cas warning: but now two or more servers share the same UDP port.
cas warning: Depending on your IP kernel this server may not be
cas warning: reachable with UDP unicast (a host's IP in EPICS_CA_ADDR_LIST)
iocRun: All initialization complete
epics> 2020/02/13 16:20:56.441 IPPort -1 autoConnect could not connect: Can't connect to 164.54.160.220:20: No route to host
2020/02/13 16:20:56.441 IPPort -1 autoConnect
2020/02/13 16:20:56.441 Attempting to connect to 164.54.160.220:20 reason:0 fd:-1
epics> exit
2020/02/13 16:20:59.137 IPPort lockPort
2020/02/13 16:20:59.447 IPPort unlockPort
That test took 130 seconds before iocInit on your system, and 2.5 seconds on mine.
That test took 518 seconds after typing "exit" on your system, and 0.3 seconds on mine.
Here is another test without EPICS or asyn, just using telnet.
corvette:asyn/iocBoot/ioctestConnect>date ; telnet 164.54.160.220 ; date
Thu Feb 13 16:28:33 CST 2020
Trying 164.54.160.220...
telnet: connect to address 164.54.160.220: No route to host
Thu Feb 13 16:28:36 CST 2020
On my system that command takes 3 seconds to time out. How long does it take on your system?
OK, I think I may understand why we are setting different results.
I just did some tests on an Ubuntu 18 system with 2 network cards:
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 164.54.160.227 netmask 255.255.255.0 broadcast 164.54.160.255
inet6 fe80::ae99:ddf6:a194:866c prefixlen 64 scopeid 0x20<link>
ether 54:bf:64:66:64:fd txqueuelen 1000 (Ethernet)
RX packets 351755361 bytes 34844739867 (34.8 GB)
RX errors 0 dropped 22 overruns 0 frame 0
TX packets 550889385 bytes 500449274367 (500.4 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 16 memory 0x92f00000-92f20000
enp23s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 192.168.0.1 netmask 255.255.255.0 broadcast 192.168.0.255
inet6 fe80::b49a:5672:707a:2bd5 prefixlen 64 scopeid 0x20<link>
ether b4:96:91:2e:0a:76 txqueuelen 1000 (Ethernet)
RX packets 107953776 bytes 8904709905 (8.9 GB)
RX errors 333246 dropped 21 overruns 0 frame 333246
TX packets 471032 bytes 71899456 (71.8 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
On the first network 164.54.160.220 does not exist, and on the second network 192.168.0.10 does not exist.
If I run the testConnect tests as shown previously to 164.54.160.220:20 or 192.168.0.10:20 I see the fast timeouts, 2.5 seconds before iocInit, and 0.3 seconds to exit.
However, if I try to connect to 192.168.1.10:20 I see very different results. That is an IP address that also does not exist, but it is not directly reachable by either of the 2 network cards, because it is not on their subnets.
This is what I see:
TahoeU18:/corvette/home/epics/devel/asyn/iocBoot/ioctestConnect> ../../bin/linux-x86_64-ub18/testConnect st1.cmd
dbLoadDatabase("../../dbd/testConnect.dbd")
testConnect_registerRecordDeviceDriver(pdbbase)
# This is a bogus IP address that will never connect
drvAsynIPPortConfigure("IPPort", "192.168.1.10:20", 0, 0, 0);
asynSetTraceMask("IPPort",0,0xFF)
asynOctetSetOutputEos("IPPort",0,"\r")
2020/02/14 01:19:31.185 IPPort addr -1 queueRequest priority 3 not lockHolder
2020/02/14 01:21:41.377 asynManager connect queueCallback port:IPPort
2020/02/14 01:21:41.377 IPPort set Eos 1
2020/02/14 01:21:41.377 IPPort -1 autoConnect
2020/02/14 01:21:41.377 Attempting to connect to 192.168.1.10:20 reason:0 fd:-1
iocInit()
Starting iocInit
############################################################################
## EPICS R7.0.3.1
## EPICS Base built Nov 25 2019
############################################################################
iocRun: All initialization complete
epics> 2020/02/14 01:22:01.372 IPPort addr -1 queueRequest priority 3 not lockHolder
epics> exit
2020/02/14 01:22:21.299 IPPort lockPort
2020/02/14 01:23:52.449 IPPort -1 autoConnect could not connect: Can't connect to 192.168.1.10:20: Connection timed out
2020/02/14 01:23:52.449 IPPort unlockPort
2020/02/14 01:23:52.449 asynManager connect queueCallback port:IPPort
2020/02/14 01:23:52.449 Attempting to connect to 192.168.1.10:20 reason:0 fd:-1
The time before iocInit is now 130 seconds, the same as @ebernalru. The time to exit is 90 seconds, much longer than the 0.3 seconds when the address is on a reachable subnet.
When your real IOC is hanging are the addresses it is trying to reach on the local subnet, or on another subnet?
@norumwe12 can you comment on this issue?
Michael Davidsaver suggested that we could have a non-blocking call to connect() and then use select() to determine if it has connected and control the timeout ourselves. Is this possible?
If not then is there another way to reduce the timeout on a per-port basis? I think we could adjust it globally on Linux with /proc/sys/net/ipv4/tcp_syn_retries, which is 6 on my system. I think that is why it is sending the SYN messages 6 times with delays of 1, 2, 4, 8, 16, 32, and 64 seconds.
Hi Mark, I just ran telnet, and it takes 130s to timeout. That is exactly the time that takes those 6 retries. tcp_syn_retries = 6 is the default value at least on debian/buster, centos7 and ubuntu/trusty64. Yes, I am playing on different subnets. What about setting TCP_USER_TIMEOUT? Or, is there any way we can modify the delay time between retries? Setting it to 1s instead of double it each try?
@mdavidsaver wrote:
I happened to notice that Base has a hidden connectWithTimeout() defined for RTEMS. The comment at the top of the file suggests that there is a similarly named function available in vxWorks. This implementation would also likely work on Linux and BSD derivatives. And possibly on winsock with some modification (DWORD instead of timeval). https://github.com/epics-base/epics-base/blob/90c0f5c48d133cfd0b394cd724eddbca6ff41a1b/modules/libcom/RTEMS/rtems_util.c#L26
This sounds promising so I just created a branch of asyn to test it. I used the existing asynSetAutoConnectTimeout to set a timeout, and added pasynManager::getAutoConnectTimeout to be able to read it. I modified drvAsynIPPort::connectIt to use that timeout to set the socket timeout value using code similar to that in the link from @mdavidsaver :
if (tty->socketType != SOCK_DGRAM) {
double connectTimeout;
struct timeval saveTV, connectTV;
socklen_t svlen = sizeof saveTV;
pasynManager->getAutoConnectTimeout(&connectTimeout);
connectTV.tv_sec = (time_t)connectTimeout;
connectTV.tv_usec = (suseconds_t)((connectTimeout - connectTV.tv_sec)*1000000);
asynPrint(pasynUser, ASYN_TRACE_ERROR, "Calling setsockopt SO_RCVTIMEO tv_sec=%d tv_usec=%d\n", connectTV.tv_sec, connectTV.tv_usec);
if (getsockopt (fd, SOL_SOCKET, SO_RCVTIMEO, (char *)&saveTV, &svlen) < 0) {
asynPrint(pasynUser, ASYN_TRACE_ERROR, "connectIt, error calling getsockopt for SO_RECVTIMEO: %s\n", strerror(SOCKERRNO));
}
if (setsockopt (fd, SOL_SOCKET, SO_RCVTIMEO, (char *)&connectTV, sizeof connectTV) < 0) {
asynPrint(pasynUser, ASYN_TRACE_ERROR, "connectIt, error calling setsockopt for SO_RECVTIMEO: %s\n", strerror(SOCKERRNO));
}
asynPrint(pasynUser, ASYN_TRACE_ERROR, "connectIt, calling connect()\n");
if (connect(fd, &tty->farAddr.oa.sa, (int)tty->farAddrSize) < 0) {
asynPrint(pasynUser, ASYN_TRACE_ERROR, "connectIt, connect returned error: %s\n", strerror(SOCKERRNO));
epicsSnprintf(pasynUser->errorMessage,pasynUser->errorMessageSize,
"Can't connect to %s: %s",
tty->IPDeviceName, strerror(SOCKERRNO));
epicsSocketDestroy(fd);
if (tty->flags & FLAG_DONE_LOOKUP)
tty->flags |= FLAG_NEED_LOOKUP;
return asynError;
}
if (setsockopt (fd, SOL_SOCKET, SO_RCVTIMEO, (char *)&saveTV, sizeof saveTV) < 0) {
asynPrint(pasynUser, ASYN_TRACE_ERROR, "connectIt, error calling setsockopt for SO_RECVTIMEO: %s\n", strerror(SOCKERRNO));
}
}
I used this startup script to test it:
corvette:asyn/iocBoot/ioctestConnect>more st3.cmd
dbLoadDatabase("../../dbd/testConnect.dbd")
testConnect_registerRecordDeviceDriver(pdbbase)
# This is a bogus IP address that will never connect
asynSetAutoConnectTimeout 5
drvAsynIPPortConfigure("IPPort", "192.168.1.10:20", 0, 0, 0);
asynSetTraceMask("IPPort",0,0xFF)
iocInit()
If this mechanism worked I would expect that the call to connect() would time out in 5 seconds. However, this is what I see:
corvette:asyn/iocBoot/ioctestConnect>../../bin/linux-x86_64/testConnect st3.cmd
dbLoadDatabase("../../dbd/testConnect.dbd")
testConnect_registerRecordDeviceDriver(pdbbase)
# This is a bogus IP address that will never connect
asynSetAutoConnectTimeout 5
drvAsynIPPortConfigure("IPPort", "192.168.1.10:20", 0, 0, 0);
2020/02/17 12:09:46.196 Calling setsockopt SO_RCVTIMEO tv_sec=5 tv_usec=0
2020/02/17 12:09:46.196 connectIt, calling connect()
asynSetTraceMask("IPPort",0,0xFF)
iocInit()
Starting iocInit
############################################################################
## EPICS R7.0.3.1
## EPICS Base built Nov 2 2019
############################################################################
cas warning: Configured TCP port was unavailable.
cas warning: Using dynamically assigned TCP port 40795,
cas warning: but now two or more servers share the same UDP port.
cas warning: Depending on your IP kernel this server may not be
cas warning: reachable with UDP unicast (a host's IP in EPICS_CA_ADDR_LIST)
iocRun: All initialization complete
epics> 2020/02/17 12:11:53.483 connectIt, connect returned error: Connection timed out
2020/02/17 12:11:53.483 IPPort -1 autoConnect
2020/02/17 12:11:53.483 Attempting to connect to 192.168.1.10:20 reason:0 fd:-1
2020/02/17 12:11:53.483 Calling setsockopt SO_RCVTIMEO tv_sec=5 tv_usec=0
2020/02/17 12:11:53.483 connectIt, calling connect()
So the messages printed immediately before and after the call to connect() are these:
2020/02/17 12:09:46.196 connectIt, calling connect()
2020/02/17 12:11:53.483 connectIt, connect returned error: Connection timed out
So it took 127 seconds, which is exactly what is expected from the exponential backoffs using timeouts of 1, 2, 4, 8, 16, 32, and 64 seconds, for which the sum is 127 seconds.
So it appears that calling setsockopt with a timeval of 5 seconds has not changed the behavior of the connect() timeout at all on Linux.
Hello. Is there any update on this? I encountered the same problem. It also happens with my own custom asynPortDriver when I call pasynOctetSyncIO->setInputEos(asynUser *pasynUser, const char *eos,int eoslen) on a port newly created with drvAsynIPPortConfigure. With asynTraceMask I get this:
2024/09/25 18:19:45.225 KPOWTESTF_TCP lockPort
2024/09/25 18:21:55.843 KPOWTESTF_TCP -1 autoConnect
2024/09/25 18:21:55.843 KPOWTESTF_TCP set Eos 1
So asyn tries to lock the port and gets a timeout after 130 seconds, the same as you did in your tests. My problem is, that if I have a huge number of devices and a part of them is offline, these timeouts will add up in case of IOC reboots. Any idea how to work around this?
Hi Mark. I tested your code and it works! The problem with your test was, that we have to use SO_SNDTIMEO instead of SO_RCVTIMEO. SO_RCVTIMEO will affect read, recvmsg and accept, while SO_SNDTIMEO will affect send, sendmsg and connect. I attached a patch (as tar.gz because uploading *.patch files from linux is bugged on GitHub) with the changes you suggested. One question remains to me. Why do we have to wait for connection in the first place to set the input / output EOS? Isn't this independent of the connection status? commit-7604059.patch.tar.gz
Why do we have to wait for connection in the first place to set the input / output EOS? Isn't this independent of the connection status?
That is a reasonable question since the R4-22 release notes (2013) say this: asynOctetSyncIO Use simple lock/unlock operations rather than queueLockPort/queueUnlockPort for end-of-string manipulations (setInputEos, getInputEos, setOutputEos, getOutputEos). This ensures that these operations can take place even with the device disconnected.
I think the explanation is that there are 2 cases:
- The port has previously failed to connect and one calls asynOctet->setInputEos. In this case the port is not currently locked and the setInputEos will succeed quickly because lockPort will succeed.
- The call to pasynOctet->setInputEos occurs while the initial connect() call in drvAsynIPPort is hung. In that case the port is locked, and so setInputEos needs to wait for the connect to time out.
Thanks for finding the error in my previous attempt to shorten the timeout for connections to ports on other subnets. I will test and push the fix.
@ebernalru I have created a PR (#210) based on your patch file. I removed the debugging print statements from the patch version, and I changed the cast for the timeval struct member to (long) so it would also work on Windows.