Endpoint timeout (error code -80) seen after upgrading to UCX 1.14.0
Our CI detected an issue that I didn't see while manually testing with UCX 1.14.0: https://github.com/NVIDIA/spark-rapids/issues/7940
Essentially, we are losing endpoints, and the only error our listener receives is a timeout.
This started to happen after we upgraded to UCX 1.14.0. The version we were using before was 1.12.1.
Any pointers on what may have changed in timeout (keepalive?) error handling between these versions would be great.
@abellina is this issue still relevant?
I have been able to repro it with UCX 1.14.0 and JUCX 1.12.1. I sent logs privately so I think it is still relevant.
We have started seeing this issue as well after upgrading to 1.14.1.
We have done several tests to try and repro this, especially around the keepalive configuration on the host and for UCX.
At this stage we are getting 0 failures, but the system has since been rebooted, and we now have a version of UCX that @evgeny-leksikov prepared. Our next step will be to move to UCX 1.15 as released; we'll update here if anything changes. Unfortunately, none of the investigation we have done has yielded a root cause.
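For anyone else trying to reproduce this, it may help to record the keepalive settings on each node alongside the failure logs, so runs before and after a reboot or upgrade can be compared. Below is a minimal sketch, not something taken from UCX itself. It assumes a Linux host (the standard `tcp_keepalive_*` sysctls under `/proc/sys/net/ipv4`) and simply collects any environment variable that starts with `UCX_` and contains `KEEP`, without assuming which specific variables a given UCX version honors. The function names are hypothetical.

```python
import os
from pathlib import Path

# Standard Linux TCP keepalive sysctls (host-level, not UCX-specific).
HOST_KEEPALIVE_SYSCTLS = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def host_keepalive(proc_dir="/proc/sys/net/ipv4"):
    """Return the host TCP keepalive sysctls as ints (None if unreadable)."""
    values = {}
    for name in HOST_KEEPALIVE_SYSCTLS:
        try:
            values[name] = int((Path(proc_dir) / name).read_text().strip())
        except (OSError, ValueError):
            # Missing file (non-Linux host) or unexpected content.
            values[name] = None
    return values


def ucx_keepalive_env(environ=None):
    """Return UCX_* environment variables whose name mentions KEEP.

    This is a name filter only; it does not check whether the installed
    UCX version actually reads these variables.
    """
    environ = os.environ if environ is None else environ
    return {
        key: value
        for key, value in sorted(environ.items())
        if key.startswith("UCX_") and "KEEP" in key
    }


if __name__ == "__main__":
    print("host:", host_keepalive())
    print("ucx env:", ucx_keepalive_env())
```

Running this on each executor host before a test run gives a small snapshot to attach to the logs. It only reports what is set in the environment, so UCX defaults that are not overridden will not show up here.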