ucx Endpoint timeout (error code -80) seen after upgrading to UCX 1.14.0

Open abellina opened this issue 1 year ago • 4 comments

Our CI detected an issue that I didn't see while manually testing with UCX 1.14.0: https://github.com/NVIDIA/spark-rapids/issues/7940

Essentially we are losing endpoints, and the only error we get in our listener is that there was a timeout.

This started to happen after we upgraded to UCX 1.14.0. The version we were using before was 1.12.1.

Any pointers on what may have changed related to different timeout (keepalive?) error handling would be great.

Mar 27 '23 15:03 abellina

@abellina is this issue still relevant?

Apr 13 '23 14:04 evgeny-leksikov

I have been able to repro it with UCX 1.14.0 and JUCX 1.12.1. I sent logs privately so I think it is still relevant.

Apr 13 '23 15:04 abellina

We have started seeing this issue as well with an upgrade to 1.14.1

Jun 12 '23 18:06 supunkamburugamuve

We have done several tests to try and repro this, especially around the keepalive configuration on the host and for UCX.

At this stage we are getting 0 failures, but the system has had a reboot, and we are running a version of UCX that @evgeny-leksikov had prepared. Our next step will be to move to UCX 1.15 as released; we'll update here if anything changes. Unfortunately, none of the investigation we have done has yielded a root cause.

Oct 31 '23 14:10 abellina
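
For anyone else trying to reproduce this, it may help to record the keepalive settings in effect on each node next to the test result, since the last comment mentions both host and UCX keepalive configuration. The following is only a rough sketch, not something taken from this thread: the helper names are made up for illustration, it assumes Linux (the standard `tcp_keepalive_*` sysctls under `/proc/sys/net/ipv4`), and it only picks up `UCX_*` environment variables whose name contains `KEEP` rather than assuming any particular UCX option is set.

```python
import json
import os
from pathlib import Path

# Standard Linux TCP keepalive sysctls (assumption: the host runs Linux).
HOST_SYSCTLS = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def ucx_keepalive_env(env):
    """Return the UCX_* variables whose name mentions KEEP (hypothetical helper)."""
    return {k: v for k, v in sorted(env.items())
            if k.startswith("UCX_") and "KEEP" in k}


def host_keepalive(proc_root="/proc/sys/net/ipv4"):
    """Read the host TCP keepalive sysctls; a missing file is reported as None."""
    out = {}
    for name in HOST_SYSCTLS:
        path = Path(proc_root) / name
        out[name] = path.read_text().strip() if path.is_file() else None
    return out


def snapshot(env=None, proc_root="/proc/sys/net/ipv4"):
    """Combine both views so they can be logged alongside a test run."""
    env = os.environ if env is None else env
    return {"ucx_env": ucx_keepalive_env(env), "host": host_keepalive(proc_root)}


if __name__ == "__main__":
    print(json.dumps(snapshot(), indent=2))
```

Running this on both sides of a failing connection and diffing the output is one way to rule out a configuration difference between nodes. It does not show UCX defaults that were never overridden through the environment.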
