beam icon indicating copy to clipboard operation
beam copied to clipboard

[Bug]: grpc connection issue

Open yeandy opened this issue 2 years ago • 3 comments

What happened?

Seeing this error in Python PostCommits

21:36:07 Exception in thread read_grpc_client_inputs:
21:36:07 Traceback (most recent call last):
21:36:07   File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
21:36:07     self.run()
21:36:07   File "/usr/lib/python3.7/threading.py", line 870, in run
21:36:07     self._target(*self._args, **self._kwargs)
21:36:07   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37_PR/src/sdks/python/apache_beam/runners/worker/data_plane.py", line 671, in <lambda>
21:36:07     target=lambda: self._read_inputs(elements_iterator),
21:36:07   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37_PR/src/sdks/python/apache_beam/runners/worker/data_plane.py", line 654, in _read_inputs
21:36:07     for elements in elements_iterator:
21:36:07   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37_PR/src/build/gradleenv/2022703441/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
21:36:07     return self._next()
21:36:07   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python37_PR/src/build/gradleenv/2022703441/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
21:36:07     raise self
21:36:07 grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
21:36:07 	status = StatusCode.CANCELLED
21:36:07 	details = "Multiplexer hanging up"
21:36:07 	debug_error_string = "{"created":"@1660012354.173326998","description":"Error received from peer ipv6:[::1]:37919","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Multiplexer hanging up","grpc_status":1}"

Or like this

15:30:23 Exception in thread read_state:
15:30:24 Traceback (most recent call last):
15:30:24   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
15:30:24 Exception in thread run_worker_1-1:
15:30:24 Traceback (most recent call last):
15:30:24   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
15:30:24 ERROR:apache_beam.runners.worker.data_plane:Failed to read inputs in the data plane.
15:30:24 Traceback (most recent call last):
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/sdks/python/apache_beam/runners/worker/data_plane.py", line 654, in _read_inputs
15:30:24     for elements in elements_iterator:
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/build/gradleenv/2022703442/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
15:30:24     return self._next()
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/build/gradleenv/2022703442/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
15:30:24     raise self
15:30:24 grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
15:30:24 	status = StatusCode.UNAVAILABLE
15:30:24 	details = "Socket closed"
15:30:24 	debug_error_string = "{"created":"@1659990624.371994304","description":"Error received from peer ipv6:[::1]:42753","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Socket closed","grpc_status":14}"
15:30:24 >
15:30:24 Exception in thread read_grpc_client_inputs:
15:30:24 Traceback (most recent call last):
15:30:24   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
15:30:24     self.run()
15:30:24   File "/usr/lib/python3.8/threading.py", line 870, in run
15:30:24     self.run()
15:30:24   File "/usr/lib/python3.8/threading.py", line 870, in run
15:30:24     self.run()
15:30:24   File "/usr/lib/python3.8/threading.py", line 870, in run
15:30:24     self._target(*self._args, **self._kwargs)
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 251, in run
15:30:24     self._target(*self._args, **self._kwargs)
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 1008, in pull_responses
15:30:24     self._target(*self._args, **self._kwargs)
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/sdks/python/apache_beam/runners/worker/data_plane.py", line 671, in <lambda>
15:30:24     for work_request in self._control_stub.Control(get_responses()):
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/build/gradleenv/2022703442/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
15:30:24     for response in responses:
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/build/gradleenv/2022703442/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
15:30:24     target=lambda: self._read_inputs(elements_iterator),
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/sdks/python/apache_beam/runners/worker/data_plane.py", line 654, in _read_inputs
15:30:24     return self._next()
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/build/gradleenv/2022703442/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
15:30:24     return self._next()
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/build/gradleenv/2022703442/lib/python3.8/site-packages/grpc/_channel.py", line 826, in _next
15:30:24     raise self
15:30:24 grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
15:30:24 	status = StatusCode.UNAVAILABLE
15:30:24 	details = "Connection reset by peer"
15:30:24 	debug_error_string = "{"created":"@1659990624.372101270","description":"Error received from peer ipv6:[::1]:40577","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Connection reset by peer","grpc_status":14}"
15:30:24 >
15:30:24     for elements in elements_iterator:
15:30:24   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python38/src/build/gradleenv/2022703442/lib/python3.8/site-packages/grpc/_channel.py", line 426, in __next__
15:30:24     raise self
15:30:24 grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
15:30:24 	status = StatusCode.UNAVAILABLE
15:30:24 	details = "Socket closed"
15:30:24 	debug_error_string = "{"created":"@1659990624.372044384","description":"Error received from peer ipv6:[::1]:43851","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Socket closed","grpc_status":14}"

Issue Priority

Priority: 2

Issue Component

Component: test-failures

yeandy avatar Aug 09 '22 13:08 yeandy

I'm not sure that's the root cause of the timeout, I've seen errors like that before, they typically are the root cause of an actual test failure. In this case it looks like we've been flirting with the three hour timeout for a while, e.g. this run from last week took 2h55m: https://ci-beam.apache.org/job/beam_PostCommit_Python37/5552/

I'd suggest we first double the timeout for these PostCommits. Then we can come back and try to root cause the increase in runtime.

TheNeuralBit avatar Aug 09 '22 15:08 TheNeuralBit

Yeah I've seen these before too, but it was never clear the culprits. That makes sense. Do you mean to double timeout for ExerciseMetricsPipelineTest tests, and not all Python postcommits?

yeandy avatar Aug 09 '22 16:08 yeandy

No I mean for all Python PostCommits (your link is correct), since that's the timeout we're hitting. Per @kennknowles - timeouts are for catching runaway jobs, not for punishing us when the jobs start taking too long. The timeout shouldn't be that close to the actual runtime of the job.

TheNeuralBit avatar Aug 09 '22 16:08 TheNeuralBit

As tests taking longer and longer, we may also want to split some IO tests from the generic postcommit as the ongoing work in java...

Abacn avatar Aug 10 '22 17:08 Abacn

Yeah I agree that we should do something to address the long-running PostCommits after this is resolved. We've had some discussions about splitting up the PostCommit like Java is doing.

TheNeuralBit avatar Aug 10 '22 20:08 TheNeuralBit