The PostCommit TransformService Direct job is flaky

Open github-actions[bot] opened this issue 1 year ago • 6 comments

The PostCommit TransformService Direct job is failing over 50% of the time. Please visit https://github.com/apache/beam/actions/workflows/beam_PostCommit_TransformService_Direct.yml?query=is%3Afailure+branch%3Amaster to see the logs.

github-actions[bot] avatar Apr 13 '24 15:04 github-actions[bot]

https://github.com/apache/beam/pull/30816 breaks the build here.

#10 43.81 INFO: pip is looking at multiple versions of apache-beam[dataframe,gcp] to determine which version is compatible with other requirements. This could take a while.
#10 43.81 ERROR: Cannot install apache-beam[dataframe,gcp]==2.56.0.dev0 because these package versions have conflicting dependencies.
#10 43.81 
#10 43.81 The conflict is caused by:
#10 43.81     apache-beam[dataframe,gcp] 2.56.0.dev0 depends on google-auth-httplib2<0.2.0 and >=0.1.0; extra == "gcp"
#10 43.81     The user requested (constraint) google-auth-httplib2==0.2.0

we have 'google-auth-httplib2>=0.1.0,<0.2.0' in https://github.com/apache/beam/blob/master/sdks/python/setup.py#L445 .

liferoad avatar Apr 13 '24 18:04 liferoad
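
To illustrate the conflict in the pip output above, here is a minimal sketch of why the pinned constraint cannot satisfy the declared range. The helpers `parse` and `satisfies` are hypothetical names introduced only for this example; pip itself applies full PEP 440 rules, which this simplified comparison does not cover.

```python
# Hypothetical helpers for illustration only; pip's resolver implements
# full PEP 440 semantics, which this numeric-only comparison does not.

def parse(version: str) -> tuple[int, ...]:
    """Turn '0.2.0' into (0, 2, 0); handles purely numeric releases only."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, lower: str, upper: str) -> bool:
    """Return True if lower <= version < upper."""
    return parse(lower) <= parse(version) < parse(upper)


# Range declared for the gcp extra in setup.py: >=0.1.0,<0.2.0
# Version pinned by the constraint:             ==0.2.0
print(satisfies("0.2.0", "0.1.0", "0.2.0"))  # False: the pin is outside the range
print(satisfies("0.1.1", "0.1.0", "0.2.0"))  # True
```

Because the upper bound is exclusive, a constraint of exactly 0.2.0 can never be satisfied until either the range or the pin changes.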

Reopening since the workflow is still flaky

github-actions[bot] avatar Aug 18 '24 15:08 github-actions[bot]

Reopening since the workflow is still flaky

github-actions[bot] avatar Aug 21 '24 09:08 github-actions[bot]

I am a beginner and do not know much, and I want to learn.

On Wednesday, 21 August 2024 at 13:04, github-actions[bot] < @.***> wrote:

Reopened #30960 https://github.com/apache/beam/issues/30960.

shahine44 avatar Aug 21 '24 09:08 shahine44

A random bigtableio_it_test test fails in each run. It looks like the tests hit a race condition when running in parallel on the same machine (port occupied?).

Abacn avatar Aug 21 '24 15:08 Abacn
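
If the "port occupied" guess above is right, one common way to reduce collisions between parallel tests is to let the OS assign a port rather than hard-coding one. The following is a sketch under that assumption, not what the Beam test suite actually does; `find_free_port` and `is_port_in_use` are hypothetical helper names.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


port = find_free_port()
print(f"OS-assigned port: {port}")
```

Note that this only narrows the race window: another process can still grab the port between `find_free_port` returning and the service binding to it, so the service should still handle a bind failure.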

Reopening since the workflow is still flaky

github-actions[bot] avatar Aug 28 '24 09:08 github-actions[bot]

Reopening since the workflow is still flaky

github-actions[bot] avatar Mar 20 '25 03:03 github-actions[bot]

Reopening since the workflow is still flaky

github-actions[bot] avatar Sep 30 '25 15:09 github-actions[bot]

Stabilized

Amar3tto avatar Oct 01 '25 05:10 Amar3tto

Reopening since the workflow is still flaky

github-actions[bot] avatar Oct 04 '25 21:10 github-actions[bot]

Python 3.13 tests failing due to _InactiveRpcError:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Application error processing RPC"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Application error processing RPC", grpc_status:2}"
>
self = <apache_beam.io.gcp.bigtableio_it_test.TestWriteToBigtableXlangIT testMethod=test_set_mutation>
...
>     self.run_pipeline([row1, row2])
>       raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E       grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
E       	status = StatusCode.UNKNOWN
E       	details = "Application error processing RPC"
E       	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Application error processing RPC", grpc_status:2}"
E       >

../../build/gradleenv/1922375555/lib/python3.13/site-packages/grpc/_channel.py:996: _InactiveRpcError

https://github.com/apache/beam/runs/52803653779

Abacn avatar Oct 15 '25 15:10 Abacn

Breaking since Oct 3, likely after switching to Python 3.13 (#35056). cc: @jrmccluskey @tvalentyn, what could be causing the grpc error in Python 3.13 alone?

Abacn avatar Oct 15 '25 15:10 Abacn

There's a bug in grpc 1.66.0+ that can cause timeouts. Only Python 3.13 uses a version beyond this, out of necessity (prior releases do not support 3.13). #36525 includes experiments that are supposed to mitigate this problem and drops them into our Dockerfile, which will hopefully make the 3.13 tests more stable.

jrmccluskey avatar Oct 15 '25 15:10 jrmccluskey

I tested locally (command: pytest -v -s apache_beam/io/gcp/bigtableio_it_test.py::TestWriteToBigtableXlangIT::test_set_mutation --test-pipeline-options="--runner=TestDirectRunner --project=apache-beam-testing"). I get the same error when my machine is missing Docker. Likely this is simply because the transform service did not start up successfully.

There is no hint of what happened in the GitHub Action because the output of the transform service has been redirected:

https://github.com/apache/beam/blob/d687f4fe8170b6eb4c82e02419702d5a20eb456e/sdks/python/scripts/run_transform_service.sh#L79

Abacn avatar Oct 15 '25 15:10 Abacn

This is most likely an infra issue (tests passed on the 3.9 variant but not on 3.13), so I am moving it off release blocker; however, we still need to fix the test as a PostCommit. To investigate, one may need to remove the stdout/stderr redirect, or upload the log file at the end of the workflow, to see what happened to the transform service. @Amar3tto @aIbrahiim

Abacn avatar Oct 15 '25 16:10 Abacn
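
As a sketch of the second option mentioned above (keep the redirect but surface the log when startup fails), the snippet below runs a command, writes its combined stdout/stderr to a file, and echoes the last lines if the command exits with a non-zero status. This is an illustration in Python, not a change to run_transform_service.sh; `run_with_log` and its parameters are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def run_with_log(cmd: list[str], log_path: Path, tail_lines: int = 20) -> int:
    """Run cmd with stdout and stderr redirected to log_path.

    On a non-zero exit code, print the last tail_lines lines of the log to
    stderr so the failure is visible in the CI console.
    """
    with log_path.open("w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)

    if result.returncode != 0:
        tail = log_path.read_text().splitlines()[-tail_lines:]
        print(f"Command exited with code {result.returncode}; last log lines:",
              file=sys.stderr)
        for line in tail:
            print(line, file=sys.stderr)
    return result.returncode
```

The log file is still kept on disk in both cases, so it can also be uploaded as a workflow artifact for later inspection.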

I haven't fully reproduced it locally, but a local run shows the following error:

RuntimeError: The grpc package installed is at version 1.65.5, but the generated code in org/apache/beam/model/pipeline/v1/standard_window_fns_pb2_grpc.py depends on grpcio>=1.71.0. Please upgrade your grpc module to grpcio>=1.71.0 or downgrade your generated code using grpcio-tools<=1.65.5.
Resolve mutations for :sdks:python:test-suites:direct:xlang:fnApiJobServerCleanup (Thread[#194,Execution worker Thread 21,5,main]) started.
:sdks:python:test-suites:direct:xlang:fnApiJobServerCleanup (Thread[#194,Execution worker Thread 21,5,main]) started.

What I suspect is that the test compiles the grpc code under Python 3.9 but then runs Beam on Python 3.13, causing a similar conflict. This explains why Python 3.12 used to work but the test started failing with Python 3.13.

@aIbrahiim @jrmccluskey can we partly revert the TransformService test change in #35056 so that it runs on Python 3.9 and 3.12, until all Python versions can share a working grpc?

Abacn avatar Nov 04 '25 21:11 Abacn
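
The mismatch in the RuntimeError above can be checked up front. The sketch below compares an installed grpcio version against the minimum that the generated code requires. The helper names are hypothetical and the version parsing is a deliberate simplification (it keeps only the first three numeric components), so treat it as an illustration rather than the check grpc itself performs.

```python
import re
from importlib import metadata
from typing import Optional


def parse(version: str) -> tuple[int, ...]:
    """Keep the first three numeric components, e.g. '1.71.0rc1' -> (1, 71, 0)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_compatible(installed: str, required_min: str) -> bool:
    """Return True if the installed version meets the required minimum."""
    return parse(installed) >= parse(required_min)


# Versions taken from the error message above.
print(is_compatible("1.65.5", "1.71.0"))  # False: the runtime is older than the generated code expects

current = installed_version("grpcio")
print(f"grpcio in this environment: {current}")
```

Running such a check in the same interpreter that executes the test would show whether the generated `*_pb2_grpc.py` files and the runtime grpcio come from different environments.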

I see that as a fault with the test. We can't necessarily control when different Python versions will require different dependency versions (and unless the Python foundation changes the release cadence, we will be trying to support 4-5 Python versions simultaneously for the foreseeable future), but we can make sure that our tests match what users get when they install Beam at a specific version.

In short, I'd prefer that we have the 3.13 version of the test re-build grpc with python 3.13

jrmccluskey avatar Nov 04 '25 22:11 jrmccluskey