beam icon indicating copy to clipboard operation
beam copied to clipboard

[Failing Test]: dataflow runner worker project test stuck causing Java PreCommit time out

Open Abacn opened this issue 2 years ago • 12 comments

What happened?

It appears that PreCommit timeout happens more frequently recently.

the succeeded job run after the timed out run shows the following tests are added, which suggests the stuck project is runners:google-cloud-dataflow-java:worker:

Package Duration Fail (diff) Skip (diff) Pass (diff) Total (diff)
org.apache.beam.runners.dataflow.harness.test 10 sec 0   0   9 +9 9 +9
org.apache.beam.runners.dataflow.options 15 sec 0   0   25   25  
org.apache.beam.runners.dataflow.transforms 9.9 sec 0   0   11   11  
org.apache.beam.runners.dataflow.util 1.9 sec 0   0   144   144  
org.apache.beam.runners.dataflow.worker 6 min 7 sec 0   2 +2 493 +493 495 +495
org.apache.beam.runners.dataflow.worker.apiary 1 ms 0   0   6 +6 6 +6
org.apache.beam.runners.dataflow.worker.counters 12 ms 0   0   22 +22 22 +22
org.apache.beam.runners.dataflow.worker.graph 98 ms 0   0   33 +33 33 +33
org.apache.beam.runners.dataflow.worker.logging 0.24 sec 0   0   37 +37 37 +37
org.apache.beam.runners.dataflow.worker.profiler 3 ms 0   0   4 +4 4 +4
org.apache.beam.runners.dataflow.worker.status 40 sec 0   0   7 +7 7 +7
org.apache.beam.runners.dataflow.worker.streaming 10 sec 0   0   18 +18 18 +18
org.apache.beam.runners.dataflow.worker.testing 2 ms 0   0   3 +3 3 +3
org.apache.beam.runners.dataflow.worker.util 1 min 3 sec 0   1 +1 52 +52 53 +53
org.apache.beam.runners.dataflow.worker.util.common 22 ms 0   0   6 +6 6 +6
org.apache.beam.runners.dataflow.worker.util.common.worker 1.4 sec 0   0   84 +84 84 +84
org.apache.beam.runners.dataflow.worker.windmill 9 ms 0   0   9 +9 9 +9
org.apache.beam.runners.dataflow.worker.windmill.grpcclient 11 sec 0   0   13 +13 13 +13
org.apache.beam.runners.dataflow.worker.windmill.state 5 min 7 sec 0   0   127 +127 127 +127

see https://ci-beam.apache.org/view/PostCommit/job/beam_PreCommit_Java_Cron/7455/testReport/

Issue Failure

Failure: Test is flaky

Issue Priority

Priority: 2 (backlog / disabled test but we think the product is healthy)

Issue Components

  • [ ] Component: Python SDK
  • [X] Component: Java SDK
  • [ ] Component: Go SDK
  • [ ] Component: Typescript SDK
  • [ ] Component: IO connector
  • [ ] Component: Beam YAML
  • [ ] Component: Beam examples
  • [ ] Component: Beam playground
  • [ ] Component: Beam katas
  • [ ] Component: Website
  • [ ] Component: Spark Runner
  • [ ] Component: Flink Runner
  • [ ] Component: Samza Runner
  • [ ] Component: Twister2 Runner
  • [ ] Component: Hazelcast Jet Runner
  • [ ] Component: Google Cloud Dataflow Runner

Abacn avatar Oct 12 '23 03:10 Abacn

The original stucked test was fixed. Now It appears there is another stuck test:

testOnNewWorkerMetadata_correctlyRemovesStaleWindmillServers (org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest) failed
runners/google-cloud-dataflow-java/worker/build/test-results/test/TEST-org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.xml [took 10m 36s]

org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
	at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.waitForWorkerMetadataToBeConsumed(StreamingEngineClientTest.java:349)
	at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.testOnNewWorkerMetadata_correctlyRemovesStaleWindmillServers(StreamingEngineClientTest.java:287)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:750)

It was added in #28835

Abacn avatar Dec 12 '23 19:12 Abacn

Another occurrance: https://github.com/apache/beam/pull/29723/checks?check_run_id=19571889999

this happens in intermediate frequency (>10%)

Abacn avatar Dec 12 '23 19:12 Abacn

Bump to P1 as this is somewhat frequent, and not yet clear if this suggests some regression on the streaming worker

Abacn avatar Dec 12 '23 19:12 Abacn

@Abacn Can the test be disabled or sickbayed before it is investigated? This code is not yet used in pipelines.

scwhittle avatar Dec 12 '23 20:12 scwhittle

If it is not used in actual pipeline, its of lower risk and yes could be disabled or sickbayed

update: downgrade to P2 for now

Abacn avatar Dec 12 '23 20:12 Abacn

Yea let's disable. This has caused weeks of delay for some major changes.

kennknowles avatar Dec 13 '23 15:12 kennknowles

Currently the test is disabled. Leave this bug open to track the fixing test TODO

Abacn avatar Dec 18 '23 19:12 Abacn

StreamingEngineClientTest.testScheduledBudgetRefresh is also flaky: https://github.com/apache/beam/runs/19993891236

also added in #28835

Abacn avatar Dec 28 '23 18:12 Abacn

@m-trieu Martin can you look into fixing the test flakiness?

scwhittle avatar Jan 03 '24 16:01 scwhittle

testStreamsStartCorrectly (org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest) failed

org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
	at org.apache.beam.runners.dataflow.worker.windmill.client.AbstractWindmillStream.close(AbstractWindmillStream.java)
	at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClient.finish(StreamingEngineClient.java:236)
	at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.cleanUp(StreamingEngineClientTest.java:172)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

https://github.com/apache/beam/pull/30245/checks?check_run_id=21501494900

Abacn avatar Feb 14 '24 15:02 Abacn

will take a look today

m-trieu avatar Feb 14 '24 21:02 m-trieu

have a potential fix https://github.com/apache/beam/pull/30322

m-trieu avatar Feb 15 '24 06:02 m-trieu

Other flaky test:

testLatencyAttributionToQueuedState: https://github.com/apache/beam/runs/22270690743

java.lang.AssertionError: expected:<PT0S> but was:<PT1S>
	at org.junit.Assert.assertEquals(Assert.java:146)
	at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testLatencyAttributionToQueuedState(StreamingDataflowWorkerTest.java:3444)

testInvalidateStuckCommits: https://github.com/apache/beam/runs/22276370706

Wanted but not invoked:
forComputation.invalidate(
    <ByteString@3e931ad0 size=0 contents="">,
    0L
);
-> at org.apache.beam.runners.dataflow.worker.windmill.work.refresh.DispatchedActiveWorkRefresherTest.testInvalidateStuckCommits(DispatchedActiveWorkRefresherTest.java:231)
Actually, there were zero interactions with this mock.

	at org.apache.beam.runners.dataflow.worker.windmill.work.refresh.DispatchedActiveWorkRefresherTest.testInvalidateStuckCommits(DispatchedActiveWorkRefresherTest.java:231)

Abacn avatar Mar 06 '24 22:03 Abacn