[Failing Test]: dataflow runner worker project test stuck causing Java PreCommit time out
What happened?
It appears that PreCommit timeout happens more frequently recently.
the succeeded job run after the timed out run shows the following tests are added, which suggests the stuck project is runners:google-cloud-dataflow-java:worker:
| Package | Duration | Fail | (diff) | Skip | (diff) | Pass | (diff) | Total | (diff) |
|---|---|---|---|---|---|---|---|---|---|
| org.apache.beam.runners.dataflow.harness.test | 10 sec | 0 | 0 | 9 | +9 | 9 | +9 | ||
| org.apache.beam.runners.dataflow.options | 15 sec | 0 | 0 | 25 | 25 | ||||
| org.apache.beam.runners.dataflow.transforms | 9.9 sec | 0 | 0 | 11 | 11 | ||||
| org.apache.beam.runners.dataflow.util | 1.9 sec | 0 | 0 | 144 | 144 | ||||
| org.apache.beam.runners.dataflow.worker | 6 min 7 sec | 0 | 2 | +2 | 493 | +493 | 495 | +495 | |
| org.apache.beam.runners.dataflow.worker.apiary | 1 ms | 0 | 0 | 6 | +6 | 6 | +6 | ||
| org.apache.beam.runners.dataflow.worker.counters | 12 ms | 0 | 0 | 22 | +22 | 22 | +22 | ||
| org.apache.beam.runners.dataflow.worker.graph | 98 ms | 0 | 0 | 33 | +33 | 33 | +33 | ||
| org.apache.beam.runners.dataflow.worker.logging | 0.24 sec | 0 | 0 | 37 | +37 | 37 | +37 | ||
| org.apache.beam.runners.dataflow.worker.profiler | 3 ms | 0 | 0 | 4 | +4 | 4 | +4 | ||
| org.apache.beam.runners.dataflow.worker.status | 40 sec | 0 | 0 | 7 | +7 | 7 | +7 | ||
| org.apache.beam.runners.dataflow.worker.streaming | 10 sec | 0 | 0 | 18 | +18 | 18 | +18 | ||
| org.apache.beam.runners.dataflow.worker.testing | 2 ms | 0 | 0 | 3 | +3 | 3 | +3 | ||
| org.apache.beam.runners.dataflow.worker.util | 1 min 3 sec | 0 | 1 | +1 | 52 | +52 | 53 | +53 | |
| org.apache.beam.runners.dataflow.worker.util.common | 22 ms | 0 | 0 | 6 | +6 | 6 | +6 | ||
| org.apache.beam.runners.dataflow.worker.util.common.worker | 1.4 sec | 0 | 0 | 84 | +84 | 84 | +84 | ||
| org.apache.beam.runners.dataflow.worker.windmill | 9 ms | 0 | 0 | 9 | +9 | 9 | +9 | ||
| org.apache.beam.runners.dataflow.worker.windmill.grpcclient | 11 sec | 0 | 0 | 13 | +13 | 13 | +13 | ||
| org.apache.beam.runners.dataflow.worker.windmill.state | 5 min 7 sec | 0 | 0 | 127 | +127 | 127 | +127 |
see https://ci-beam.apache.org/view/PostCommit/job/beam_PreCommit_Java_Cron/7455/testReport/
Issue Failure
Failure: Test is flaky
Issue Priority
Priority: 2 (backlog / disabled test but we think the product is healthy)
Issue Components
- [ ] Component: Python SDK
- [X] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
The original stucked test was fixed. Now It appears there is another stuck test:
testOnNewWorkerMetadata_correctlyRemovesStaleWindmillServers (org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest) failed
runners/google-cloud-dataflow-java/worker/build/test-results/test/TEST-org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.xml [took 10m 36s]
org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.waitForWorkerMetadataToBeConsumed(StreamingEngineClientTest.java:349)
at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.testOnNewWorkerMetadata_correctlyRemovesStaleWindmillServers(StreamingEngineClientTest.java:287)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:750)
It was added in #28835
Another occurrance: https://github.com/apache/beam/pull/29723/checks?check_run_id=19571889999
this happens in intermediate frequency (>10%)
Bump to P1 as this is somewhat frequent, and not yet clear if this suggests some regression on the streaming worker
@Abacn Can the test be disabled or sickbayed before it is investigated? This code is not yet used in pipelines.
If it is not used in actual pipeline, its of lower risk and yes could be disabled or sickbayed
update: downgrade to P2 for now
Yea let's disable. This has caused weeks of delay for some major changes.
Currently the test is disabled. Leave this bug open to track the fixing test TODO
StreamingEngineClientTest.testScheduledBudgetRefresh is also flaky: https://github.com/apache/beam/runs/19993891236
also added in #28835
@m-trieu Martin can you look into fixing the test flakiness?
testStreamsStartCorrectly (org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest) failed
org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
at org.apache.beam.runners.dataflow.worker.windmill.client.AbstractWindmillStream.close(AbstractWindmillStream.java)
at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClient.finish(StreamingEngineClient.java:236)
at org.apache.beam.runners.dataflow.worker.windmill.client.grpc.StreamingEngineClientTest.cleanUp(StreamingEngineClientTest.java:172)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
https://github.com/apache/beam/pull/30245/checks?check_run_id=21501494900
will take a look today
have a potential fix https://github.com/apache/beam/pull/30322
Other flaky test:
testLatencyAttributionToQueuedState: https://github.com/apache/beam/runs/22270690743
java.lang.AssertionError: expected:<PT0S> but was:<PT1S>
at org.junit.Assert.assertEquals(Assert.java:146)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testLatencyAttributionToQueuedState(StreamingDataflowWorkerTest.java:3444)
testInvalidateStuckCommits: https://github.com/apache/beam/runs/22276370706
Wanted but not invoked:
forComputation.invalidate(
<ByteString@3e931ad0 size=0 contents="">,
0L
);
-> at org.apache.beam.runners.dataflow.worker.windmill.work.refresh.DispatchedActiveWorkRefresherTest.testInvalidateStuckCommits(DispatchedActiveWorkRefresherTest.java:231)
Actually, there were zero interactions with this mock.
at org.apache.beam.runners.dataflow.worker.windmill.work.refresh.DispatchedActiveWorkRefresherTest.testInvalidateStuckCommits(DispatchedActiveWorkRefresherTest.java:231)