The PostRelease Nightly Snapshot job is flaky
The PostRelease Nightly Snapshot is failing over 50% of the time. Please visit https://github.com/apache/beam/actions/workflows/beam_PostRelease_NightlySnapshot.yml?query=is%3Afailure+branch%3Amaster to see the logs.
Related to #30447
Still failing:
Container image gcr.io/cloud-dataflow/v1beta3/beam_java8_sdk:beam-master-20240306 not downloaded yet.
It is strange that the container gets resolved to "beam_java8_sdk:beam-master-20240306". It picks the label for the legacy runner but actually tries to pull the runner v2 image. This is likely because Dataflow switched to runner v2 by default in Beam 2.55.0+.
https://github.com/apache/beam/blob/ef919e2603fcd6bffde2a15961d1f186448520a9/runners/google-cloud-dataflow-java/build.gradle#L54-L55
entered #30634
https://github.com/apache/beam/actions/runs/8619063045
java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://bigquery.googleapis.com/bigquery/v2/projects/apache-beam-testing/datasets/beam_postrelease_mobile_gaming/tables/leaderboard_DataflowRunner_team/insertAll?prettyPrint=false
{
"code" : 404,
"errors" : [ {
"domain" : "global",
"message" : "Not found: Table apache-beam-testing:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
"reason" : "notFound"
} ],
"message" : "Not found: Table apache-beam-testing:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
"status" : "NOT_FOUND"
}
Looks much better. Close this now.
Currently there is flakiness because artifact downloads from the Maven snapshot repository are not retried. This is Maven tool behavior, but we could probably build first (with retry) so that the artifacts get cached in the local Maven repository.
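To make the "build first, with retry" idea concrete, here is a minimal sketch of a retry wrapper with exponential backoff. It is not taken from the Beam repository: the function name `run_with_retry`, its parameters, and the default delays are all hypothetical, and the `runner` and `sleep` hooks exist only so the logic can be exercised without invoking Maven.

```python
import subprocess
import time
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_with_retry(
    cmd: Sequence[str],
    attempts: int = 3,
    base_delay: float = 5.0,
    runner: Callable[[Sequence[str]], int] = _run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd until it exits with 0; return the number of attempts used.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and raises RuntimeError once all attempts have failed.
    """
    for attempt in range(1, attempts + 1):
        if runner(cmd) == 0:
            return attempt
        if attempt < attempts:
            # Back off before the next try; no sleep after the final failure.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"{list(cmd)!r} failed after {attempts} attempts")
```

A warm-up step could then call something like `run_with_retry(["mvn", "-B", "dependency:go-offline"])` before the actual test run, so that later Maven invocations resolve from the local repository. Whether that goal pulls everything this workflow needs is an assumption that would have to be checked.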
@shunping please check this when you have time.
Related to the Maven snapshot issue: I wonder if we could use Artifact Registry's ability to store Java packages (https://cloud.google.com/artifact-registry/docs/java/store-java) instead of relying on Maven Central.
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project word-count-beam: An exception occured while executing the Java class. java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
[ERROR] POST https://bigquery.googleapis.com/bigquery/v2/projects/apache-beam-testing/datasets/beam_postrelease_mobile_gaming/tables/leaderboard_DirectRunner_team/insertAll?prettyPrint=false
[ERROR] {
[ERROR] "code" : 404,
[ERROR] "errors" : [ {
[ERROR] "domain" : "global",
[ERROR] "message" : "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DirectRunner_team",
[ERROR] "reason" : "notFound"
[ERROR] } ],
[ERROR] "message" : "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DirectRunner_team",
[ERROR] "status" : "NOT_FOUND"
[ERROR] }
[ERROR] -> [Help 1]
[ERROR]
Can we just add the retry to this task?
Looking at some of the recent failures, it seems like the Java command was just crashing?
https://github.com/apache/beam/actions/runs/9537373049/job/26285395593 https://ge.apache.org/s/pmba6vnub3yz4
"Process 'command '/opt/hostedtoolcache/Java_Temurin-Hotspot_jdk/8.0.412-8/x64/bin/java'' finished with non-zero exit value 1"
I also see the 404 error from BQ mentioned above in other failed runs, so it seems there are at least two failure modes.
I wonder if the Java failure was due to an OOM. Can we increase the memory available to the VMs running these tests?
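Since more than one failure mode is in play, a small triage helper can make it easier to see which one a given run hit. The sketch below is illustrative only: the regular expressions are derived from the log excerpts quoted in this thread, while the bucket names and the `classify_failure` function are made up here.

```python
import re
from typing import List

# Patterns come from the log lines quoted in this thread.
# The bucket names on the left are hypothetical labels.
FAILURE_PATTERNS = [
    ("bq_table_deleted", re.compile(r"Not found: table Table is deleted")),
    ("bq_table_not_found", re.compile(r"Not found: Table ")),
    ("java_nonzero_exit", re.compile(r"finished with non-zero exit value \d+")),
]


def classify_failure(log_text: str) -> List[str]:
    """Return the names of all known failure buckets that match the log."""
    return [name for name, pattern in FAILURE_PATTERNS if pattern.search(log_text)]
```

Running this over the saved logs of recent failed runs would give a rough count per bucket. A run that matches none of them is a hint that there is yet another failure mode to look at.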
Trying this with #31749
Reopening since the workflow is still flaky
Green now.
Opening again because the workflow has been broken for the past few days.
Since #33555 got in, it's been failing with the following:
Caused by: java.lang.NoClassDefFoundError: org/apache/beam/vendor/grpc/v1p60p1/io/grpc/Channel
I wonder if this is due to SDK snapshots failing (see #32161). Optimizing snapshots may fix this error.
However, there is another error behind it (seen in workflow runs Jan 22-24):
{
"code": 404,
"errors": [
{
"domain": "global",
"message": "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
"reason": "notFound"
}
],
"message": "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
"status": "NOT_FOUND"
}
Should we increase the timeout again? Similar to here: https://github.com/apache/beam/pull/30949/files
I am going to fix the mobilegaming Groovy scripts.
Seems like this is still flaky after the recent fixes, unfortunately.
Hi @Amar3tto, do you have any insight into what caused the MobileGaming test to be flaky before? As I read it, the fix mostly adds retries, which could make the test less flaky, but if a recent change caused it to start failing, the root cause hasn't been resolved.
I tried changing the Groovy scripts and also tried explicitly creating the table, but neither helped. I think there might be a problem with the threads, since the table name doesn't contain any unique ID (leaderboard_DataflowRunner_team). I'm going to investigate further.
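If the shared table name does turn out to be the problem, one possible mitigation is to append a per-run suffix so that concurrent or back-to-back runs never touch the same table. The sketch below only illustrates that idea and is not necessarily what the eventual fix did: `unique_table_name` and its `run_id` parameter are hypothetical names.

```python
import re
import uuid
from typing import Optional


def unique_table_name(base: str, run_id: Optional[str] = None) -> str:
    """Append a per-run suffix to a table name.

    Uses run_id when given (for example a CI run identifier), otherwise a
    random 8-character hex string. Characters other than letters, digits
    and underscores are replaced with underscores.
    """
    suffix = run_id if run_id is not None else uuid.uuid4().hex[:8]
    safe_suffix = re.sub(r"[^A-Za-z0-9_]", "_", suffix)
    return f"{base}_{safe_suffix}"
```

For example, `unique_table_name("leaderboard_DataflowRunner_team", "9537373049-1")` yields `leaderboard_DataflowRunner_team_9537373049_1`. The trade-off is that per-run tables have to be cleaned up afterwards, otherwise they accumulate in the dataset.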
Reopening since the workflow is still flaky
Culprit: #33086
Fixed by #34447
Reopening since the workflow is still flaky
Currently failing due to a connection issue with the Maven snapshot repository: https://issues.apache.org/jira/browse/INFRA-26230?filter=-2
Job ID: 2025-06-04_11_15_55-15227603184381661966
java.lang.NoSuchFieldError: java8
    at org.apache.beam.runners.dataflow.worker.util.MemoryMonitor.<init>(MemoryMonitor.java:318)
    at org.apache.beam.runners.dataflow.worker.util.MemoryMonitor.fromOptions(MemoryMonitor.java:268)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.fromOptions(StreamingDataflowWorker.java:444)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.main(StreamingDataflowWorker.java:811)
@Abacn Do you have any thoughts on this error?
We need to release a new beam-master container that uses the latest nightly. Let me do that.
The same error here: #35194