beam The PostRelease Nightly Snapshot job is flaky

The PostRelease Nightly Snapshot is failing over 50% of the time. Please visit https://github.com/apache/beam/actions/workflows/beam_PostRelease_NightlySnapshot.yml?query=is%3Afailure+branch%3Amaster to see the logs.

Mar 05 '24 18:03 github-actions[bot]

Related to #30447

Mar 11 '24 20:03 shunping

Still failing:

Container image gcr.io/cloud-dataflow/v1beta3/beam_java8_sdk:beam-master-20240306 not downloaded yet.

It is strange that the container gets resolved to "beam_java8_sdk:beam-master-20240306". It picks the label for the legacy runner but actually tries to pull the runner v2 image. This is likely because Dataflow switched to runner v2 by default in Beam 2.55.0+.

https://github.com/apache/beam/blob/ef919e2603fcd6bffde2a15961d1f186448520a9/runners/google-cloud-dataflow-java/build.gradle#L54-L55

entered #30634

Mar 14 '24 14:03 Abacn

https://github.com/apache/beam/actions/runs/8619063045

java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://bigquery.googleapis.com/bigquery/v2/projects/apache-beam-testing/datasets/beam_postrelease_mobile_gaming/tables/leaderboard_DataflowRunner_team/insertAll?prettyPrint=false
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Not found: Table apache-beam-testing:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
    "reason" : "notFound"
  } ],
  "message" : "Not found: Table apache-beam-testing:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
  "status" : "NOT_FOUND"
}

Apr 11 '24 13:04 liferoad

Looks much better. Close this now.

Apr 13 '24 18:04 liferoad

Currently there is flakiness because artifact downloads from the Maven snapshot repository are not retried. This is a limitation of the Maven tooling, but we could probably build first (with retry) so that the artifacts get cached in the local Maven repository.
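A minimal sketch of the "build first, with retry" idea, in Python for brevity. The helper name `run_with_retry`, the attempt count, and the backoff values are assumptions for illustration, not something the workflow currently has; the `mvn dependency:go-offline` warm-up in the comment is one possible way to fill the local cache, not a confirmed fix.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, base_delay=5.0, run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with 0 or attempts run out; return the last exit code.

    `run` and `sleep` are injectable so the helper can be exercised without
    spawning processes or actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    returncode = None
    for attempt in range(1, attempts + 1):
        returncode = run(cmd).returncode
        if returncode == 0:
            break
        if attempt < attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return returncode


# Hypothetical warm-up step before the real test run, so that later Maven
# invocations resolve snapshot artifacts from the local repository:
#   run_with_retry(["mvn", "-B", "dependency:go-offline"])
```

Retrying only the dependency-resolution step keeps a genuine test failure from being retried and masked.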

May 21 '24 17:05 Abacn

@shunping please check this when you have time.

May 21 '24 17:05 liferoad

Related to the maven snapshot issue. I wonder if we could use artifact registry's ability to store Java packages https://cloud.google.com/artifact-registry/docs/java/store-java, instead of relying on maven central.

Jun 06 '24 20:06 damondouglas


[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project word-count-beam: An exception occured while executing the Java class. java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
[ERROR] POST https://bigquery.googleapis.com/bigquery/v2/projects/apache-beam-testing/datasets/beam_postrelease_mobile_gaming/tables/leaderboard_DirectRunner_team/insertAll?prettyPrint=false
[ERROR] {
[ERROR]   "code" : 404,
[ERROR]   "errors" : [ {
[ERROR]     "domain" : "global",
[ERROR]     "message" : "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DirectRunner_team",
[ERROR]     "reason" : "notFound"
[ERROR]   } ],
[ERROR]   "message" : "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DirectRunner_team",
[ERROR]   "status" : "NOT_FOUND"
[ERROR] }
[ERROR] -> [Help 1]
[ERROR]

Jun 08 '24 21:06 liferoad

Can we just add the retry to this task?

Jun 08 '24 22:06 liferoad

Looking at some of the recent failures, it seems like the Java command was just crashing?

https://github.com/apache/beam/actions/runs/9537373049/job/26285395593 https://ge.apache.org/s/pmba6vnub3yz4

"Process 'command '/opt/hostedtoolcache/Java_Temurin-Hotspot_jdk/8.0.412-8/x64/bin/java'' finished with non-zero exit value 1"

Jun 20 '24 17:06 chamikaramj

I also see the 404 error from BQ mentioned above in other failed runs, so it seems there are at least two failure modes.
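For triage, the two failure modes could be told apart by scanning the job log for their signatures. This is only a sketch: the patterns are taken from the log excerpts quoted in this thread, while the labels and the `classify_failure` helper are made up for illustration.

```python
import re

# Signatures taken from the log excerpts in this thread; labels are illustrative.
FAILURE_SIGNATURES = [
    ("bigquery_table_not_found", re.compile(r"Not found: (?:table )?Table")),
    ("java_process_crash", re.compile(r"finished with non-zero exit value")),
]


def classify_failure(log_text):
    """Return the labels of all known signatures found in log_text, in definition order."""
    return [label for label, pattern in FAILURE_SIGNATURES if pattern.search(log_text)]
```

A log that matches neither pattern returns an empty list, which would point to a failure mode not yet covered here.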

Jun 20 '24 17:06 chamikaramj

I wonder if the Java failure was due to an OOM. Can we increase the memory available to the VMs running these tests?

Jun 20 '24 17:06 chamikaramj

Trying this with #31749

Jul 02 '24 16:07 damccorm

Reopening since the workflow is still flaky

Oct 22 '24 21:10 github-actions[bot]

Green now.

Oct 29 '24 16:10 liferoad

Opening again because the workflow has been broken for the past few days.

Since #33555 got in, it's been failing with the following:

Caused by: java.lang.NoClassDefFoundError: org/apache/beam/vendor/grpc/v1p60p1/io/grpc/Channel

I wonder if this is due to SDK snapshots failing (see #32161). Optimizing snapshots may fix this error.

Behind it is another error however (seen in workflow runs Jan 22-24):

{
  "code": 404,
  "errors": [
    {
      "domain": "global",
      "message": "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
      "reason": "notFound"
    }
  ],
  "message": "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
  "status": "NOT_FOUND"
}

Should we increase the timeout again? Similar to this change: https://github.com/apache/beam/pull/30949/files

Jan 27 '25 17:01 ahmedabu98

I am going to fix mobilegaming groovy scripts

Jan 28 '25 19:01 Amar3tto

Seems like this is still flaky after recent fixes unfortunately.

Feb 03 '25 23:02 chamikaramj

Hi @Amar3tto, do you have any insight into what caused the MobileGaming test to be flaky before? From what I read, the fix mostly adds retries, which could make the test less flaky, but if a recent change caused it to start failing, the root cause hasn't been resolved.

Feb 04 '25 02:02 Abacn

I tried changing the Groovy scripts, also tried explicitly creating the table, but it didn't help. I think there might be a problem with the threads, since the table name doesn't contain any unique ID (leaderboard_DataflowRunner_team). I'm going to investigate further.
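To illustrate the naming concern: if concurrent runs share `leaderboard_DataflowRunner_team`, one run's cleanup can delete the table another run is still writing to. A sketch of a per-run name follows. The `unique_table_name` helper and the 8-character suffix are hypothetical, and this is written in Python rather than the Groovy the scripts actually use.

```python
import uuid


def unique_table_name(runner, suffix="team", run_id=None):
    """Build a leaderboard table name that carries a per-run identifier.

    run_id can be passed in (for example a CI run number); otherwise a short
    random hex string is generated.
    """
    run_id = run_id or uuid.uuid4().hex[:8]
    return f"leaderboard_{runner}_{suffix}_{run_id}"
```

The result uses only letters, digits, and underscores, so it stays within what BigQuery accepts for table names.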

Feb 11 '25 15:02 Amar3tto

Reopening since the workflow is still flaky

Mar 24 '25 18:03 github-actions[bot]

Culprit: #33086

Mar 26 '25 07:03 Amar3tto

Fixed by #34447

Mar 29 '25 15:03 Amar3tto

Reopening since the workflow is still flaky

May 18 '25 21:05 github-actions[bot]

Reopening since the workflow is still flaky

Jun 02 '25 21:06 github-actions[bot]

Currently failing due to a connection issue with the Maven snapshot repository: https://issues.apache.org/jira/browse/INFRA-26230?filter=-2

Jun 03 '25 20:06 Abacn

Job ID: 2025-06-04_11_15_55-15227603184381661966

java.lang.NoSuchFieldError: java8
    at org.apache.beam.runners.dataflow.worker.util.MemoryMonitor.<init>(MemoryMonitor.java:318)
    at org.apache.beam.runners.dataflow.worker.util.MemoryMonitor.fromOptions(MemoryMonitor.java:268)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.fromOptions(StreamingDataflowWorker.java:444)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.main(StreamingDataflowWorker.java:811)

@Abacn Do you have any thoughts on this error?

Jun 05 '25 12:06 Amar3tto

We need to release a new beam-master container that uses the latest nightly. Let me do that.

Jun 05 '25 13:06 Abacn

The same error here: #35194

Jun 07 '25 04:06 Amar3tto