
The PostRelease Nightly Snapshot job is flaky

Open github-actions[bot] opened this issue 1 year ago • 12 comments

The PostRelease Nightly Snapshot job is failing over 50% of the time. Please visit https://github.com/apache/beam/actions/workflows/beam_PostRelease_NightlySnapshot.yml?query=is%3Afailure+branch%3Amaster to see the logs.

github-actions[bot] avatar Mar 05 '24 18:03 github-actions[bot]

Related to #30447

shunping avatar Mar 11 '24 20:03 shunping

Still failing:

Container image gcr.io/cloud-dataflow/v1beta3/beam_java8_sdk:beam-master-20240306 not downloaded yet.

It is strange that the container resolves to "beam_java8_sdk:beam-master-20240306". It picks the label for the legacy runner but actually tries to pull the runner v2 image. This is likely because Dataflow switched to runner v2 by default in Beam 2.55.0+.

https://github.com/apache/beam/blob/ef919e2603fcd6bffde2a15961d1f186448520a9/runners/google-cloud-dataflow-java/build.gradle#L54-L55

Filed #30634

Abacn avatar Mar 14 '24 14:03 Abacn

https://github.com/apache/beam/actions/runs/8619063045

java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://bigquery.googleapis.com/bigquery/v2/projects/apache-beam-testing/datasets/beam_postrelease_mobile_gaming/tables/leaderboard_DataflowRunner_team/insertAll?prettyPrint=false
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Not found: Table apache-beam-testing:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
    "reason" : "notFound"
  } ],
  "message" : "Not found: Table apache-beam-testing:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
  "status" : "NOT_FOUND"
}

liferoad avatar Apr 11 '24 13:04 liferoad

Looks much better. Close this now.

liferoad avatar Apr 13 '24 18:04 liferoad

Currently there is flakiness because artifact downloads from the Maven snapshot repository are not retried. This is a Maven tooling limitation, but we could probably build first (with retry) so that the artifacts get cached in the local Maven repository.
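
A minimal sketch of the "build with retry" idea, assuming the flaky step can be wrapped in a callable. The `retry` helper, its parameters, and the backoff values are hypothetical and not part of Beam or Maven.

```python
import time


def retry(action, attempts=3, base_delay_s=1.0, sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` calls have failed.

    Waits base_delay_s, then 2x, 4x, ... between attempts (exponential
    backoff). `sleep` is injectable so the helper can be tested without
    real delays. All names here are illustrative assumptions.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
```

In a workflow, `action` could wrap a `subprocess.run(..., check=True)` call to the build command, so that a failed snapshot download raises and triggers another attempt while artifacts already fetched stay in the local cache.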

Abacn avatar May 21 '24 17:05 Abacn

@shunping please check this when you have time.

liferoad avatar May 21 '24 17:05 liferoad

Related to the Maven snapshot issue: I wonder if we could use Artifact Registry's ability to store Java packages (https://cloud.google.com/artifact-registry/docs/java/store-java) instead of relying on Maven Central.

damondouglas avatar Jun 06 '24 20:06 damondouglas


[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project word-count-beam: An exception occured while executing the Java class. java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
[ERROR] POST https://bigquery.googleapis.com/bigquery/v2/projects/apache-beam-testing/datasets/beam_postrelease_mobile_gaming/tables/leaderboard_DirectRunner_team/insertAll?prettyPrint=false
[ERROR] {
[ERROR]   "code" : 404,
[ERROR]   "errors" : [ {
[ERROR]     "domain" : "global",
[ERROR]     "message" : "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DirectRunner_team",
[ERROR]     "reason" : "notFound"
[ERROR]   } ],
[ERROR]   "message" : "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DirectRunner_team",
[ERROR]   "status" : "NOT_FOUND"
[ERROR] }
[ERROR] -> [Help 1]
[ERROR]


liferoad avatar Jun 08 '24 21:06 liferoad

Can we just add the retry to this task?

liferoad avatar Jun 08 '24 22:06 liferoad

Looking at some of the recent failures, it seems like the Java command was just crashing?

https://github.com/apache/beam/actions/runs/9537373049/job/26285395593 https://ge.apache.org/s/pmba6vnub3yz4

"Process 'command '/opt/hostedtoolcache/Java_Temurin-Hotspot_jdk/8.0.412-8/x64/bin/java'' finished with non-zero exit value 1"

chamikaramj avatar Jun 20 '24 17:06 chamikaramj

I also see the 404 error from BQ mentioned above in other failed runs, so it seems there are at least two failure modes.
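
One way to tell the two failure modes apart when triaging runs is to match the log text against the messages quoted in this thread. The sketch below is illustrative only: the `classify` helper and the mode names are made up for this example, and the patterns cover just the two messages seen so far.

```python
import re

# Patterns taken from the error messages quoted in this thread.
# The mode names on the left are hypothetical labels.
FAILURE_PATTERNS = [
    ("bigquery_table_not_found", re.compile(r"Not found: [Tt]able")),
    ("java_process_crash", re.compile(r"finished with non-zero exit value \d+")),
]


def classify(log_text):
    """Return the names of all known failure modes found in log_text."""
    return [name for name, pattern in FAILURE_PATTERNS if pattern.search(log_text)]
```

A run whose log matches neither pattern returns an empty list, which would flag it as a failure mode not yet covered here.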

chamikaramj avatar Jun 20 '24 17:06 chamikaramj

I wonder if the Java failure was due to an OOM. Can we increase the memory available to the VMs running these tests?

chamikaramj avatar Jun 20 '24 17:06 chamikaramj

Trying this with #31749

damccorm avatar Jul 02 '24 16:07 damccorm

Reopening since the workflow is still flaky

github-actions[bot] avatar Oct 22 '24 21:10 github-actions[bot]

Green now.

liferoad avatar Oct 29 '24 16:10 liferoad

Opening again because the workflow has been broken for the past few days.

Since #33555 got in, it's been failing with the following:

Caused by: java.lang.NoClassDefFoundError: org/apache/beam/vendor/grpc/v1p60p1/io/grpc/Channel

I wonder if this is due to SDK snapshots failing (see #32161). Optimizing snapshots may fix this error.

Behind it is another error, however (seen in workflow runs Jan 22-24):

{
  "code": 404,
  "errors": [
    {
      "domain": "global",
      "message": "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
      "reason": "notFound"
    }
  ],
  "message": "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
  "status": "NOT_FOUND"
}

Should we increase the timeout again? Similar to this change: https://github.com/apache/beam/pull/30949/files
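
If the failure comes from the table not being visible yet, a longer timeout amounts to polling for longer. A rough sketch of polling with a deadline follows. `wait_until` and its parameters are hypothetical, and this is not the code from the linked PR.

```python
import time


def wait_until(predicate, timeout_s, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True on success and False on timeout. `clock` and `sleep`
    are injectable so the helper can be tested without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Here `predicate` would be something like a `table_exists()` check (an assumed helper, not shown) that the test calls before writing to or reading from the leaderboard table.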

ahmedabu98 avatar Jan 27 '25 17:01 ahmedabu98

I am going to fix the mobilegaming Groovy scripts.

Amar3tto avatar Jan 28 '25 19:01 Amar3tto

Seems like this is still flaky after recent fixes unfortunately.

chamikaramj avatar Feb 03 '25 23:02 chamikaramj

Hi @Amar3tto, do you have any insight into what caused the MobileGaming test to be flaky before? As I read it, the fix mostly adds retries, which could make the test less flaky, but if a recent change caused it to start failing, the root cause hasn't been resolved.

Abacn avatar Feb 04 '25 02:02 Abacn

> Hi @Amar3tto, do you have any insight into what caused the MobileGaming test to be flaky before? As I read it, the fix mostly adds retries, which could make the test less flaky, but if a recent change caused it to start failing, the root cause hasn't been resolved.

I tried changing the Groovy scripts, also tried explicitly creating the table, but it didn't help. I think there might be a problem with the threads, since the table name doesn't contain any unique ID (leaderboard_DataflowRunner_team). I'm going to investigate further.
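
To illustrate the unique-ID point: a sketch of deriving a per-run table name so that concurrent or back-to-back runs do not share `leaderboard_DataflowRunner_team`. The `unique_table_name` helper is hypothetical and is not how the Groovy scripts currently build names.

```python
import uuid


def unique_table_name(base, run_id=None):
    """Append a per-run suffix to a table name.

    If no run_id is given, a short random hex suffix is used. Hex digits
    and underscores keep the result within the letters, digits, and
    underscores that BigQuery table names allow.
    """
    suffix = run_id if run_id is not None else uuid.uuid4().hex[:8]
    return f"{base}_{suffix}"
```

For example, `unique_table_name("leaderboard_DataflowRunner_team", run_id="9537373049")` yields `leaderboard_DataflowRunner_team_9537373049`. Passing the workflow run number as `run_id` would also make it easier to trace a table back to the run that created it.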

Amar3tto avatar Feb 11 '25 15:02 Amar3tto

Reopening since the workflow is still flaky

github-actions[bot] avatar Mar 24 '25 18:03 github-actions[bot]

Culprit: #33086

Amar3tto avatar Mar 26 '25 07:03 Amar3tto

Fixed by #34447

Amar3tto avatar Mar 29 '25 15:03 Amar3tto

Reopening since the workflow is still flaky

github-actions[bot] avatar May 18 '25 21:05 github-actions[bot]

Reopening since the workflow is still flaky

github-actions[bot] avatar Jun 02 '25 21:06 github-actions[bot]

Currently failing due to a connection issue with the Maven snapshot repository: https://issues.apache.org/jira/browse/INFRA-26230?filter=-2

Abacn avatar Jun 03 '25 20:06 Abacn

Job ID: 2025-06-04_11_15_55-15227603184381661966

java.lang.NoSuchFieldError: java8
    at org.apache.beam.runners.dataflow.worker.util.MemoryMonitor.<init>(MemoryMonitor.java:318)
    at org.apache.beam.runners.dataflow.worker.util.MemoryMonitor.fromOptions(MemoryMonitor.java:268)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.fromOptions(StreamingDataflowWorker.java:444)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.main(StreamingDataflowWorker.java:811)

@Abacn Do you have any thoughts on this error?

Amar3tto avatar Jun 05 '25 12:06 Amar3tto

We need to release a new beam-master container that uses the latest nightly. Let me do that.

Abacn avatar Jun 05 '25 13:06 Abacn

The same error here: #35194

Job ID: 2025-06-04_11_15_55-15227603184381661966

java.lang.NoSuchFieldError: java8
    at org.apache.beam.runners.dataflow.worker.util.MemoryMonitor.<init>(MemoryMonitor.java:318)
    at org.apache.beam.runners.dataflow.worker.util.MemoryMonitor.fromOptions(MemoryMonitor.java:268)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.fromOptions(StreamingDataflowWorker.java:444)
    at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.main(StreamingDataflowWorker.java:811)

@Abacn Do you have any thoughts on this error?

Amar3tto avatar Jun 07 '25 04:06 Amar3tto