
Google API Client Library version 1.23.0 causes runtime problems with Dataflow Java SDK

Open moandcompany opened this issue 8 years ago • 19 comments

The new Google API Client Library, version 1.23.0, appears to cause problems with the Dataflow Java SDK when submitting and/or running jobs.

This appears to affect Dataflow Java SDKs in both major version families (e.g. 1.9.1, 2.0.0, and 2.1.0).

In some cases, these problems manifest as 404 HTTP errors when attempting to upload staging files:

Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.io.IOException: Error executing batch GCS request :userprofile:run
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:322)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:292)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:200)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:63)
(...)

Caused by: java.util.concurrent.ExecutionException: com.google.api.client.http.HttpResponseException: 404 Not Found
Not Found
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:611)
at org.apache.beam.sdk.util.GcsUtil.getObjects(GcsUtil.java:358)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.matchNonGlobs(GcsFileSystem.java:217)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.match(GcsFileSystem.java:86)
(...)

Caused by: com.google.api.client.http.HttpResponseException: 404 Not Found
Not Found
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1070)
at com.google.api.client.googleapis.batch.BatchRequest.execute(BatchRequest.java:241)
at org.apache.beam.sdk.util.GcsUtil$3.call(GcsUtil.java:604)
at org.apache.beam.sdk.util.GcsUtil$3.call(GcsUtil.java:602)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
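
Because the 404 is buried two levels below the top-level IOException, a launcher that only logs the outermost message will hide it. A minimal sketch of walking the cause chain, using only the JDK: the class name CauseChain and the stand-in exceptions in main are hypothetical and are not part of Beam or the Dataflow SDK.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;

/** Hypothetical helper; not part of Beam or the Dataflow SDK. */
final class CauseChain {
    private CauseChain() {}

    /** Returns "ClassName: message" for the throwable and each nested cause, outermost first. */
    static List<String> describe(Throwable t) {
        List<String> out = new ArrayList<>();
        // Cap the depth so an unexpectedly long or cyclic chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            out.add(t.getClass().getName() + ": " + t.getMessage());
        }
        return out;
    }

    /** True if any link in the chain has a message containing the given text. */
    static boolean anyMessageContains(Throwable t, String text) {
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            String message = t.getMessage();
            if (message != null && message.contains(text)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in exceptions that only mimic the shape of the trace above.
        Throwable http = new RuntimeException("404 Not Found");
        Throwable wrapped = new ExecutionException(http);
        Throwable top = new IOException("Error executing batch GCS request", wrapped);

        describe(top).forEach(System.out::println);
        System.out.println("404 in chain: " + anyMessageContains(top, "404 Not Found"));
    }
}
```

In a real pipeline launcher the throwable would come from the catch block around the run call rather than from hand-built exceptions.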

Workaround: Pinning Google API Client Library dependencies to version 1.22.0 appears to avoid this issue

  • com.google.api-client:google-api-client:1.22.0

Gradle Example:

compile ('com.google.api-client:google-api-client:1.22.0') {
    force = true
}

Maven Example:

<dependency>
  <groupId>com.google.api-client</groupId>
  <artifactId>google-api-client</artifactId>
  <version>[1.22.0]</version>
</dependency>
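
To check whether a pin actually took effect, it can help to ask the JVM which jar a class was loaded from. The sketch below uses only JDK calls. The class name ClasspathProbe is hypothetical, and com.google.api.client.googleapis.GoogleUtils is assumed to be a class shipped in google-api-client; substitute any class you want to trace.

```java
import java.net.URL;
import java.security.CodeSource;

/** Hypothetical diagnostic; not part of any Google or Beam library. */
final class ClasspathProbe {
    private ClasspathProbe() {}

    /** Describes where a class was loaded from, or reports that it is missing. */
    static String locate(String className) {
        try {
            // initialize=false: look the class up without running its static initializers.
            Class<?> c = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource source = c.getProtectionDomain().getCodeSource();
            URL where = (source == null) ? null : source.getLocation();
            Package p = c.getPackage();
            // Only populated if the jar manifest declares Implementation-Version.
            String version = (p == null) ? null : p.getImplementationVersion();
            return className + " <- " + (where == null ? "(bootstrap or unknown)" : where)
                + ", manifest version: " + (version == null ? "(not declared)" : version);
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " NOT FOUND (" + e.getClass().getSimpleName() + ")";
        }
    }

    public static void main(String[] args) {
        // Assumed class name; the jar file name in the output usually carries the version.
        System.out.println(locate("com.google.api.client.googleapis.GoogleUtils"));
    }
}
```

Run it with the same classpath as the pipeline. If the printed jar path still ends in 1.23.0, some other dependency is overriding the pin.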

moandcompany avatar Oct 05 '17 19:10 moandcompany

We've had the same problem, except for us it was with the BigQuery API that we were bringing into our project. Removing it fixed it (Beam has a dependency on it anyway).

polleyg avatar Oct 11 '17 01:10 polleyg

We're also experiencing issues during file staging. Before the attempt to upload files is made, we receive this error:

WARNING: Request failed with code 409, performed 0 retries due to IOExceptions, performed 0 retries due to unsuccessful status codes, HTTP framework says request can be retried, (caller responsible for retrying): https://www.googleapis.com/storage/v1/b?predefinedAcl=projectPrivate&predefinedDefaultObjectAcl=projectPrivate&project=<project name omitted>

Accessing the HTTP resource specified will return JSON data, within which there is an error with message Anonymous users does not have storage.buckets.list access to project <project number omitted>.

pheromonez avatar Oct 11 '17 04:10 pheromonez

We had the same issue, and we can confirm that, as @moandcompany suggests, this fixes it:

compile ('com.google.api-client:google-api-client:1.22.0') {
    force = true
}

For the record, our stack trace is pretty similar. We are running a 2.2.0 snapshot version of Apache Beam:

java.io.IOException: Error executing batch GCS request
        at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:603)
        at org.apache.beam.sdk.util.GcsUtil.getObjects(GcsUtil.java:342)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.matchNonGlobs(GcsFileSystem.java:217)
        at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.match(GcsFileSystem.java:86)
        at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:125)
        at org.apache.beam.sdk.io.FileSystems.matchSingleFileSpec(FileSystems.java:190)
        at org.apache.beam.runners.dataflow.util.PackageUtil.alreadyStaged(PackageUtil.java:159)
        at org.apache.beam.runners.dataflow.util.PackageUtil.stagePackageSynchronously(PackageUtil.java:188)
        at org.apache.beam.runners.dataflow.util.PackageUtil.access$000(PackageUtil.java:69)
        at org.apache.beam.runners.dataflow.util.PackageUtil$2.call(PackageUtil.java:176)
        at org.apache.beam.runners.dataflow.util.PackageUtil$2.call(PackageUtil.java:173)
        at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
        at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
        at org.apache.beam.runners.dataflow.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: com.google.api.client.http.HttpResponseException: 404 Not Found
Not Found
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:459)
        at org.apache.beam.sdks.java.extensions.google.cloud.platform.core.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
        at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUtil.java:595)
        ... 16 more

afcastano avatar Oct 11 '17 09:10 afcastano

I got a similar problem. Here's the API response.

Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "(249a6f2653c550b0): The workflow was automatically rejected by the service because it may trigger an identified bug in the SDK.\nBug details: com.google.api-client:google-api-client library version 1.23.0 is not supported..\nContact [email protected] for further help. Please use this identifier in your communication: 67379331.",
    "reason" : "badRequest"
  } ],
  "message" : "(249a6f2653c550b0): The workflow was automatically rejected by the service because it may trigger an identified bug in the SDK.\nBug details: com.google.api-client:google-api-client library version 1.23.0 is not supported..\nContact [email protected] for further help. Please use this identifier in your communication: 67379331.",
  "status" : "INVALID_ARGUMENT"
}
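
If a launcher needs to surface this rejection clearly, the offending artifact and version can be pulled out of the message text. A minimal sketch follows. The class name RejectionMessage is hypothetical, and the pattern is inferred from the single message quoted above, so the service may word other rejections differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; the message format is assumed from one observed response. */
final class RejectionMessage {
    private RejectionMessage() {}

    // Group 1 is the artifact coordinate, group 2 the rejected version.
    private static final Pattern UNSUPPORTED =
        Pattern.compile("Bug details: (\\S+) library version (\\S+) is not supported");

    /** Returns {artifact, version} if the message names an unsupported library. */
    static Optional<String[]> unsupportedLibrary(String message) {
        if (message == null) {
            return Optional.empty();
        }
        Matcher m = UNSUPPORTED.matcher(message);
        return m.find()
            ? Optional.of(new String[] {m.group(1), m.group(2)})
            : Optional.empty();
    }

    public static void main(String[] args) {
        String message = "The workflow was automatically rejected by the service because it may "
            + "trigger an identified bug in the SDK.\nBug details: "
            + "com.google.api-client:google-api-client library version 1.23.0 is not supported..";
        unsupportedLibrary(message).ifPresent(hit ->
            System.out.println("Pin or downgrade " + hit[0] + " (rejected version " + hit[1] + ")"));
    }
}
```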

zinuzoid avatar Oct 16 '17 04:10 zinuzoid

Google added a check that rejects job creation when this issue is detected, to prevent users from starting malformed jobs.

lukecwik avatar Oct 16 '17 16:10 lukecwik

The root cause for the 404's is outlined at https://github.com/google/google-api-java-client/issues/1073. Hilariously, you can't get to the error rejecting the job for bad dependencies until you've cleared up the staging problem (in our case by upgrading to com.google.apis:google-api-services-storage:v1-rev115-1.23.0 ). Is there another problem that's causing the job rejection? We're being forced to 1.23.0 by a bug in another Google API so this puts us between a rock and a hard place because lol @ Java versioning on Maven.

frew avatar Nov 14 '17 00:11 frew

+1 happening to us too. Is there any suggested remedy?

Jdban avatar Dec 04 '17 19:12 Jdban

The Cloud Dataflow team has added a page on Dataflow SDK and Worker Dependencies that identifies the google-api-client 1.22.0 version requirement (Java)

moandcompany avatar Dec 12 '17 01:12 moandcompany

The Cloud Dataflow team has added a page on Dataflow SDK and Worker Dependencies that identifies the google-api-client 1.22.0 version requirement (Java)

That is a useful link, but not really a solution for those of us like @frew who need to use google-api-client 1.23.0 due to a bug in another library

Jdban avatar Dec 12 '17 15:12 Jdban

I also have this issue

sgri avatar Jan 16 '18 12:01 sgri

Any updates? I'm running into this issue.

ghost avatar Jan 29 '18 01:01 ghost

Same here. Apache Beam 2.3.0 with DataflowRunner is hitting the same 404 error. A permanent fix would be ideal.

Thanks.

alan-ma-umg avatar Feb 26 '18 05:02 alan-ma-umg

We encountered this as well. We're on Scio 0.5.5-beta1, and attempting to force the version to 1.22.0 using dependencyOverrides never worked. However, explicitly adding this library with force() did work, i.e.,

"com.google.api-client" % "google-api-client" % "1.22.0" force()

dsquier avatar Mar 09 '18 16:03 dsquier

I have the same problem. Google forces moving off storage@v1. After adding

<dependency>
  <groupId>com.google.apis</groupId>
  <artifactId>google-api-services-storage</artifactId>
  <version>v1-rev115-1.23.0</version>
</dependency>

the runtime error becomes

Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.NoClassDefFoundError: com/google/api/gax/rpc/HeaderProvider

It looks like libraries conflict across Google's infrastructure libraries. Horrible.

gfengster avatar Mar 30 '18 14:03 gfengster

@dsquier omg thank you. I was battling dependencyOverrides for a while and didn't think about force.

andrewcassidy avatar May 02 '18 23:05 andrewcassidy

I was redirected here from Google because I was using the bigquery-client library and the same error appeared. Has anybody found a workaround for this issue? I've tried (without success)

    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-bigquery</artifactId>
      <version>0.21.0-beta</version>
    </dependency>

pabloazurduy avatar May 23 '18 20:05 pabloazurduy

After analyzing my dependencies and checking the error, I was able to fix this by forcing the version of google-api-services-dataflow to v1b3-rev221-1.22.0 (and of course setting google-api-client to version 1.22.0)

Only setting google-api-client to the old version wasn't enough for me since I had the following error thrown:


java.io.IOException: Error executing batch GCS request
        at org.apache.beam.sdk.util.GcsUtil.executeBatches(GcsUt

when trying to compile my dataflow template
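
The google-api-services-* version strings quoted in this thread (v1-rev115-1.23.0, v1b3-rev221-1.22.0) appear to end with the google-api-client version they were built against. Assuming that convention holds, a small check can flag service artifacts that disagree with the pinned client version. The class name ServiceArtifactVersion and the pattern are illustrative, not an official API.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; assumes versions look like "v1-rev115-1.23.0". */
final class ServiceArtifactVersion {
    private ServiceArtifactVersion() {}

    // Captures the trailing x.y.z, taken here to be the client library version.
    private static final Pattern SHAPE =
        Pattern.compile("^v\\w+-rev\\d+-(\\d+\\.\\d+\\.\\d+)$");

    /** Extracts the trailing client version, or empty if the string has another shape. */
    static Optional<String> clientVersion(String artifactVersion) {
        Matcher m = SHAPE.matcher(artifactVersion);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** True only if the artifact's trailing client version equals the pinned one. */
    static boolean matchesPin(String artifactVersion, String pinnedClientVersion) {
        return clientVersion(artifactVersion)
            .map(pinnedClientVersion::equals)
            .orElse(false);
    }

    public static void main(String[] args) {
        String pin = "1.22.0";
        for (String v : new String[] {"v1b3-rev221-1.22.0", "v1-rev115-1.23.0"}) {
            System.out.println(v + " matches pin " + pin + ": " + matchesPin(v, pin));
        }
    }
}
```

Feeding it the versions from a dependency report gives a quick list of service artifacts that still need forcing alongside google-api-client.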

pievis avatar Jun 06 '18 15:06 pievis

For anyone else still seeing issues like this, check out the version numbers here and make sure you aren't importing a conflicting dependency.

vinnybod avatar Jun 26 '18 22:06 vinnybod

Now Beam 2.5.0 depends on google-api-client:1.23.0, see https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies. Is this still an issue?

labianchin avatar Aug 13 '18 11:08 labianchin