
[Bug]: Flaky Spanner Integration test due to "Could not initialize class com.google.spanner.v1.Session$LabelsDefault"

Open Abacn opened this issue 10 months ago • 15 comments

Related Template(s)

Spanner templates

Template Version

N/A

What happened?

The Spanner PR workflow is flaky: the integration test job fails at launch. Example error:

2025-02-13 13:52:51.974 EST java.lang.NoClassDefFoundError: Could not initialize class com.google.spanner.v1.Session$LabelsDefaultEntryHolder
2025-02-13 13:52:51.974 EST at com.google.spanner.v1.Session.internalGetLabels(Session.java:147)
2025-02-13 13:52:51.974 EST at com.google.spanner.v1.Session.getSerializedSize(Session.java:490)
2025-02-13 13:52:51.974 EST at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:861)
2025-02-13 13:52:51.974 EST at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:641)
2025-02-13 13:52:51.974 EST at com.google.spanner.v1.BatchCreateSessionsRequest.getSerializedSize(BatchCreateSessionsRequest.java:232)
2025-02-13 13:52:51.974 EST at io.grpc.protobuf.lite.ProtoInputStream.available(ProtoInputStream.java:108)
2025-02-13 13:52:51.974 EST at io.grpc.internal.MessageFramer.getKnownLength(MessageFramer.java:204)
2025-02-13 13:52:51.974 EST at io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:139)
2025-02-13 13:52:51.974 EST at io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:66)
2025-02-13 13:52:51.974 EST at io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
2025-02-13 13:52:51.974 EST at io.grpc.internal.DelayedStream$6.run(DelayedStream.java:282)
2025-02-13 13:52:51.974 EST at io.grpc.internal.DelayedStream.drainPendingCalls(DelayedStream.java:182)
2025-02-13 13:52:51.975 EST at io.grpc.internal.DelayedStream.access$100(DelayedStream.java:44)
2025-02-13 13:52:51.975 EST at io.grpc.internal.DelayedStream$4.run(DelayedStream.java:148)
...
2025-02-13 13:52:51.978 EST Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.ExceptionInInitializerError [in thread "grpc-default-executor-0"]
2025-02-13 13:52:51.978 EST at com.google.spanner.v1.Session$LabelsDefaultEntryHolder.<clinit>(Session.java:132)

Different tests fail each time, but for the same reason. See #2177 for an example.

Relevant log output


Abacn avatar Feb 13 '25 19:02 Abacn

A NoClassDefFoundError should happen every time the same tests are run. I don't understand why this is flaky.

Is there any difference between the way we run the tests in the Java PR workflow and Spanner PR workflow?

The Spanner PR workflow still has one flaky test, SpannerToSourceDbCustomTransformationIT, which we are working on.

Examples:
https://github.com/GoogleCloudPlatform/DataflowTemplates/actions/runs/13307357883/job/37161942486
https://github.com/GoogleCloudPlatform/DataflowTemplates/actions/workflows/spanner-pr.yml?query=branch%3Amain

darshan-sj avatar Feb 14 '25 06:02 darshan-sj

Thanks for the comment.

A NoClassDefFoundError should happen every time the same tests are run. I don't understand why this is flaky.

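A NoClassDefFoundError can also be thrown when a class's static initializer fails: the first failed initialization raises ExceptionInInitializerError, and every later attempt to use the class raises NoClassDefFoundError ("Could not initialize class ..."). In this case the class is com.google.spanner.v1.Session$LabelsDefaultEntryHolder, which has a static block

static {
  defaultEntry = MapEntry.newDefaultInstance(...);
}

that likely failed.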
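The JVM behavior described here can be reproduced in isolation. The following is a minimal sketch in plain Java, not Spanner or protobuf code: the names StaticInitFailureDemo and Holder are hypothetical and only stand in for a generated class such as LabelsDefaultEntryHolder, and the failure is simulated with an exception.

```java
public class StaticInitFailureDemo {

  // Hypothetical stand-in for a generated holder class.
  // Its static initializer always fails (simulated).
  static class Holder {
    static final Object DEFAULT_ENTRY = create();

    private static Object create() {
      throw new IllegalStateException("simulated failure in static initializer");
    }
  }

  // Touches the holder and reports which Throwable type the JVM raised.
  public static String touchHolder() {
    try {
      return String.valueOf(Holder.DEFAULT_ENTRY);
    } catch (Throwable t) {
      return t.getClass().getName();
    }
  }

  public static void main(String[] args) {
    // First access: the initializer runs and fails.
    System.out.println(touchHolder()); // java.lang.ExceptionInInitializerError
    // Later accesses: the class is marked as unusable for this class loader.
    System.out.println(touchHolder()); // java.lang.NoClassDefFoundError
  }
}
```

Only the first failure carries the underlying cause, which is why a log that shows just the NoClassDefFoundError can hide the exception that actually broke initialization.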

Abacn avatar Feb 14 '25 14:02 Abacn

Looks like this has been fixed. There are other issues causing problems now, but I think they're being addressed.

damccorm avatar Mar 05 '25 15:03 damccorm

I have noticed multiple occurrences of this recently. These are documented in b/400992122.

Deep1998 avatar Mar 06 '25 11:03 Deep1998

Most recent occurrence of this issue: https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/2250/checks?check_run_id=38885011121

manitgupta avatar Mar 17 '25 12:03 manitgupta

I don't think this is related to this project at all. However, this is the only mention of it that I could find on the Internet, so I'm reporting it here. We started noticing this recently after bumping our deps to the latest BOM.

Here's the root cause, I think (well, not the root cause, but the original exception in a class initializer that leads to the subsequent NoClassDefFoundError errors).

Could there be a race of some sort in the new Java protobuf v4 library?

Exception in thread "grpc-default-executor-0" java.lang.ExceptionInInitializerError
    at com.google.spanner.v1.Session$LabelsDefaultEntryHolder.<clinit>(Session.java:132)
    at com.google.spanner.v1.Session.internalGetLabels(Session.java:147)
    at com.google.spanner.v1.Session.getSerializedSize(Session.java:490)
    at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:860)
    at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:640)
    at com.google.spanner.v1.BatchCreateSessionsRequest.getSerializedSize(BatchCreateSessionsRequest.java:232)
    at io.grpc.protobuf.lite.ProtoInputStream.available(ProtoInputStream.java:108)
    at io.grpc.internal.MessageFramer.getKnownLength(MessageFramer.java:204)
    at io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:139)
    at io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:70)
    at io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
    at io.grpc.internal.DelayedStream$6.run(DelayedStream.java:282)
    at io.grpc.internal.DelayedStream.drainPendingCalls(DelayedStream.java:182)
    at io.grpc.internal.DelayedStream.access$100(DelayedStream.java:44)
    at io.grpc.internal.DelayedStream$4.run(DelayedStream.java:148)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.NullPointerException: Cannot invoke "com.google.protobuf.DescriptorProtos$FeatureSet.getExtension(com.google.protobuf.ExtensionLite)" because the return value of "com.google.protobuf.Descriptors$FieldDescriptor.getFeatures()" is null
    at com.google.protobuf.Descriptors$FieldDescriptor.needsUtf8Check(Descriptors.java:1325)
    at com.google.protobuf.MessageReflection$ExtensionBuilderAdapter.getUtf8Validation(MessageReflection.java:1077)
    at com.google.protobuf.MessageReflection.mergeFieldFrom(MessageReflection.java:1236)
    at com.google.protobuf.GeneratedMessage$ExtendableBuilder.parseUnknownField(GeneratedMessage.java:1632)
    at com.google.protobuf.DescriptorProtos$MethodOptions$Builder.mergeFrom(DescriptorProtos.java:36438)
    at com.google.protobuf.DescriptorProtos$MethodOptions$Builder.mergeFrom(DescriptorProtos.java:36196)
    at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:853)
    at com.google.protobuf.DescriptorProtos$MethodDescriptorProto$Builder.mergeFrom(DescriptorProtos.java:20309)
    at com.google.protobuf.DescriptorProtos$MethodDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:20805)
    at com.google.protobuf.DescriptorProtos$MethodDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:20797)
    at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:869)
    at com.google.protobuf.DescriptorProtos$ServiceDescriptorProto$Builder.mergeFrom(DescriptorProtos.java:18995)
    at com.google.protobuf.DescriptorProtos$ServiceDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:19493)
    at com.google.protobuf.DescriptorProtos$ServiceDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:19485)
    at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:869)
    at com.google.protobuf.DescriptorProtos$FileDescriptorProto$Builder.mergeFrom(DescriptorProtos.java:2599)
    at com.google.protobuf.DescriptorProtos$FileDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:4487)
    at com.google.protobuf.DescriptorProtos$FileDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:4479)
    at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:77)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:97)
    at com.google.protobuf.DescriptorProtos$FileDescriptorProto$1.parseFrom(DescriptorProtos.java:4479)
    at com.google.protobuf.DescriptorProtos$FileDescriptorProto.parseFrom(DescriptorProtos.java:2052)
    at com.google.protobuf.Descriptors$FileDescriptor.internalUpdateFileDescriptor(Descriptors.java:505)
    at com.google.spanner.v1.SpannerProto.<clinit>(SpannerProto.java:786)

kberezin-nshl avatar Mar 19 '25 11:03 kberezin-nshl

Also, weirdly enough, we ONLY see this problem when our jobs are running in us-central1. I have no idea how to explain that, but this is true. I can see that in your case @manitgupta it was also us-central1.

kberezin-nshl avatar Mar 19 '25 11:03 kberezin-nshl

Here's the related issue: https://github.com/protocolbuffers/protobuf/issues/20599

kberezin-nshl avatar Mar 19 '25 12:03 kberezin-nshl

We have found a workaround for that, if you guys are interested. Basically it is this class:

import com.google.auto.service.AutoService;
import com.google.spanner.v1.SpannerProto;
import org.apache.beam.sdk.harness.JvmInitializer;

/** Loads the Spanner protobuf descriptors as soon as this initializer class is loaded. */
@AutoService(JvmInitializer.class)
public final class ProtobufWorkaround implements JvmInitializer {
  static {
    // Runs once, when this class is initialized.
    SpannerProto.getDescriptor();
    // add more calls to .getDescriptor() for the protobufs in question, if needed
  }

  @Override
  public void onStartup() {
    System.out.println("Workaround applied");
  }
}
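The idea behind this workaround, as I read it, is to force class initialization on one thread before any concurrent use. Here is a minimal sketch of that idea in plain Java, with no Beam or protobuf dependency. EagerInitDemo and HeavyDescriptor are hypothetical names; HeavyDescriptor only stands in for a class like SpannerProto, and INIT_COUNT exists purely to make the single initialization observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class EagerInitDemo {
  // Counts how many times the stand-in class was initialized (illustration only).
  public static final AtomicInteger INIT_COUNT = new AtomicInteger();

  // Hypothetical stand-in for a class with an expensive static initializer.
  static final class HeavyDescriptor {
    static final String VALUE;

    static {
      INIT_COUNT.incrementAndGet();
      VALUE = "descriptor-ready";
    }
  }

  // Forces initialization on the calling thread, before workers start.
  public static void warmUp() {
    try {
      Class.forName(
          HeavyDescriptor.class.getName(), true, EagerInitDemo.class.getClassLoader());
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  // Reads the value from several threads and checks that each one saw it.
  public static boolean allWorkersSawValue(int threads) {
    String[] seen = new String[threads];
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      final int idx = i;
      workers[i] = new Thread(() -> seen[idx] = HeavyDescriptor.VALUE);
      workers[i].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    for (String value : seen) {
      if (!"descriptor-ready".equals(value)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    warmUp();
    System.out.println(allWorkersSawValue(4));
    System.out.println(INIT_COUNT.get());
  }
}
```

The sketch only shows the ordering trick. Whether it avoids the failure in a given pipeline depends on the initializer really running before the first concurrent use of the generated classes.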

kberezin-nshl avatar Mar 19 '25 15:03 kberezin-nshl

In https://github.com/protocolbuffers/protobuf/issues/20599, there is a comment:

This should only impact OSS users using old generated code (<26.x) with new runtime (>= 28.x) which uses lazy feature resolution and triggers a race in double-check locking.

Does this mean the Spanner proto is currently generated with an older protoc? If so, would the Spanner team consider bumping the protoc version for their released client?
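For readers unfamiliar with the pattern the quoted comment names, here is a minimal sketch of double-checked locking for a lazily resolved field, written in plain Java. This is not protobuf's implementation: LazyResolveDemo, Field and Features are hypothetical names chosen to echo the stack trace, and the sketch shows the generally accepted safe form, where the field is volatile and is re-checked under the lock.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LazyResolveDemo {

  // Hypothetical value that is expensive to compute and resolved lazily.
  public static final class Features {
    public final String name;

    Features(String name) {
      this.name = name;
    }
  }

  public static final class Field {
    // volatile is required: without it, another thread may observe a
    // stale or partially published value.
    private volatile Features features;

    public Features getFeatures() {
      Features local = features;      // first check, no lock
      if (local == null) {
        synchronized (this) {
          local = features;           // second check, under the lock
          if (local == null) {
            local = new Features("resolved");
            features = local;         // publish only the finished object
          }
        }
      }
      return local;
    }
  }

  // Calls getFeatures() from several threads and counts distinct instances seen.
  public static int distinctInstancesSeen(int threads) {
    Field field = new Field();
    Set<Features> seen = ConcurrentHashMap.newKeySet();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> seen.add(field.getFeatures()));
      workers[i].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return seen.size();
  }

  public static void main(String[] args) {
    System.out.println(distinctInstancesSeen(8)); // 1
  }
}
```

When this pattern is implemented incorrectly, a reader can see a null or unfinished value only under particular thread timing, which would be consistent with a failure that shows up intermittently rather than on every run.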

Abacn avatar Mar 19 '25 16:03 Abacn

That is a very good question. I have no idea why all Cloud libraries have migrated to 4.x protobuf libraries without rebuilding their respective protobufs.

That is just another example of extremely poor management and work coordination related to Java Proto 4.x by Google. First binary incompatibility (the decision that was later reversed), now this. If I were in their shoes, I would have a very hard look in the mirror, but thankfully I don't work there.

However, these mishaps forced me to waste a full workday yesterday to find out what is going on and how to fix that as it started to affect our business, despite this issue being reported over a month ago. I will think twice when I want to update Beam/GCP library versions in the future.

kberezin-nshl avatar Mar 20 '25 09:03 kberezin-nshl

This has been fixed at head by #2261, so I'm going to close this issue. The fix should be released in the later part of April.

damccorm avatar Mar 27 '25 13:03 damccorm

Hello, we updated our Dataflow pipelines to Apache Beam 2.64.0 a week ago and we are still seeing the same errors reported by @kberezin-nshl, which are causing issues with our pipelines. Based on the release notes for the release candidates, this fix should have been released 3 weeks ago. We built and deployed a new flex template to update the pipelines last week. Are there still issues with the template build?

tabularasa7 avatar Apr 30 '25 20:04 tabularasa7

Reopening this given https://github.com/GoogleCloudPlatform/DataflowTemplates/issues/2191#issuecomment-2843226343

liferoad avatar Jun 23 '25 13:06 liferoad

Is this still happening? I tried to look at a few recent runs, including the RC validation, and I could not find an occurrence.

kennknowles avatar Sep 22 '25 18:09 kennknowles