[Bug]: Flaky Spanner Integration test due to "Could not initialize class com.google.spanner.v1.Session$LabelsDefault"
Related Template(s)
Spanner templates
Template Version
N/A
What happened?
The Spanner PR workflow is flaky: the integration test job failed to launch. Example error:
2025-02-13 13:52:51.974 EST
java.lang.NoClassDefFoundError: Could not initialize class com.google.spanner.v1.Session$LabelsDefaultEntryHolder
at com.google.spanner.v1.Session.internalGetLabels(Session.java:147)
at com.google.spanner.v1.Session.getSerializedSize(Session.java:490)
at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:861)
at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:641)
at com.google.spanner.v1.BatchCreateSessionsRequest.getSerializedSize(BatchCreateSessionsRequest.java:232)
at io.grpc.protobuf.lite.ProtoInputStream.available(ProtoInputStream.java:108)
at io.grpc.internal.MessageFramer.getKnownLength(MessageFramer.java:204)
at io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:139)
at io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:66)
at io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
at io.grpc.internal.DelayedStream$6.run(DelayedStream.java:282)
at io.grpc.internal.DelayedStream.drainPendingCalls(DelayedStream.java:182)
at io.grpc.internal.DelayedStream.access$100(DelayedStream.java:44)
at io.grpc.internal.DelayedStream$4.run(DelayedStream.java:148)
...
2025-02-13 13:52:51.978 EST
Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.ExceptionInInitializerError [in thread "grpc-default-executor-0"]
at com.google.spanner.v1.Session$LabelsDefaultEntryHolder.<clinit>(Session.java:132)
Different tests failed for the same reason each time. See #2177 for an example.
Relevant log output
A NoClassDefFoundError should happen every time the same tests are run. I don't understand why this is flaky.
Is there any difference between the way we run the tests in the Java PR workflow and in the Spanner PR workflow?
The Spanner PR continuous workflow has one flaky test, SpannerToSourceDbCustomTransformationIT, which we are working on.
Example: https://github.com/GoogleCloudPlatform/DataflowTemplates/actions/runs/13307357883/job/37161942486 https://github.com/GoogleCloudPlatform/DataflowTemplates/actions/workflows/spanner-pr.yml?query=branch%3Amain
Thanks for the comment.
A NoClassDefFoundError should happen every time the same tests are run. I don't understand why this is flaky.
A NoClassDefFoundError can also occur when a class's static initializer fails. In this case the class is com.google.spanner.v1.Session$LabelsDefaultEntryHolder, which has a static block
static {
defaultEntry = MapEntry.newDefaultInstance(...);
}
that likely failed.
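To illustrate why a failing static initializer can surface as a NoClassDefFoundError, here is a minimal sketch in plain Java. The class names `StaticInitFailureDemo` and `Holder` are hypothetical and have nothing to do with the Spanner classes; the failure is simulated. The first access to the class reports ExceptionInInitializerError, and every later access reports NoClassDefFoundError ("Could not initialize class ..."), which matches the pattern in the log above.

```java
final class StaticInitFailureDemo {

    // Hypothetical holder whose static initializer always fails (simulated failure).
    static final class Holder {
        static final Object DEFAULT_ENTRY = build();

        private static Object build() {
            throw new IllegalStateException("simulated initializer failure");
        }
    }

    /** Touches Holder and returns the simple class name of whatever was thrown. */
    static String touch() {
        try {
            Object unused = Holder.DEFAULT_ENTRY; // triggers class initialization
            return "none";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(touch()); // prints "ExceptionInInitializerError"
        System.out.println(touch()); // prints "NoClassDefFoundError"
    }
}
```

Because the JVM marks the class as permanently failed after the first attempt, only the first failure carries the original cause; later ones just say the class could not be initialized. If the initializer fails only under some timing conditions, that would be consistent with the flakiness seen here.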
Looks like this has been fixed. There are other issues causing problems now, but I think they're being addressed
I have noticed multiple occurrences of this recently. These are documented in b/400992122
Most recent occurrence of this issue - https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/2250/checks?check_run_id=38885011121
I don't think this is related to this project at all. However, this is the only mention of it that I could find on the Internet, so I am reporting it here. We started noticing this recently after bumping our deps to the latest BOM.
Here's the root cause, I think (well, not the root cause, but the original exception in a class initializer that leads to the subsequent NoClassDefFoundErrors).
Could there be a race of some sort in the new Java protobuf v4 library?
Exception in thread "grpc-default-executor-0" java.lang.ExceptionInInitializerError
at com.google.spanner.v1.Session$LabelsDefaultEntryHolder.<clinit>(Session.java:132)
at com.google.spanner.v1.Session.internalGetLabels(Session.java:147)
at com.google.spanner.v1.Session.getSerializedSize(Session.java:490)
at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:860)
at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:640)
at com.google.spanner.v1.BatchCreateSessionsRequest.getSerializedSize(BatchCreateSessionsRequest.java:232)
at io.grpc.protobuf.lite.ProtoInputStream.available(ProtoInputStream.java:108)
at io.grpc.internal.MessageFramer.getKnownLength(MessageFramer.java:204)
at io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:139)
at io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:70)
at io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
at io.grpc.internal.DelayedStream$6.run(DelayedStream.java:282)
at io.grpc.internal.DelayedStream.drainPendingCalls(DelayedStream.java:182)
at io.grpc.internal.DelayedStream.access$100(DelayedStream.java:44)
at io.grpc.internal.DelayedStream$4.run(DelayedStream.java:148)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.NullPointerException: Cannot invoke "com.google.protobuf.DescriptorProtos$FeatureSet.getExtension(com.google.protobuf.ExtensionLite)" because the return value of "com.google.protobuf.Descriptors$FieldDescriptor.getFeatures()" is null
at com.google.protobuf.Descriptors$FieldDescriptor.needsUtf8Check(Descriptors.java:1325)
at com.google.protobuf.MessageReflection$ExtensionBuilderAdapter.getUtf8Validation(MessageReflection.java:1077)
at com.google.protobuf.MessageReflection.mergeFieldFrom(MessageReflection.java:1236)
at com.google.protobuf.GeneratedMessage$ExtendableBuilder.parseUnknownField(GeneratedMessage.java:1632)
at com.google.protobuf.DescriptorProtos$MethodOptions$Builder.mergeFrom(DescriptorProtos.java:36438)
at com.google.protobuf.DescriptorProtos$MethodOptions$Builder.mergeFrom(DescriptorProtos.java:36196)
at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:853)
at com.google.protobuf.DescriptorProtos$MethodDescriptorProto$Builder.mergeFrom(DescriptorProtos.java:20309)
at com.google.protobuf.DescriptorProtos$MethodDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:20805)
at com.google.protobuf.DescriptorProtos$MethodDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:20797)
at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:869)
at com.google.protobuf.DescriptorProtos$ServiceDescriptorProto$Builder.mergeFrom(DescriptorProtos.java:18995)
at com.google.protobuf.DescriptorProtos$ServiceDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:19493)
at com.google.protobuf.DescriptorProtos$ServiceDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:19485)
at com.google.protobuf.CodedInputStream$ArrayDecoder.readMessage(CodedInputStream.java:869)
at com.google.protobuf.DescriptorProtos$FileDescriptorProto$Builder.mergeFrom(DescriptorProtos.java:2599)
at com.google.protobuf.DescriptorProtos$FileDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:4487)
at com.google.protobuf.DescriptorProtos$FileDescriptorProto$1.parsePartialFrom(DescriptorProtos.java:4479)
at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:77)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:97)
at com.google.protobuf.DescriptorProtos$FileDescriptorProto$1.parseFrom(DescriptorProtos.java:4479)
at com.google.protobuf.DescriptorProtos$FileDescriptorProto.parseFrom(DescriptorProtos.java:2052)
at com.google.protobuf.Descriptors$FileDescriptor.internalUpdateFileDescriptor(Descriptors.java:505)
at com.google.spanner.v1.SpannerProto.<clinit>(SpannerProto.java:786)
Also, weirdly enough, we ONLY see this problem when our jobs are running in us-central1. I have no idea how to explain that, but this is true. I can see that in your case @manitgupta it was also us-central1.
Here's the related issue: https://github.com/protocolbuffers/protobuf/issues/20599
We have found a workaround for that, if you guys are interested. Basically it is this class:
import com.google.auto.service.AutoService;
import com.google.spanner.v1.SpannerProto;
import org.apache.beam.sdk.harness.JvmInitializer;

// Registered via AutoService so that the Beam worker harness discovers it at JVM startup.
@AutoService(JvmInitializer.class)
public final class ProtobufWorkaround implements JvmInitializer {
  static {
    // Eagerly initialize the Spanner descriptors while this class loads,
    // before any pipeline threads start using them.
    SpannerProto.getDescriptor();
    // Add more calls to .getDescriptor() for the protobufs in question, if needed.
  }

  @Override
  public void onStartup() {
    System.out.println("Workaround applied");
  }
}
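For code that does not run on a Beam worker, the same idea (initialize the affected classes early, on one thread) could be sketched with `Class.forName`. This is an assumption, not something tested against the Spanner client: the helper `EagerInit.preload` is a hypothetical name, and whether initializing `com.google.spanner.v1.SpannerProto` this way avoids the error depends on the race described in the linked protobuf issue.

```java
final class EagerInit {
    private EagerInit() {}

    /** Initializes each named class now; returns how many were initialized successfully. */
    static int preload(String... classNames) {
        int initialized = 0;
        for (String name : classNames) {
            try {
                // initialize=true runs the class's static initializer on the calling thread.
                Class.forName(name, true, EagerInit.class.getClassLoader());
                initialized++;
            } catch (ClassNotFoundException | LinkageError e) {
                // LinkageError covers ExceptionInInitializerError and NoClassDefFoundError.
                System.err.println("Could not preload " + name + ": " + e);
            }
        }
        return initialized;
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace above; needs the Spanner protos on the classpath.
        System.out.println(preload("com.google.spanner.v1.SpannerProto"));
    }
}
```

Calling `preload` at the very start of `main`, before any gRPC channels or thread pools are created, is what keeps the initialization single-threaded in this sketch.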
In https://github.com/protocolbuffers/protobuf/issues/20599, there is a comment
This should only impact OSS users using old generated code (<26.x) with new runtime (>= 28.x) which uses lazy feature resolution and triggers a race in double-check locking.
Does this mean the Spanner protos are currently generated with an older protoc? If so, would the Spanner team consider bumping the protoc version for their released client?
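For readers unfamiliar with the term in the quoted comment, the following is a generic sketch of double-checked locking for a lazily resolved value. It is not the protobuf implementation; `LazyValue`, `LazyValueDemo` and the "resolved-features" string are hypothetical names used only for illustration. The point is that the field must be `volatile` and must be assigned only after the value is fully built, otherwise another thread can observe a null or half-initialized value, which is the kind of symptom the NullPointerException on `getFeatures()` above suggests.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class LazyValueDemo {
    /** Calls get() from several threads and returns how many times the resolver ran. */
    static int resolveConcurrently(int threadCount) {
        AtomicInteger resolverCalls = new AtomicInteger();
        LazyValue<String> features = new LazyValue<>(() -> {
            resolverCalls.incrementAndGet();
            return "resolved-features";
        });
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(features::get);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return resolverCalls.get();
    }

    public static void main(String[] args) {
        System.out.println(resolveConcurrently(8)); // prints 1
    }
}

final class LazyValue<T> {
    private final Supplier<T> resolver;
    private final Object lock = new Object();
    // volatile ensures a thread that sees a non-null value also sees it fully constructed.
    private volatile T value;

    LazyValue(Supplier<T> resolver) {
        this.resolver = resolver;
    }

    T get() {
        T local = value;              // first check, without the lock
        if (local == null) {
            synchronized (lock) {
                local = value;        // second check, under the lock
                if (local == null) {
                    local = resolver.get();
                    value = local;    // publish only after the value is complete
                }
            }
        }
        return local;
    }
}
```

A variant that drops `volatile`, or that lets readers use a related field before `value` is published, can fail only occasionally and only under particular timing, which would fit an error that shows up as a flaky test.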
That is a very good question. I have no idea why all Cloud libraries have migrated to 4.x protobuf libraries without rebuilding their respective protobufs.
That is just another example of extremely poor management and work coordination related to Java Proto 4.x by Google. First binary incompatibility (the decision that was later reversed), now this. If I were in their shoes, I would have a very hard look in the mirror, but thankfully I don't work there.
However, these mishaps forced me to waste a full workday yesterday to find out what is going on and how to fix that as it started to affect our business, despite this issue being reported over a month ago. I will think twice when I want to update Beam/GCP library versions in the future.
This has been fixed at head by #2261, so I'm going to close this issue. The fix should be released in the latter part of April.
Hello, we have recently updated our dataflow pipelines to Apache Beam 2.64.0 as of a week ago and we are still seeing the same errors reported by @kberezin-nshl causing issues with our pipelines. Based on the release notes for release candidates, this fix should have been released 3 weeks ago. We built and deployed a new flex template to update the pipelines last week. Are there still issues with the template build?
Reopening this given https://github.com/GoogleCloudPlatform/DataflowTemplates/issues/2191#issuecomment-2843226343
Is this still happening? I tried to look at a few recent runs, including the RC validation, and I could not find an occurrence.