[SUPPORT] Failed to use Bloom filter Indexing
Describe the problem you faced
- Hudi 0.14.1
- Enabled Metadata Table + Enabled Bloom filter Indexing
- When "hoodie.bloom.index.use.metadata=true" is enabled to use the bloom filter index in the metadata table, writes start failing.
To Reproduce
Steps to reproduce the behavior:
- Enable Metadata Table + Bloom Filter Indexing
- Enable "hoodie.bloom.index.use.metadata=true"
Expected behavior
Once "hoodie.bloom.index.use.metadata=true" is enabled, index lookups should use the bloom filter index in the metadata table without errors.
Environment Description
- Hudi version: 0.14.1-amzn-0 ( EMR on EKS )
- Spark version: 3.5.0-amzn-1 ( EMR 7.1 )
Stacktrace
2024-08-15 01:25:00
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:279)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$4(TorrentBroadcast.scala:365)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:367)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:161)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:99)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:38)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:78)
at org.apache.spark.SparkContext.broadcastInternal(SparkContext.scala:1670)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1652)
at org.apache.spark.api.java.JavaSparkContext.broadcast(JavaSparkContext.scala:546)
at org.apache.hudi.index.bloom.SparkHoodieBloomIndexHelper.findMatchingFilesForRecordKeys(SparkHoodieBloomIndexHelper.java:113)
at org.apache.hudi.index.bloom.HoodieBloomIndex.lookupIndex(HoodieBloomIndex.java:135)
at org.apache.hudi.index.bloom.HoodieBloomIndex.tagLocation(HoodieBloomIndex.java:91)
at org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:59)
at org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:41)
at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:59)
... 95 more
Caused by: java.lang.IllegalArgumentException: Unable to create serializer "com.esotericsoftware.kryo.serializers.FieldSerializer" for class: java.util.concurrent.locks.ReentrantReadWriteLock
at com.esotericsoftware.kryo.factories.ReflectionSerializerFactory.makeSerializer(ReflectionSerializerFactory.java:65)
at com.esotericsoftware.kryo.factories.ReflectionSerializerFactory.makeSerializer(ReflectionSerializerFactory.java:43)
at com.esotericsoftware.kryo.Kryo.newDefaultSerializer(Kryo.java:396)
at com.twitter.chill.KryoBase.newDefaultSerializer(KryoBase.scala:62)
at com.esotericsoftware.kryo.Kryo.getDefaultSerializer(Kryo.java:380)
at com.esotericsoftware.kryo.util.DefaultClassResolver.registerImplicit(DefaultClassResolver.java:74)
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:508)
at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:97)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:540)
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:75)
... 116 more
Caused by: java.lang.reflect.InvocationTargetException
at jdk.internal.reflect.GeneratedConstructorAccessor112.newInstance(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
at com.esotericsoftware.kryo.factories.ReflectionSerializerFactory.makeSerializer(ReflectionSerializerFactory.java:51)
... 125 more
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make field private final java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock java.util.concurrent.locks.ReentrantReadWriteLock.readerLock accessible: module java.base does not "opens java.util.concurrent.locks" to unnamed module @75db5df9
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:178)
at java.base/java.lang.reflect.Field.setAccessible(Field.java:172)
at com.esotericsoftware.kryo.serializers.FieldSerializer.buildValidFields(FieldSerializer.java:283)
at com.esotericsoftware.kryo.serializers.FieldSerializer.rebuildCachedFields(FieldSerializer.java:216)
at com.esotericsoftware.kryo.serializers.FieldSerializer.rebuildCachedFields(FieldSerializer.java:157)
at com.esotericsoftware.kryo.serializers.FieldSerializer.<init>(FieldSerializer.java:150)
at com.esotericsoftware.kryo.serializers.FieldSerializer.<init>(FieldSerializer.java:134)
at com.esotericsoftware.kryo.serializers.FieldSerializer.<init>(FieldSerializer.java:130)
... 130 more
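The innermost `InaccessibleObjectException` says Kryo's `FieldSerializer` needs reflective access to `java.util.concurrent.locks`, which the Java 17 module system blocks by default. One hedged workaround (a sketch, not verified against this exact EMR setup; `your_job.py` is a placeholder) is to open that package to unnamed modules on both the driver and executor JVMs:

```shell
# Workaround sketch for Java 17: open java.util.concurrent.locks to unnamed
# modules so Kryo's FieldSerializer can reflect over ReentrantReadWriteLock.
spark-submit \
  --conf "spark.driver.extraJavaOptions=--add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED" \
  --conf "spark.executor.extraJavaOptions=--add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED" \
  your_job.py
```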
@Gatsby-Lee This may be related to the Java version. Which Java version is it using?
- EMR 7.1: checked here - https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-710-release.html - it's Java 17.
I also checked that Hudi 0.14.1 doesn't support Java 17 ( https://hudi.apache.org/roadmap/#execution-engine-integration ) and found the related Jira task - https://issues.apache.org/jira/browse/HUDI-6506
I think you're right. It's a Java-version-related issue.
FYI @CTTY Would you happen to know how to fix this issue in EMR 7.1? Maybe using Java 11?
EMR 7.1 uses Java 17 by default, but older Java versions should still be available there. I think you can try Java 8/11 in this case: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/configuring-java8.html#configuring-java8-override-spark
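Following the linked AWS doc, Spark's JVM can be redirected through an EMR configuration classification. A sketch of the shape (the Corretto 11 path is an assumption and may differ per EMR image; see the linked doc for the exact form):

```shell
# Sketch: EMR configuration JSON to run Spark on Java 11 instead of the
# default Java 17. The JAVA_HOME path below is an assumption.
cat <<'EOF' > java11-override.json
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "JAVA_HOME": "/usr/lib/jvm/java-11-amazon-corretto.x86_64"
        }
      }
    ]
  }
]
EOF
```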
Thank you. I will retry after overriding the Java version.
@Gatsby-Lee Please let us know if using Java 11/8 solves this issue, Thanks.
Thank you for following up. I will reply back after I test it 👍
Finally completed testing and confirmed it doesn't work.
- EMR 7.2.0 Java17 ( EMR Hudi 0.14.1 ): FAILED
- EMR 7.2.0 Java11 ( EMR Hudi 0.14.1 ): FAILED
- EMR 7.2.0 Java8 ( EMR Hudi 0.14.1 ): FAILED
- EMR 7.15.0 Java8 by default ( OSS Hudi 0.14.1 ): FAILED
So, I can say it is currently broken.
@ad1happy2go Do you have any updates on this?
@ad1happy2go I am following up on this. Any updates?
Sorry for the delay here @Gatsby-Lee . We will prioritise this. Any further update on this or are you still blocked here?
@Gatsby-Lee I tried using the metadata table with hoodie.bloom.index.use.metadata on emr-7.1.0 and didn't get any issues. Can you try the code below, or share the code/configurations you are using?
Code -

```shell
pyspark --jars /usr/lib/hudi/hudi-spark-bundle.jar \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
```

```python
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

TABLE_NAME = "bloom_index_test"              # placeholder table name
PATH = "s3://your-bucket/bloom_index_test"   # placeholder base path

schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
    ]
)
data = [
    Row(1, "a"),
    Row(2, "a"),
    Row(3, "c"),
]
hudi_configs = {
    "hoodie.table.name": TABLE_NAME,
    "hoodie.datasource.write.recordkey.field": "name",
    "hoodie.datasource.write.precombine.field": "id",
    "hoodie.datasource.write.operation": "insert_overwrite_table",
    "hoodie.table.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.index.type": "BLOOM",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    "hoodie.bloom.index.use.metadata": "true",
}
df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.write.format("org.apache.hudi").options(**hudi_configs).mode("overwrite").save(PATH)
spark.read.format("hudi").load(PATH).show()
# Repeated appends exercise the bloom index lookup on each write
for i in range(0, 30):
    df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)
```
Also, did you get a chance to try the OSS 0.15.0 version? OSS 0.14.1 doesn't officially support Spark 3.5.
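For reference, one way to try OSS Hudi 0.15.0 instead of the EMR-bundled jar is to pull the Spark 3.5 bundle from Maven (a sketch; the artifact coordinate below assumes a Scala 2.12 build):

```shell
# Launch pyspark against the OSS Hudi 0.15.0 Spark 3.5 bundle from Maven
# Central, bypassing the EMR-patched jar under /usr/lib/hudi.
pyspark \
  --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0 \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
```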
Confirmed by @Gatsby-Lee and closing the issue. Thanks.