
[SUPPORT] Failed to use Bloom filter Indexing

Open · Gatsby-Lee opened this issue 1 year ago · 11 comments

Describe the problem you faced

  • Hudi 0.14.1
  • Metadata table enabled + bloom filter indexing enabled
  • After setting "hoodie.bloom.index.use.metadata=true" so that the bloom filter index in the metadata table is used, writes started failing.

To Reproduce

Steps to reproduce the behavior:

  1. Enable Metadata Table + Bloom Filter Indexing
  2. Enable "hoodie.bloom.index.use.metadata=true"
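Concretely, the combination of options involved looks roughly like this (a sketch; the option names are taken from the issue text, while the table name value is a placeholder, not from the issue):

```python
# Sketch of the Hudi writer options under discussion.
# "example_table" is a placeholder value, not from the issue.
hudi_options = {
    "hoodie.table.name": "example_table",                 # placeholder
    "hoodie.metadata.enable": "true",                     # metadata table enabled
    "hoodie.index.type": "BLOOM",
    "hoodie.metadata.index.bloom.filter.enable": "true",  # step 1: build bloom filter index in the metadata table
    "hoodie.bloom.index.use.metadata": "true",            # step 2: the option that triggers the failure
}
```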

Expected behavior

Once "hoodie.bloom.index.use.metadata=true" is enabled, writes should work without errors.

Environment Description

  • Hudi version : 0.14.1-amzn-0 ( EMR on EKS )

  • Spark version : Spark 3.5.0-amzn-1 ( EMR 7.1 )


Stacktrace

2024-08-15 01:25:00	
	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:279)
	at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$4(TorrentBroadcast.scala:365)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
	at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:367)
	at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:161)
	at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:99)
	at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:38)
	at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:78)
	at org.apache.spark.SparkContext.broadcastInternal(SparkContext.scala:1670)
	at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1652)
	at org.apache.spark.api.java.JavaSparkContext.broadcast(JavaSparkContext.scala:546)
	at org.apache.hudi.index.bloom.SparkHoodieBloomIndexHelper.findMatchingFilesForRecordKeys(SparkHoodieBloomIndexHelper.java:113)
	at org.apache.hudi.index.bloom.HoodieBloomIndex.lookupIndex(HoodieBloomIndex.java:135)
	at org.apache.hudi.index.bloom.HoodieBloomIndex.tagLocation(HoodieBloomIndex.java:91)
	at org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:59)
	at org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:41)
	at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:59)
	... 95 more
Caused by: java.lang.IllegalArgumentException: Unable to create serializer "com.esotericsoftware.kryo.serializers.FieldSerializer" for class: java.util.concurrent.locks.ReentrantReadWriteLock
	at com.esotericsoftware.kryo.factories.ReflectionSerializerFactory.makeSerializer(ReflectionSerializerFactory.java:65)
	at com.esotericsoftware.kryo.factories.ReflectionSerializerFactory.makeSerializer(ReflectionSerializerFactory.java:43)
	at com.esotericsoftware.kryo.Kryo.newDefaultSerializer(Kryo.java:396)
	at com.twitter.chill.KryoBase.newDefaultSerializer(KryoBase.scala:62)
	at com.esotericsoftware.kryo.Kryo.getDefaultSerializer(Kryo.java:380)
	at com.esotericsoftware.kryo.util.DefaultClassResolver.registerImplicit(DefaultClassResolver.java:74)
	at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:508)
	at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:97)
	at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:540)
	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:75)
	... 116 more
Caused by: java.lang.reflect.InvocationTargetException
	at jdk.internal.reflect.GeneratedConstructorAccessor112.newInstance(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
	at com.esotericsoftware.kryo.factories.ReflectionSerializerFactory.makeSerializer(ReflectionSerializerFactory.java:51)
	... 125 more
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make field private final java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock java.util.concurrent.locks.ReentrantReadWriteLock.readerLock accessible: module java.base does not "opens java.util.concurrent.locks" to unnamed module @75db5df9
	at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
	at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
	at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:178)
	at java.base/java.lang.reflect.Field.setAccessible(Field.java:172)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.buildValidFields(FieldSerializer.java:283)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.rebuildCachedFields(FieldSerializer.java:216)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.rebuildCachedFields(FieldSerializer.java:157)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.<init>(FieldSerializer.java:150)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.<init>(FieldSerializer.java:134)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.<init>(FieldSerializer.java:130)
	... 130 more

Gatsby-Lee avatar Aug 15 '24 23:08 Gatsby-Lee

@Gatsby-Lee This may be related to the Java version. Which Java version is it using?

ad1happy2go avatar Aug 17 '24 11:08 ad1happy2go

  • EMR 7.1

Checked here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-710-release.html. It's Java 17.

Also checked that Hudi 0.14.1 doesn't support Java 17 ( https://hudi.apache.org/roadmap/#execution-engine-integration ), and found the related Jira task: https://issues.apache.org/jira/browse/HUDI-6506

I think you're right. It's a Java-version-related issue.
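For reference, the root `InaccessibleObjectException` in the stack trace is the standard JDK 17 strong-encapsulation error. One commonly suggested workaround (a sketch only, not verified on EMR and not endorsed in this thread) is to open the offending package to unnamed modules via Spark JVM options; `my_job.py` is a placeholder:

```shell
# Illustrative only: open java.util.concurrent.locks so Kryo's FieldSerializer
# can reflect into ReentrantReadWriteLock fields on JDK 17.
# "my_job.py" is a placeholder for the actual application.
spark-submit \
  --conf "spark.driver.extraJavaOptions=--add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED" \
  --conf "spark.executor.extraJavaOptions=--add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED" \
  my_job.py
```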

FYI @CTTY Would you happen to know how to fix this issue in EMR 7.1? Maybe using Java 11?

Gatsby-Lee avatar Aug 19 '24 09:08 Gatsby-Lee

EMR 7.1 uses Java 17 by default, but older Java versions should still be available there. I think you can try Java 8/11 in this case: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/configuring-java8.html#configuring-java8-override-spark
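The linked AWS page does this with an EMR configuration classification along these lines (a sketch following the documented pattern; the exact `JAVA_HOME` path is an example and varies by EMR image):

```json
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "JAVA_HOME": "/usr/lib/jvm/java-11-amazon-corretto.x86_64"
        }
      }
    ]
  }
]
```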

CTTY avatar Aug 19 '24 21:08 CTTY

> EMR 7.1 uses Java 17 by default but older Java version should still exist there. I think you can try Java 8/11 in this case: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/configuring-java8.html#configuring-java8-override-spark

Thank you. I will retry after overriding the Java version.

Gatsby-Lee avatar Aug 19 '24 22:08 Gatsby-Lee

@Gatsby-Lee Please let us know if using Java 11/8 solves this issue, Thanks.

ad1happy2go avatar Aug 22 '24 04:08 ad1happy2go

> @Gatsby-Lee Please let us know if using Java 11/8 solves this issue, Thanks.

Thank you for following up. I will reply back after I test it 👍

Gatsby-Lee avatar Aug 22 '24 06:08 Gatsby-Lee

Finally completed testing and confirmed it doesn't work.

  • EMR 7.2.0 Java17 ( EMR Hudi 0.14.1 ): FAILED
  • EMR 7.2.0 Java11 ( EMR Hudi 0.14.1 ): FAILED
  • EMR 7.2.0 Java8 ( EMR Hudi 0.14.1 ): FAILED
  • EMR 7.15.0 Java8 by default ( OSS Hudi 0.14.1 ): FAILED

So, I can say it is currently broken.

Gatsby-Lee avatar Sep 04 '24 09:09 Gatsby-Lee

@ad1happy2go Do you have any updates on this?

Gatsby-Lee avatar Sep 20 '24 09:09 Gatsby-Lee

@ad1happy2go I am following up on this. Any updates?

Gatsby-Lee avatar Oct 01 '24 23:10 Gatsby-Lee

Sorry for the delay here @Gatsby-Lee. We will prioritise this. Are there any further updates, or are you still blocked here?

ad1happy2go avatar Oct 17 '24 10:10 ad1happy2go

@Gatsby-Lee I tried using the metadata table with hoodie.bloom.index.use.metadata on emr-7.1.0 and didn't hit any issues. Can you try the snippet below, or share the code/configuration you are using?

Code -

pyspark --jars /usr/lib/hudi/hudi-spark-bundle.jar --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# TABLE_NAME and PATH must be set to your table name and storage path.
schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True)
    ]
)

data = [
    Row(1, "a"),
    Row(2, "a"),
    Row(3, "c"),
]


hudi_configs = {
    "hoodie.table.name": TABLE_NAME,
    "hoodie.datasource.write.recordkey.field": "name",
    "hoodie.datasource.write.precombine.field": "id",
    "hoodie.datasource.write.operation":"insert_overwrite_table",
    "hoodie.table.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.index.type" : "BLOOM",
    "hoodie.metadata.index.bloom.filter.enable" : "true",
    "hoodie.bloom.index.use.metadata" : "true"
}

df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

df.write.format("org.apache.hudi").options(**hudi_configs).mode("overwrite").save(PATH)

spark.read.format("hudi").load(PATH).show()

for i in range(0,30):
    df.write.format("org.apache.hudi").options(**hudi_configs).mode("append").save(PATH)

Also, did you get a chance to try OSS 0.15.0? OSS 0.14.1 doesn't officially support Spark 3.5.

ad1happy2go avatar Oct 17 '24 11:10 ad1happy2go

Confirmed by @Gatsby-Lee and closing the issue. Thanks.

ad1happy2go avatar Oct 23 '24 07:10 ad1happy2go