[SUPPORT] hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)
**Describe the problem you faced**

`hoodie.combine.before.insert` works with `bulk_insert` if the meta fields are enabled, but silently does nothing and causes duplicates if they are disabled (i.e. `"hoodie.populate.meta.fields": "false"`).
**To Reproduce**

I provided a trivial reproduction below (`hoodie.populate.meta.fields` seems to be the only option that matters for whether the bug happens):
```python
# Generate dummy data: two rows with the same record key
from pyspark.sql import Row

input_data = [
    Row(id=4, value="foo", ts=0),
    Row(id=4, value="bar", ts=1),
]
df = spark.createDataFrame(input_data)

# Example Hudi configs
hudi_options = {
    "hoodie.table.name": "fake_name",
    "hoodie.datasource.write.table.name": "fake_name",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.metadata.enable": "false",
    "hoodie.bootstrap.index.enable": "false",
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Disabling meta fields is what triggers the bug
    "hoodie.populate.meta.fields": "false",
    # Testing out bulk insert
    "hoodie.combine.before.insert": "true",
    "hoodie.datasource.write.operation": "bulk_insert",
}

PATH = "hdfs:///example_github"
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(PATH)

# Should be 1 but prints 2
print(spark.read.format("hudi").load(PATH).count())
# Both rows exist
print(spark.read.format("hudi").load(PATH).collect())
```
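For contrast, re-running the same write with meta fields enabled deduplicates as described; a minimal sketch reusing `df`, `hudi_options`, and `PATH` from above:

```python
# Sketch: identical bulk_insert, but with meta fields left enabled
options_with_meta = {**hudi_options, "hoodie.populate.meta.fields": "true"}
df.write.format("hudi").options(**options_with_meta).mode("overwrite").save(PATH)

# Prints 1: hoodie.combine.before.insert deduplicates correctly here
print(spark.read.format("hudi").load(PATH).count())
```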
**Expected behavior**

The following is surprising:

- `bulk_insert` deduplicates properly if `hoodie.populate.meta.fields` is enabled
- `bulk_insert` does not deduplicate if `hoodie.populate.meta.fields` is disabled
- `insert` deduplicates properly regardless of `hoodie.populate.meta.fields` (see the sketch after this list)

I think the user expectation is that `bulk_insert` has consistent/documented behavior regardless of `hoodie.populate.meta.fields`. Ideally we would like `hoodie.combine.before.insert` to work.
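The third bullet can be checked by swapping only the write operation; a minimal sketch reusing the configs above:

```python
# Sketch: with operation=insert, dedup works even with meta fields disabled
insert_options = {**hudi_options, "hoodie.datasource.write.operation": "insert"}
df.write.format("hudi").options(**insert_options).mode("overwrite").save(PATH)

# Prints 1, as expected
print(spark.read.format("hudi").load(PATH).count())
```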
**Environment Description**

This runs on EMR 6.10.1 (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6101-release.html):

- Hudi version: 0.12.2
- Spark version: 3.3.1
- Hive version: 3.1.3
- Hadoop version: 3.3.3
- Storage (HDFS/S3/GCS..): HDFS/S3 both seem affected
- Running on Docker? (yes/no): Yes
**Additional context**

We would love to upgrade to a newer version of Hudi, but there are serious blocking bugs with key generators that are still open:

- https://github.com/apache/hudi/issues/10508 ("The ComplexKeyGenerator does not produce the same result for 0.14.1 than previous versions.")
- https://github.com/apache/hudi/issues/8372 (CustomKeyGenerator does not work with the delete partitions operation)

Is there a way to work around this on our current version of Hudi?
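One possible client-side workaround (my own sketch, not confirmed anywhere in this thread) is to deduplicate in Spark before writing, keeping the row with the largest precombine value per record key, which is what `hoodie.combine.before.insert` would otherwise do:

```python
# Hedged workaround sketch: pre-deduplicate before bulk_insert.
# RECORD_KEY / PRECOMBINE mirror the recordkey/precombine fields configured above.
from pyspark.sql import Window
from pyspark.sql import functions as F

RECORD_KEY = "id"
PRECOMBINE = "ts"

# Keep only the row with the latest precombine value per record key
w = Window.partitionBy(RECORD_KEY).orderBy(F.col(PRECOMBINE).desc())
deduped = (
    df.withColumn("_rn", F.row_number().over(w))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)

deduped.write.format("hudi").options(**hudi_options).mode("overwrite").save(PATH)
```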
@mzheng-plaid or @ad1happy2go, have you checked whether the same behavior is still present in 0.14.x? Just want to quickly isolate the problem.

@mzheng-plaid, as for the other blocking issues you mentioned, 0.15.0 is on the way with these fixes. Please bear with us.
@mzheng-plaid This issue still exists in 0.14.1. I confirmed that it only occurs for `bulk_insert` with `hoodie.populate.meta.fields` set to false (as mentioned by @mzheng-plaid).
Created a Jira ticket for tracking too: https://issues.apache.org/jira/browse/HUDI-7717
@mzheng-plaid @ad1happy2go This issue can be closed. It is fixed in the master branch: 7fc5adad7aa9787e961c36536a08622f62fabe49
Thanks a lot @geserdugarov