[SUPPORT] hoodie.combine.before.insert silently broken for bulk_insert if meta fields disabled (causes duplicates)
**Describe the problem you faced**

`hoodie.combine.before.insert` works with `bulk_insert` if the meta fields are enabled, but silently does nothing and causes duplicates if they are disabled (i.e. `"hoodie.populate.meta.fields": "false"`).
**To Reproduce**

I provided a trivial reproduction below (`hoodie.populate.meta.fields` seems to be the only option that matters for whether the bug happens):
```python
# Generate dummy data: two rows with the same record key
from pyspark.sql import Row

input_data = [
    Row(id=4, value="foo", ts=0),
    Row(id=4, value="bar", ts=1),
]
df = spark.createDataFrame(input_data)

# Example Hudi configs
hudi_options = {
    "hoodie.table.name": "fake_name",
    "hoodie.datasource.write.table.name": "fake_name",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.metadata.enable": "false",
    "hoodie.bootstrap.index.enable": "false",
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Disabling meta fields is what triggers the bug
    "hoodie.populate.meta.fields": "false",
    # Testing out bulk insert
    "hoodie.combine.before.insert": "true",
    "hoodie.datasource.write.operation": "bulk_insert",
}

PATH = "hdfs:///example_github"
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(PATH)

# Should be 1 but prints 2
print(spark.read.format("hudi").load(PATH).count())
# Both rows exist
print(spark.read.format("hudi").load(PATH).collect())
```
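For contrast, re-running the same write with meta fields enabled deduplicates as described; a minimal sketch reusing `df`, `hudi_options`, and `PATH` from above:

```python
# Sketch: identical bulk_insert, but with meta fields left enabled
options_with_meta = {**hudi_options, "hoodie.populate.meta.fields": "true"}
df.write.format("hudi").options(**options_with_meta).mode("overwrite").save(PATH)

# Prints 1: hoodie.combine.before.insert deduplicates correctly here
print(spark.read.format("hudi").load(PATH).count())
```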
**Expected behavior**

The following is surprising:

- `bulk_insert` deduplicates properly if `hoodie.populate.meta.fields` is enabled
- `bulk_insert` does not deduplicate if `hoodie.populate.meta.fields` is disabled
- `insert` deduplicates properly regardless of `hoodie.populate.meta.fields` (see the sketch after this list)

I think the user expectation is that `bulk_insert` has consistent/documented behavior regardless of `hoodie.populate.meta.fields`. Ideally we would like `hoodie.combine.before.insert` to work.
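The third bullet can be checked by swapping only the write operation; a minimal sketch reusing the configs above:

```python
# Sketch: with operation=insert, dedup works even with meta fields disabled
insert_options = {**hudi_options, "hoodie.datasource.write.operation": "insert"}
df.write.format("hudi").options(**insert_options).mode("overwrite").save(PATH)

# Prints 1, as expected
print(spark.read.format("hudi").load(PATH).count())
```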
**Environment Description**

This runs on EMR 6.10.1 (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-6101-release.html):

- Hudi version: 0.12.2
- Spark version: 3.3.1
- Hive version: 3.1.3
- Hadoop version: 3.3.3
- Storage (HDFS/S3/GCS..): HDFS/S3 both seem affected
- Running on Docker? (yes/no): Yes
**Additional context**

We would love to upgrade to a newer version of Hudi, but there are serious blocking bugs with key generators that are still open:

- https://github.com/apache/hudi/issues/10508 ("The ComplexKeyGenerator does not produce the same result for 0.14.1 than previous versions.")
- https://github.com/apache/hudi/issues/8372 (CustomKeyGenerator does not work with the delete partitions operation)

Is there a way to work around this on our current version of Hudi?
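One possible client-side workaround (my own sketch, not confirmed anywhere in this thread) is to deduplicate in Spark before writing, keeping the row with the largest precombine value per record key, which is what `hoodie.combine.before.insert` would otherwise do:

```python
# Hedged workaround sketch: pre-deduplicate before bulk_insert.
# RECORD_KEY / PRECOMBINE mirror the recordkey/precombine fields configured above.
from pyspark.sql import Window
from pyspark.sql import functions as F

RECORD_KEY = "id"
PRECOMBINE = "ts"

# Keep only the row with the latest precombine value per record key
w = Window.partitionBy(RECORD_KEY).orderBy(F.col(PRECOMBINE).desc())
deduped = (
    df.withColumn("_rn", F.row_number().over(w))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)

deduped.write.format("hudi").options(**hudi_options).mode("overwrite").save(PATH)
```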
@mzheng-plaid or @ad1happy2go, have you checked whether the same behavior is still present in 0.14.x? Just want to quickly isolate the problem.

@mzheng-plaid, as for the other blocking issues you mentioned, 0.15.0 is on the way with these fixes. Please bear with us.
@mzheng-plaid This issue still exists in 0.14.1. I confirmed that it only occurs for `bulk_insert` with `hoodie.populate.meta.fields` set to false (as mentioned by @mzheng-plaid).
Created a Jira ticket for tracking too: https://issues.apache.org/jira/browse/HUDI-7717
@mzheng-plaid @ad1happy2go This issue can be closed. It is fixed in the master branch: 7fc5adad7aa9787e961c36536a08622f62fabe49
Thanks a lot @geserdugarov