[SUPPORT] Hudi primary key config is case-sensitive
Tips before filing an issue

- Have you gone through our FAQs?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
Creating a table with an upper-case primary key config (`primaryKey = 'ID'` for a column named `id`) causes inserts to fail:

```
8350 [task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager [] - Lost task 0.0 in stage 9.0 (TID 8) (30.221.115.93 executor driver): org.apache.hudi.exception.HoodieKeyException: recordKey value: "null" for field: "ID" cannot be null or empty.
```

With the current behavior, the primary key config must be specified in lower case to match the column name.
To Reproduce
```scala
test("Test primary key case sensitive") {
  withTempDir { tmp =>
    val tableName = generateTableName
    // Create a partitioned table
    spark.sql(
      s"""
         |create table $tableName (
         |  id int,
         |  name string,
         |  price double,
         |  ts long,
         |  dt string
         |) using hudi
         | tblproperties (primaryKey = 'ID'
         | )
         | partitioned by (dt)
         | location '${tmp.getCanonicalPath}'
       """.stripMargin)
    spark.sql(
      s"""
         | insert into $tableName
         | select 1 as id, 'a1' as name, 10 as price, 1000 as ts, '2021-01-05' as dt
       """.stripMargin)
    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(1, "a1", 10.0, 1000, "2021-01-05")
    )
  }
}
```
Expected behavior
The primary key config should be case-insensitive.
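As an illustration of what the expected behavior could look like, here is a minimal, hypothetical sketch of resolving a configured key field against the table schema case-insensitively. The `KeyFieldResolver` object and its `resolveKeyField` method are made up for this example and are not Hudi's actual key generator code.

```scala
object KeyFieldResolver {
  // Prefer an exact match; fall back to a case-insensitive one, so that a
  // config value of "ID" can still resolve against a schema column named "id".
  def resolveKeyField(configured: String, schemaFields: Seq[String]): Option[String] =
    schemaFields.find(_ == configured)
      .orElse(schemaFields.find(_.equalsIgnoreCase(configured)))
}
```

With this lookup, `resolveKeyField("ID", Seq("id", "name"))` would resolve to the schema column `id` instead of failing with a `HoodieKeyException`.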
Environment Description
- Hudi version : latest master branch
- Spark version : 3.2.0
- Hive version :
- Hadoop version :
- Storage (HDFS/S3/GCS..) :
- Running on Docker? (yes/no) :
Stacktrace
```
8350 [task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager [] - Lost task 0.0 in stage 9.0 (TID 8) (30.221.115.93 executor driver): org.apache.hudi.exception.HoodieKeyException: recordKey value: "null" for field: "ID" cannot be null or empty.
	at org.apache.hudi.keygen.KeyGenUtils.getRecordKey(KeyGenUtils.java:205)
	at org.apache.hudi.keygen.SimpleAvroKeyGenerator.getRecordKey(SimpleAvroKeyGenerator.java:50)
	at org.apache.hudi.keygen.SimpleKeyGenerator.getRecordKey(SimpleKeyGenerator.java:64)
	at org.apache.hudi.keygen.BaseKeyGenerator.getKey(BaseKeyGenerator.java:70)
	at org.apache.spark.sql.hudi.command.SqlKeyGenerator.$anonfun$getRecordKey$1(SqlKeyGenerator.scala:79)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.hudi.command.SqlKeyGenerator.getRecordKey(SqlKeyGenerator.scala:79)
	at org.apache.hudi.HoodieCreateRecordUtils$.getHoodieKeyAndMaybeLocationFromAvroRecord(HoodieCreateRecordUtils.scala:206)
	at org.apache.hudi.HoodieCreateRecordUtils$.$anonfun$createHoodieRecordRdd$5(HoodieCreateRecordUtils.scala:133)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:224)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1418)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1482)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1305)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```
Nice findings, what is the release of Hudi?

> Nice findings, what is the release of Hudi?

I tried it on the latest branch.
It would be great if you can file a fix for it.

> It would be great if you can file a fix for it.

Sorry, I'm a bit busy nowadays. It would be great if another contributor could take it over.
@danny0405 @stream2000 why should the primary key be lower case? Shouldn't it be case-sensitive?

> @danny0405 @stream2000 why should the primary key be lower case? Shouldn't it be case-sensitive?

The field names in SQL should be case-insensitive IMO.

> > @danny0405 @stream2000 why should the primary key be lower case? Shouldn't it be case-sensitive?
>
> The field names in SQL should be case-insensitive IMO.

Hmm, I am curious. Wouldn't it be better to make it case-sensitive and give the user the option to normalize the key? (To keep it simple, I think it's better to keep it case-sensitive.)
I kind of think we should follow these criteria:

- If case-insensitivity is enabled and the primary key is defined through SQL, the primary key should also be matched case-insensitively.
- Otherwise, the primary key should be case-sensitive (e.g. when defined from SQL or DataFrame options).
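The criteria above could be sketched roughly as follows. This is a hypothetical helper, not actual Hudi code; `caseInsensitive` would correspond to something like Spark's `spark.sql.caseSensitive = false`, and `fromSql` to whether the key was defined via SQL DDL rather than DataFrame write options.

```scala
object PrimaryKeyMatcher {
  // fromSql: the key was defined through SQL DDL (e.g. tblproperties);
  // caseInsensitive: the engine's resolution mode (e.g. spark.sql.caseSensitive = false).
  // Only keys defined through SQL under case-insensitive resolution are matched
  // ignoring case; everything else requires an exact match.
  def keyMatches(configured: String, schemaField: String,
                 fromSql: Boolean, caseInsensitive: Boolean): Boolean =
    if (fromSql && caseInsensitive) configured.equalsIgnoreCase(schemaField)
    else configured == schemaField
}
```

Under this rule, `keyMatches("ID", "id", fromSql = true, caseInsensitive = true)` succeeds, while a key coming from DataFrame options would still have to match the column casing exactly.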
@stream2000 @Gatsby-Lee I had done this change some time back and even have test cases for it. Do you see this issue with Hudi version 0.15 as well?
https://github.com/apache/hudi/pull/9020
@Gatsby-Lee I don't see this issue because the primary key I use is already normalized to lower case.
> @stream2000 @Gatsby-Lee I had done this change some time back and even have test cases for it. Do you see this issue with Hudi version 0.15 as well?
@ad1happy2go Yes, we can see this issue in 0.15 too, because the above PR only deals with the config key, but not with the config value (which could be in upper case).
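In other words, normalizing only the config key name is not enough; the configured value also needs to be reconciled with the schema. A hypothetical sketch of what normalizing the value could look like (the `RecordKeyConfigNormalizer` object is made up for this example):

```scala
object RecordKeyConfigNormalizer {
  // Normalize the *value* of the record-key config (possibly a comma-separated
  // list such as "ID,TS") to the exact casing used by the table schema.
  // Fields with no schema match are left untouched.
  def normalize(configValue: String, schemaFields: Seq[String]): String =
    configValue.split(",").map(_.trim).map { f =>
      schemaFields.find(_.equalsIgnoreCase(f)).getOrElse(f)
    }.mkString(",")
}
```

For example, `normalize("ID,TS", Seq("id", "ts", "name"))` would rewrite the config value to `"id,ts"` before key generation.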
@stream2000 Yeah, right. I understand now. Thanks.
@stream2000 Created jira for the same to track this improvement - https://issues.apache.org/jira/browse/HUDI-8172