Jing Zhang
@xuzifu666 Thanks for the contribution. This PR adds exception throwing in each write handler. Would it be possible to instead check all write statuses for failures during the commit phase?
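Loosely, the alternative being suggested could look something like the sketch below (not the PR's actual code): collect the `WriteStatus` results from the handlers and fail before committing if any of them recorded errors. `validateBeforeCommit` is a hypothetical helper name.

```scala
import org.apache.hudi.client.WriteStatus
import scala.collection.JavaConverters._

// Hypothetical helper: scan the WriteStatus objects collected from the write
// handlers and abort the commit if any of them recorded per-record errors.
def validateBeforeCommit(writeStatuses: java.util.List[WriteStatus]): Unit = {
  val failed = writeStatuses.asScala.filter(_.hasErrors)
  if (failed.nonEmpty) {
    throw new RuntimeException(
      s"${failed.size} write status(es) reported errors; aborting commit")
  }
}
```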
@hudi-bot run azure
@zyclove Data duplication is caused by records with the same primary key value being written into different file groups. It seems the first commit used the simple bucket index, because the file...
@zyclove Could you try updating the value of `HoodieIndexConfig.INDEX_TYPE.key` to `BUCKET`? I have upgraded 150+ Spark SQL jobs internally that write to HUDI tables from version 0.10 to version 0.14,...
@zyclove I guess the previous writer jobs used the simple bucket index and the latest writer jobs did not. That leads to data duplication, because records with the same primary key value are...
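For reference, a minimal sketch of a Spark DataFrame write with the bucket index enabled; the table name, key fields, bucket count, and path below are placeholders, and `df` is assumed to be an existing DataFrame.

```scala
// Sketch only: enable the bucket index on the writer side so file groups are
// resolved by hashing the record key, matching the layout of earlier commits.
df.write.format("hudi")
  .option("hoodie.table.name", "my_table")                  // placeholder table name
  .option("hoodie.datasource.write.recordkey.field", "id")  // placeholder key field
  .option("hoodie.datasource.write.precombine.field", "ts") // placeholder precombine field
  // HoodieIndexConfig.INDEX_TYPE.key resolves to "hoodie.index.type"
  .option("hoodie.index.type", "BUCKET")
  // the bucket index also needs a hash field and bucket count (example values)
  .option("hoodie.bucket.index.hash.field", "id")
  .option("hoodie.bucket.index.num.buckets", "16")
  .mode("append")
  .save("/tmp/hudi/my_table")                               // placeholder path
```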
@xuzifu666 @codope Please help me confirm whether my analysis of this issue is correct. Would it be better for `FileIndex#sizeInBytes` to return `Long.MAX` instead of 0 if the `FileIndex` has not done...
Already found the root cause: the query job does not set the extensions to `HoodieSparkSessionExtension`, so `HoodiePruneFileSourcePartitions` does not take effect. BTW, should we use an overestimated size rather than 0...
@jonvex @vinothchandar Thanks a lot for the attention. I am closing the issue because I found that the root cause of broadcasting a large HUDI relation is that those query jobs do not set...
I currently solve the problem by setting the extensions to `HoodieSparkSessionExtension` for jobs that not only write to a HUDI table but also read from a HUDI table. Otherwise,...
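A minimal sketch of that fix, assuming a plain SparkSession builder (the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Register the Hudi extension on read-only jobs too, so analysis rules such as
// HoodiePruneFileSourcePartitions are applied to HUDI relations.
val spark = SparkSession.builder()
  .appName("hudi-read-job") // placeholder app name
  .config("spark.sql.extensions", "org.apache.hudi.HoodieSparkSessionExtension")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
```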
@jonvex @vinothchandar BTW, should we use an overestimated size rather than 0 in `HoodieFileIndex#sizeInBytes` for those query jobs that forget to set `HoodieSparkSessionExtension`, to avoid broadcasting a very large HUDI table, like...
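To make the suggestion concrete, here is an illustrative, self-contained sketch of the idea only (not Hudi's actual `HoodieFileIndex` code): report an overestimate while the index has not listed files yet, so Spark's auto broadcast threshold is never satisfied by accident.

```scala
// Standalone sketch: `listedFileSize` is None until file listing has actually
// happened (a hypothetical stand-in for the index's cached state).
object SizeEstimateSketch {
  def sizeInBytes(listedFileSize: Option[Long]): Long =
    listedFileSize.getOrElse(Long.MaxValue) // overestimate instead of 0

  def main(args: Array[String]): Unit = {
    println(sizeInBytes(None))           // Long.MaxValue -> relation never broadcast
    println(sizeInBytes(Some(123456L)))  // actual size once files are listed
  }
}
```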