Jing Zhang
@xuzifu666 Thanks for the contribution. This PR adds exception throwing in each write handler. Would it be possible to instead check all write statuses for failures during the commit phase?
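Loosely, the alternative being suggested could look something like the sketch below (not the PR's actual code): collect the `WriteStatus` results from the handlers and fail before committing if any of them recorded errors. `validateBeforeCommit` is a hypothetical helper name.

```scala
import org.apache.hudi.client.WriteStatus
import scala.collection.JavaConverters._

// Hypothetical helper: scan the WriteStatus objects collected from the write
// handlers and abort the commit if any of them recorded per-record errors.
def validateBeforeCommit(writeStatuses: java.util.List[WriteStatus]): Unit = {
  val failed = writeStatuses.asScala.filter(_.hasErrors)
  if (failed.nonEmpty) {
    throw new RuntimeException(
      s"${failed.size} write status(es) reported errors; aborting commit")
  }
}
```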
@hudi-bot run azure
@zyclove Data duplication is caused by records with the same primary key value being written into different file groups. It seems the first commit used the simple bucket index, because the file...
@zyclove Could you try updating the value of `HoodieIndexConfig.INDEX_TYPE.key` to `BUCKET`? I have upgraded 150+ Spark SQL jobs internally that write to HUDI tables from version 0.10 to version 0.14,...
@zyclove I guess the previous writer jobs used the simple bucket index and the latest writer jobs did not. That leads to data duplication, because records with the same primary key value are...
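For reference, a minimal sketch of a Spark DataFrame write with the bucket index enabled; the table name, key fields, bucket count, and path below are placeholders, and `df` is assumed to be an existing DataFrame.

```scala
// Sketch only: enable the bucket index on the writer side so file groups are
// resolved by hashing the record key, matching the layout of earlier commits.
df.write.format("hudi")
  .option("hoodie.table.name", "my_table")                  // placeholder table name
  .option("hoodie.datasource.write.recordkey.field", "id")  // placeholder key field
  .option("hoodie.datasource.write.precombine.field", "ts") // placeholder precombine field
  // HoodieIndexConfig.INDEX_TYPE.key resolves to "hoodie.index.type"
  .option("hoodie.index.type", "BUCKET")
  // the bucket index also needs a hash field and bucket count (example values)
  .option("hoodie.bucket.index.hash.field", "id")
  .option("hoodie.bucket.index.num.buckets", "16")
  .mode("append")
  .save("/tmp/hudi/my_table")                               // placeholder path
```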
@xuzifu666 @codope Please help me confirm whether my analysis of this issue is correct. Would it be better for `FileIndex#sizeInBytes` to return `Long.MAX` instead of 0 if the `FileIndex` has not done...
Already found the root cause: the query job does not set the extensions to `HoodieSparkSessionExtension`, so `HoodiePruneFileSourcePartitions` does not take effect. BTW, should we use an overestimated size rather than 0...
@jonvex @vinothchandar Thanks a lot for the attention. I am closing the issue because I found that the root cause of broadcasting a large HUDI relation is that those query jobs do not set...
I currently solve the problem by setting the extensions to `HoodieSparkSessionExtension` for jobs that not only write to a HUDI table but also read from a HUDI table. Otherwise,...
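A minimal sketch of that fix, assuming a plain SparkSession builder (the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Register the Hudi extension on read-only jobs too, so analysis rules such as
// HoodiePruneFileSourcePartitions are applied to HUDI relations.
val spark = SparkSession.builder()
  .appName("hudi-read-job") // placeholder app name
  .config("spark.sql.extensions", "org.apache.hudi.HoodieSparkSessionExtension")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
```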
@jonvex @vinothchandar BTW, should we use an overestimated size rather than 0 in `HoodieFileIndex#sizeInBytes` for those query jobs that forget to set `HoodieSparkSessionExtension`, to avoid broadcasting a very large HUDI table, like...
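To make the suggestion concrete, here is an illustrative, self-contained sketch of the idea only (not Hudi's actual `HoodieFileIndex` code): report an overestimate while the index has not listed files yet, so Spark's auto broadcast threshold is never satisfied by accident.

```scala
// Standalone sketch: `listedFileSize` is None until file listing has actually
// happened (a hypothetical stand-in for the index's cached state).
object SizeEstimateSketch {
  def sizeInBytes(listedFileSize: Option[Long]): Long =
    listedFileSize.getOrElse(Long.MaxValue) // overestimate instead of 0

  def main(args: Array[String]): Unit = {
    println(sizeInBytes(None))           // Long.MaxValue -> relation never broadcast
    println(sizeInBytes(Some(123456L)))  // actual size once files are listed
  }
}
```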