Manu

Results 28 comments of Manu

set "hoodie.storage.layout.partitioner.class" = "org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner", and try again?

Hi @xushiyan @yihua @15663671003, I created a PR to add a default partitioner for the SIMPLE BUCKET index; please have a look.

> @xicm we have to shade the HBase classes to be compatible with Hive query engine which introduces HBase classes as well. Does changing all relevant class names with shading...

This page https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-considerations.html provides a workaround, but the problem in the Spark bundle still exists.

Adding a partition field means more tasks. And since the index is BUCKET, the number of tasks could be bucket_num * partitions in some cases.
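To make the arithmetic concrete, a back-of-the-envelope sketch with illustrative numbers (the bucket and partition counts below are made up, not from this thread):

```scala
// Worst case under a BUCKET index: each partition touched by a write
// can open up to bucket_num file groups, i.e. bucket_num * partitions tasks.
val bucketNum  = 256   // e.g. hoodie.bucket.index.num.buckets
val partitions = 100   // partitions touched by one write
val worstCase  = bucketNum * partitions
println(s"up to $worstCase file groups / tasks per write")  // 25600
```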

Not sure if this is the cause. Can you check the number of file groups after the partition field changed, and reduce the bucket number to see the time cost?

> can you tell me how to check number of filegroup?

Via the CLI or Spark SQL: run `show_commits` and pay attention to `total_files_added` and `total_files_updated`.

> it is still taking 45-50 min to...
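For example, via Hudi's `show_commits` SQL procedure (a sketch; the table name is a placeholder, and the call requires Hudi's Spark SQL extensions to be enabled):

```scala
// Inspect recent commits; the output includes per-commit file stats
// such as total_files_added and total_files_updated.
spark.sql("call show_commits(table => 'my_hudi_table', limit => 10)").show(false)
```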

A small bucket number will not fit growing data. Generally we estimate the data size to determine the number of buckets. I think your problem is that the data is too...
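A rough sketch of that estimate; the 2 GB-per-bucket target below is an illustrative assumption, not an official recommendation, so tune it for your workload:

```scala
// Estimate the bucket count from projected data size, since the bucket
// number is fixed at table creation and cannot grow with the data.
val estimatedSizeGb = 1024.0  // projected size of the table/partition
val targetBucketGb  = 2.0     // assumed comfortable size per bucket
val bucketNum       = math.ceil(estimatedSizeGb / targetBucketGb).toInt
println(s"hoodie.bucket.index.num.buckets = $bucketNum")  // 512
```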

Can you check the *SubTasks* of bucket_assigner in the Flink UI? This tells us how many tasks are in a write operation.