[SUPPORT] Write data skew when a COW table writes data to Parquet
When partitioning and writing data by the `tenant` partition field, I see write data skew.
This step takes 9.8 minutes; the task with index 2 accounts for the full 9.8 minutes.
That task writes a Parquet file of 792329527 bytes, much larger than the other files, and writing this larger file takes far more time than the others.
Is there a parameter to tune so that this oversized file is split into smaller ones, letting tasks process files concurrently? It seems like I would need to rewrite the whole table to make file sizes more even.

Current Hudi config:
- hoodie.insert.shuffle.parallelism = 200
- hoodie.upsert.shuffle.parallelism = 200
- hoodie.index.type = BLOOM
- hoodie.parquet.compression.ratio = 0.1
- hoodie.parquet.max.file.size = 125829120
- hoodie.copyonwrite.record.size.estimate = 300
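For reference, here is a minimal sketch of how these options might be passed on a Spark DataFrame write to Hudi. The table name, storage paths, and record key field are hypothetical placeholders, not taken from the issue; only the option values above come from the reporter's config.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-cow-write")
  // Hudi requires Kryo serialization for Spark writes
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// `df` stands in for the data being ingested; the staging path is a placeholder.
val df = spark.read.parquet("/staging/tenant_events")

df.write.format("hudi")
  .option("hoodie.table.name", "tenant_events_cow")                // hypothetical table name
  .option("hoodie.datasource.write.recordkey.field", "id")         // hypothetical record key
  .option("hoodie.datasource.write.partitionpath.field", "tenant") // partition field from the issue
  .option("hoodie.datasource.write.operation", "upsert")
  // configs reported in the issue
  .option("hoodie.insert.shuffle.parallelism", "200")
  .option("hoodie.upsert.shuffle.parallelism", "200")
  .option("hoodie.index.type", "BLOOM")
  .option("hoodie.parquet.compression.ratio", "0.1")
  .option("hoodie.parquet.max.file.size", "125829120")             // ~120 MB target file size
  .option("hoodie.copyonwrite.record.size.estimate", "300")
  .mode(SaveMode.Append)
  .save("/warehouse/tenant_events_cow")                            // hypothetical base path
```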
Environment Description

- Hudi version: 0.11.1
- Spark version: 3.2.2
- Hive version: 3.1.3
- Hadoop version: 3.3.2
Did you try the BUCKET index? It will distribute the keys evenly among buckets, while the BLOOM index will always try to append to the small file groups first.
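For anyone following along, a rough sketch of what the bucket index write options might look like (the bucket index is available since Hudi 0.11). The bucket count and field names below are illustrative assumptions; also note that switching the index type on an already-written table is not a drop-in change and generally requires rewriting the data.

```scala
// Illustrative options for the bucket index; the bucket count is fixed once
// chosen, so it should be sized against expected per-partition data volume.
val bucketIndexOpts = Map(
  "hoodie.index.type" -> "BUCKET",
  "hoodie.bucket.index.num.buckets" -> "16", // assumed value, not from the thread
  "hoodie.bucket.index.hash.field" -> "id"   // defaults to the record key if unset
)

// Reusing the placeholder `df`, table name, and path from the sketch above.
df.write.format("hudi")
  .options(bucketIndexOpts)
  .option("hoodie.table.name", "tenant_events_cow")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "tenant")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("/warehouse/tenant_events_cow")
```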
@danny0405 Thank you for the reply. We have not tried the bucket index yet. Do you mean that in the current situation, where we use the bloom index and don't change the index type, the only way to get evenly distributed files is to reimport the table?
@wkhappy1 Did you try clustering to fix the file sizes on the existing table?
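For context, a rough sketch of inline clustering options that rewrite existing file groups toward a target size. The thresholds below are illustrative assumptions, not recommendations from this thread, and the table/field names are the same placeholders as above.

```scala
// Illustrative inline clustering settings; values are assumptions, tune per table.
val clusteringOpts = Map(
  "hoodie.clustering.inline" -> "true",
  "hoodie.clustering.inline.max.commits" -> "4",                          // plan clustering every 4 commits
  "hoodie.clustering.plan.strategy.small.file.limit" -> "104857600",      // files under ~100 MB are candidates
  "hoodie.clustering.plan.strategy.target.file.max.bytes" -> "125829120", // rewrite toward ~120 MB files
  "hoodie.clustering.plan.strategy.sort.columns" -> "tenant,id"           // optional layout optimization
)

df.write.format("hudi")
  .options(clusteringOpts)
  .option("hoodie.table.name", "tenant_events_cow")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "tenant")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("/warehouse/tenant_events_cow")
```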
@ad1happy2go Sorry, I haven't tried it yet because this code has been running in production for a long time. Are there any considerations or documentation links you can share for switching from the bloom index to the bucket index? Thank you very much.
@ad1happy2go I also have a question: if I don't want to change the index type, is it possible to just rewrite the table, since it's not very large?
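If rewriting the whole table turns out to be the chosen route, one possible approach (my own assumption, not something confirmed in this thread) is to read the existing table back and write it out again with the `insert_overwrite_table` operation, so Hudi re-packs records into files bounded by `hoodie.parquet.max.file.size`:

```scala
// Read the current table, drop Hudi metadata columns, and rewrite it in place.
// `insert_overwrite_table` replaces all existing file groups.
val existing = spark.read.format("hudi").load("/warehouse/tenant_events_cow")
val payload  = existing.drop(existing.columns.filter(_.startsWith("_hoodie_")): _*)

payload.write.format("hudi")
  .option("hoodie.table.name", "tenant_events_cow")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "tenant")
  .option("hoodie.datasource.write.operation", "insert_overwrite_table")
  .option("hoodie.parquet.max.file.size", "125829120")
  .mode(SaveMode.Append)
  .save("/warehouse/tenant_events_cow")
```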