[SUPPORT] Write data skew when a COW table writes data to Parquet
When partitioning and writing data by the `tenant` partition field, I see write data skew.
This step takes 9.8 minutes; the task with index 2 accounts for the full 9.8 minutes.
That task writes a Parquet file of 792329527 bytes, much larger than the other files, and writing this larger file takes far more time than the others.
Is there a parameter to tune so that this oversized file is split into smaller ones, letting tasks process files concurrently? It seems like I would need to rewrite the whole table to make file sizes more even.

Current Hudi config:
- hoodie.insert.shuffle.parallelism = 200
- hoodie.upsert.shuffle.parallelism = 200
- hoodie.index.type = BLOOM
- hoodie.parquet.compression.ratio = 0.1
- hoodie.parquet.max.file.size = 125829120
- hoodie.copyonwrite.record.size.estimate = 300
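For reference, here is a minimal sketch of how these options might be passed on a Spark DataFrame write to Hudi. The table name, storage paths, and record key field are hypothetical placeholders, not taken from the issue; only the option values above come from the reporter's config.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-cow-write")
  // Hudi requires Kryo serialization for Spark writes
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// `df` stands in for the data being ingested; the staging path is a placeholder.
val df = spark.read.parquet("/staging/tenant_events")

df.write.format("hudi")
  .option("hoodie.table.name", "tenant_events_cow")                // hypothetical table name
  .option("hoodie.datasource.write.recordkey.field", "id")         // hypothetical record key
  .option("hoodie.datasource.write.partitionpath.field", "tenant") // partition field from the issue
  .option("hoodie.datasource.write.operation", "upsert")
  // configs reported in the issue
  .option("hoodie.insert.shuffle.parallelism", "200")
  .option("hoodie.upsert.shuffle.parallelism", "200")
  .option("hoodie.index.type", "BLOOM")
  .option("hoodie.parquet.compression.ratio", "0.1")
  .option("hoodie.parquet.max.file.size", "125829120")             // ~120 MB target file size
  .option("hoodie.copyonwrite.record.size.estimate", "300")
  .mode(SaveMode.Append)
  .save("/warehouse/tenant_events_cow")                            // hypothetical base path
```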
Environment Description

- Hudi version: 0.11.1
- Spark version: 3.2.2
- Hive version: 3.1.3
- Hadoop version: 3.3.2
Did you try the BUCKET index? It will distribute the keys evenly among buckets, while the BLOOM index will always try to append to the small file groups first.
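For anyone following along, a rough sketch of what the bucket index write options might look like (the bucket index is available since Hudi 0.11). The bucket count and field names below are illustrative assumptions; also note that switching the index type on an already-written table is not a drop-in change and generally requires rewriting the data.

```scala
// Illustrative options for the bucket index; the bucket count is fixed once
// chosen, so it should be sized against expected per-partition data volume.
val bucketIndexOpts = Map(
  "hoodie.index.type" -> "BUCKET",
  "hoodie.bucket.index.num.buckets" -> "16", // assumed value, not from the thread
  "hoodie.bucket.index.hash.field" -> "id"   // defaults to the record key if unset
)

// Reusing the placeholder `df`, table name, and path from the sketch above.
df.write.format("hudi")
  .options(bucketIndexOpts)
  .option("hoodie.table.name", "tenant_events_cow")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "tenant")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("/warehouse/tenant_events_cow")
```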
@danny0405 Thank you for the reply. We have not tried the bucket index yet. Do you mean that in the current situation, where we use the bloom index and don't change the index type, the only way to get evenly distributed files is to reimport the table?
@wkhappy1 Did you try clustering to fix the file sizes on the existing table?
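For context, a rough sketch of inline clustering options that rewrite existing file groups toward a target size. The thresholds below are illustrative assumptions, not recommendations from this thread, and the table/field names are the same placeholders as above.

```scala
// Illustrative inline clustering settings; values are assumptions, tune per table.
val clusteringOpts = Map(
  "hoodie.clustering.inline" -> "true",
  "hoodie.clustering.inline.max.commits" -> "4",                          // plan clustering every 4 commits
  "hoodie.clustering.plan.strategy.small.file.limit" -> "104857600",      // files under ~100 MB are candidates
  "hoodie.clustering.plan.strategy.target.file.max.bytes" -> "125829120", // rewrite toward ~120 MB files
  "hoodie.clustering.plan.strategy.sort.columns" -> "tenant,id"           // optional layout optimization
)

df.write.format("hudi")
  .options(clusteringOpts)
  .option("hoodie.table.name", "tenant_events_cow")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "tenant")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("/warehouse/tenant_events_cow")
```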
@ad1happy2go Sorry, I haven't tried it yet because this code has been running in production for a long time. Are there any considerations or documentation links you can share for switching from the bloom index to the bucket index? Thank you very much.
@ad1happy2go I also have a question: if I don't want to change the index type, is it possible to just rewrite the table, since it's not very large?
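If rewriting the whole table turns out to be the chosen route, one possible approach (my own assumption, not something confirmed in this thread) is to read the existing table back and write it out again with the `insert_overwrite_table` operation, so Hudi re-packs records into files bounded by `hoodie.parquet.max.file.size`:

```scala
// Read the current table, drop Hudi metadata columns, and rewrite it in place.
// `insert_overwrite_table` replaces all existing file groups.
val existing = spark.read.format("hudi").load("/warehouse/tenant_events_cow")
val payload  = existing.drop(existing.columns.filter(_.startsWith("_hoodie_")): _*)

payload.write.format("hudi")
  .option("hoodie.table.name", "tenant_events_cow")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "tenant")
  .option("hoodie.datasource.write.operation", "insert_overwrite_table")
  .option("hoodie.parquet.max.file.size", "125829120")
  .mode(SaveMode.Append)
  .save("/warehouse/tenant_events_cow")
```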