
[SUPPORT] Write data skew when a COW table writes data to Parquet

Open wkhappy1 opened this issue 1 year ago • 5 comments

In the step "Doing partition and writing data: tenant", I see write data skew. (screenshot 1)

This step takes 9.8 min. (screenshot 2)

The task with index 2 takes 9.8 min. (screenshot 3)

That task writes a Parquet file of 792329527 bytes, which is bigger than the other files. Writing this bigger file takes much more time than writing the others.

Is there a parameter I can tune so that the bigger file becomes smaller and tasks can process the files concurrently? It seems I would need to rewrite the whole table to make the file sizes more even.

Current Hudi config:

  • hoodie.insert.shuffle.parallelism: 200

  • hoodie.upsert.shuffle.parallelism: 200

  • INDEX_TYPE: BLOOM

  • hoodie.parquet.compression.ratio: 0.1

  • hoodie.parquet.max.file.size: 125829120

  • hoodie.copyonwrite.record.size.estimate: 300
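To put the numbers above side by side, here is a small arithmetic sketch. It only uses the values quoted in this issue. The per-file record count assumes Hudi sizes new insert files as roughly the maximum file size divided by the estimated record size; that sizing rule is an assumption here, not something confirmed in this thread.

```python
# Values copied from the issue text above.
MAX_FILE_SIZE_BYTES = 125_829_120   # hoodie.parquet.max.file.size
RECORD_SIZE_ESTIMATE = 300          # hoodie.copyonwrite.record.size.estimate
OBSERVED_FILE_BYTES = 792_329_527   # file written by the slow task

MIB = 1024 * 1024


def size_report(observed: int, limit: int) -> dict:
    """Compare an observed file size with the configured limit."""
    return {
        "limit_mib": limit / MIB,
        "observed_mib": round(observed / MIB, 1),
        "ratio": round(observed / limit, 2),
    }


report = size_report(OBSERVED_FILE_BYTES, MAX_FILE_SIZE_BYTES)

# Assumption: records per new file ~= max file size / estimated record size.
records_per_file = MAX_FILE_SIZE_BYTES // RECORD_SIZE_ESTIMATE

print(report)
print("estimated records per new file:", records_per_file)
```

The skewed file is several times larger than the configured maximum, which suggests the limit is not being enforced for that file group rather than being set too high.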

Environment Description

  • Hudi version :0.11.1

  • Spark version :3.2.2

  • Hive version :3.1.3

  • Hadoop version :3.3.2

wkhappy1 avatar Apr 22 '24 09:04 wkhappy1

Did you try the BUCKET index? It distributes the keys evenly among buckets, while the bloom_filter index always tries to append to the small buckets first.
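As a rough illustration of why hashing record keys into a fixed number of buckets evens out file sizes, here is a minimal sketch. The hash function (MD5), the key format, and the bucket count are all hypothetical choices for the demo and are not Hudi's actual implementation. The option names `hoodie.index.type` and `hoodie.bucket.index.num.buckets` come from the Hudi configuration reference; whether they behave as needed for a COW table on 0.11.1 should be checked against the docs for that version.

```python
import hashlib
from collections import Counter

# Hypothetical writer options for switching to the bucket index.
bucket_index_options = {
    "hoodie.index.type": "BUCKET",
    "hoodie.bucket.index.num.buckets": "8",  # illustrative value only
}

NUM_BUCKETS = int(bucket_index_options["hoodie.bucket.index.num.buckets"])


def bucket_for(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket with a stable hash (demo only)."""
    digest = hashlib.md5(record_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# Synthetic, unique record keys standing in for the table's real keys.
keys = [f"tenant-{i}" for i in range(100_000)]
counts = Counter(bucket_for(key, NUM_BUCKETS) for key in keys)

for bucket, count in sorted(counts.items()):
    print(f"bucket {bucket}: {count} records")
```

Because each bucket receives a similar share of keys, each write task handles a similar amount of data, at the cost of fixing the bucket count up front.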

danny0405 avatar Apr 27 '24 00:04 danny0405

@danny0405 Thank you for your reply. We have not tried the bucket index yet. Do you mean that, in our current setup with the bloom index, if we don't change the index type and want the files to be evenly distributed, our only option right now is to re-import the table?

wkhappy1 avatar Apr 27 '24 02:04 wkhappy1

@wkhappy1 Did you try clustering to fix the file sizes on the existing table?
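A minimal sketch of what enabling inline clustering could look like as writer options. The option names are taken from the Hudi configuration reference; the values are illustrative assumptions, not recommendations. In particular, the small-file limit is set above the ~792 MB file on the assumption that the size-based plan strategy only picks up files below that limit, and `with_clustering` is a hypothetical helper.

```python
MIB = 1024 * 1024

# Writer options already listed in this issue.
existing_options = {
    "hoodie.insert.shuffle.parallelism": "200",
    "hoodie.upsert.shuffle.parallelism": "200",
    "hoodie.parquet.max.file.size": "125829120",
}

# Illustrative clustering settings (values are assumptions).
clustering_options = {
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "1",
    # Target output size, matching the 120 MiB Parquet limit above.
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(120 * MIB),
    # Assumed to need a value above the oversized file so it is selected.
    "hoodie.clustering.plan.strategy.small.file.limit": str(1024 * MIB),
}


def with_clustering(writer_options: dict) -> dict:
    """Return a copy of the writer options with clustering settings added."""
    merged = dict(writer_options)
    merged.update(clustering_options)
    return merged


options = with_clustering(existing_options)
for name, value in sorted(options.items()):
    print(f"{name}={value}")
```

With PySpark, the merged dictionary could then be passed to the writer, for example `df.write.format("hudi").options(**options).mode("append").save(path)`, where `df` and `path` are placeholders. Trying this on a copy of the table first seems prudent, since the job has been running in production.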

ad1happy2go avatar May 15 '24 14:05 ad1happy2go

@ad1happy2go Sorry, I haven't tried it yet, because this code has been running in production for a long time. Are there any considerations or documentation links you can provide for switching from the bloom index to the bucket index? Thank you very much.

wkhappy1 avatar May 17 '24 01:05 wkhappy1

@ad1happy2go I also have a question: if I don't want to change the index type, can I simply rewrite the table, since it's not very large?

wkhappy1 avatar May 17 '24 01:05 wkhappy1