[HUDI-8071] Handle skew for user defined sort columns in BULK_INSERT
Change Logs
If the user-defined sort columns used as the sort key are skewed, the Spark sort produces fewer tasks, which increases contention when writing Parquet files. This change adds a config that appends the record key to the sort key to avoid the skew.
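To illustrate why a skewed sort key limits parallelism, here is a minimal plain-Java sketch (no Spark or Hudi dependency). The class `SkewSketch`, its helper methods, the `"+"` separator, and the sample data are hypothetical and are not the code in this PR. It only shows that appending the record key turns one heavily repeated sort key into many distinct keys, which gives a range-based sort more boundaries to split on.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SkewSketch {
    // Hypothetical helper: builds the key a record is sorted by.
    // With the suffix enabled, the record key is appended to the sort column value.
    static String buildSortKey(String sortColumnValue, String recordKey, boolean suffixRecordKey) {
        return suffixRecordKey ? sortColumnValue + "+" + recordKey : sortColumnValue;
    }

    // Hypothetical skewed input: every row has the same sort column value
    // ("US") but a unique record key. Each row is {sortColumnValue, recordKey}.
    static List<String[]> skewedSample(int n) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            rows.add(new String[] {"US", "key-" + i});
        }
        return rows;
    }

    // Counts the distinct sort keys. Rows with identical sort keys cannot be
    // separated by a range-based sort, so this bounds the useful task count.
    static int distinctSortKeys(List<String[]> rows, boolean suffixRecordKey) {
        Set<String> keys = new HashSet<>();
        for (String[] row : rows) {
            keys.add(buildSortKey(row[0], row[1], suffixRecordKey));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        List<String[]> rows = skewedSample(1000);
        System.out.println(distinctSortKeys(rows, false)); // prints 1
        System.out.println(distinctSortKeys(rows, true));  // prints 1000
    }
}
```

Without the suffix, all 1000 rows share a single sort key and would land in the same sorted range. With the suffix, each row has its own key, so the rows can be spread across tasks while still being grouped by the user-defined column first.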
Impact
None. This handles skew for partitioners that honour user-defined sort columns.
Risk level (write none, low, medium or high below)
Low. The behaviour is gated behind the config `hoodie.bulkinsert.suffix.record_key.user.defined.sort.columns`, which defaults to false.
Documentation Update
public static final ConfigProperty<Boolean> BULKINSERT_SUFFIX_RECORD_KEY_FOR_USER_DEFINED_SORT_COLUMNS = ConfigProperty
    .key("hoodie.bulkinsert.suffix.record_key.user.defined.sort.columns")
    .defaultValue(false)
    .markAdvanced()
    .withDocumentation(
        "When using user defined sort columns there can be possibility of skew and can cause increase in commit durations, "
            + "enabling this config suffixes the record key at the end to avoid skew");
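As a rough usage sketch, the new flag could be enabled alongside the existing bulk insert options by setting it in the write properties. The class name `BulkInsertSkewConfigExample` and the column names `country,city` are hypothetical. The two other keys (`hoodie.datasource.write.operation` and `hoodie.bulkinsert.user.defined.partitioner.sort.columns`) are assumed from existing Hudi write configs rather than taken from this PR, so check them against the Hudi version in use.

```java
import java.util.Properties;

public class BulkInsertSkewConfigExample {
    // Builds write properties for a bulk insert sorted by user-defined columns,
    // with the record-key suffix from this PR turned on.
    static Properties bulkInsertProps() {
        Properties props = new Properties();
        // Assumed existing Hudi option: selects the bulk insert write operation.
        props.setProperty("hoodie.datasource.write.operation", "bulk_insert");
        // Assumed existing Hudi option: user-defined sort columns (hypothetical column names).
        props.setProperty("hoodie.bulkinsert.user.defined.partitioner.sort.columns", "country,city");
        // Config added in this PR; it defaults to false, so it must be enabled explicitly.
        props.setProperty("hoodie.bulkinsert.suffix.record_key.user.defined.sort.columns", "true");
        return props;
    }

    public static void main(String[] args) {
        // Print each option as key=value so it can be copied into a properties file.
        Properties props = bulkInsertProps();
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + "=" + props.getProperty(key));
        }
    }
}
```

Because the flag defaults to false, existing jobs keep their current sort behaviour unless they opt in.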
Contributor's checklist
- [x] Read through contributor's guide
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed