[HUDI-8071] Handle skew for user defined sort columns in BULK_INSERT

Open vinishjail97 opened this issue 6 months ago • 1 comments

Change Logs

If the user-defined sort columns that make up the sort key are skewed, Spark's sort ends up using fewer tasks, which increases contention when writing Parquet files. This change adds a config that suffixes the record key to the sort key to avoid the skew.
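To illustrate why a suffix helps, here is a minimal sketch, not the code from this PR. It assumes that rows with an equal sort key always land in the same task, so the number of distinct sort keys bounds how many tasks can receive data. The `Row` class, the `+` separator and the `hot`/`other` column values are hypothetical and chosen only for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SortKeySkewSketch {

    // Hypothetical row: a record key plus the value of one user-defined sort column.
    public static final class Row {
        final String recordKey;
        final String sortColumn;

        Row(String recordKey, String sortColumn) {
            this.recordKey = recordKey;
            this.sortColumn = sortColumn;
        }
    }

    // Sort key built only from the user-defined sort column.
    public static String plainSortKey(Row row) {
        return row.sortColumn;
    }

    // Sort key with the record key appended; the "+" separator is an assumption.
    public static String suffixedSortKey(Row row) {
        return row.sortColumn + "+" + row.recordKey;
    }

    // Counts distinct sort keys, an upper bound on the tasks that can receive rows.
    public static int distinctKeys(List<Row> rows, boolean suffixRecordKey) {
        Set<String> keys = new HashSet<>();
        for (Row row : rows) {
            keys.add(suffixRecordKey ? suffixedSortKey(row) : plainSortKey(row));
        }
        return keys.size();
    }

    // Builds 1000 rows where 99% share the same sort column value.
    public static List<Row> skewedRows() {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            String value = (i % 100 == 0) ? "other" : "hot";
            rows.add(new Row("key-" + i, value));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Row> rows = skewedRows();
        System.out.println("distinct keys without suffix: " + distinctKeys(rows, false));
        System.out.println("distinct keys with suffix:    " + distinctKeys(rows, true));
    }
}
```

Because the sort column is still the leading part of the key, rows stay grouped by that column while the suffix gives the sort enough distinct keys to spread a hot value across tasks.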

Impact

None. This handles skew for partitioners that honour user-defined sort columns.

Risk level (write none, low medium or high below)

Low. The behaviour is gated behind the config hoodie.bulkinsert.suffix.record_key.user.defined.sort.columns, which defaults to false.

Documentation Update

  public static final ConfigProperty<Boolean> BULKINSERT_SUFFIX_RECORD_KEY_FOR_USER_DEFINED_SORT_COLUMNS = ConfigProperty
      .key("hoodie.bulkinsert.suffix.record_key.user.defined.sort.columns")
      .defaultValue(false)
      .markAdvanced()
      .withDocumentation(
          "When using user defined sort columns there can be possibility of skew and can cause increase in commit durations, "
              + "enabling this config suffixes the record key at the end to avoid skew");
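As a usage sketch, the option could be switched on alongside the existing bulk insert settings. Only the `hoodie.bulkinsert.suffix.record_key.user.defined.sort.columns` key comes from this PR; the other two keys are Hudi write options as I understand them, and the sort column name `city` is hypothetical. The resulting map is the kind of value one would pass to a Spark `DataFrameWriter` through `options(...)`, which is left out here so the snippet stays dependency-free.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BulkInsertSkewOptions {

    // Builds write options that enable record key suffixing for user-defined sort columns.
    public static Map<String, String> options() {
        Map<String, String> opts = new LinkedHashMap<>();
        // Assumed existing Hudi option: run the write as a bulk insert.
        opts.put("hoodie.datasource.write.operation", "bulk_insert");
        // Assumed existing Hudi option: sort by a user-defined column ("city" is hypothetical).
        opts.put("hoodie.bulkinsert.user.defined.partitioner.sort.columns", "city");
        // Config added by this PR; it defaults to false, so it must be enabled explicitly.
        opts.put("hoodie.bulkinsert.suffix.record_key.user.defined.sort.columns", "true");
        return opts;
    }

    public static void main(String[] args) {
        options().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```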

Contributor's checklist

  • [x] Read through contributor's guide
  • [x] Change Logs and Impact were stated clearly
  • [x] Adequate tests were added if applicable
  • [x] CI passed

vinishjail97 • Aug 12 '24 12:08