
[HUDI-8126] Persist sourceRdd to optimise writeStatus DAG

Open • vinishjail97 opened this issue 6 months ago • 1 comment

Change Logs

Persist the sourceRDD in the fetchNext method to optimise the error table DAG. This avoids an expensive second read of the source when writing to the error table.

Impact

Reduces the overall end-to-end sync latency when hoodie.errortable.enable is set to true, because the source is no longer read twice.
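To illustrate why persisting helps, here is a minimal plain-Java sketch of the idea, not Hudi or Spark code. It assumes a lazily evaluated source that is recomputed each time a downstream write consumes it, which is how an unpersisted RDD behaves. The names `SourcePersistSketch`, `readSource`, `persist` and `runSync` are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Plain-Java stand-in for a lazily evaluated source; hypothetical, not Hudi or Spark code.
class SourcePersistSketch {

    // Counts how many times the simulated expensive source read runs.
    static final AtomicInteger sourceReads = new AtomicInteger();

    // Simulated source read: every call fetches all records again.
    static List<String> readSource() {
        sourceReads.incrementAndGet();
        return List.of("ok-1", "bad-2", "ok-3", "bad-4");
    }

    // Wraps a supplier so the first result is kept and reused,
    // which is the role persisting plays for an RDD.
    static <T> Supplier<T> persist(Supplier<T> lazy) {
        return new Supplier<T>() {
            private T cached;
            private boolean loaded;

            @Override
            public synchronized T get() {
                if (!loaded) {
                    cached = lazy.get();
                    loaded = true;
                }
                return cached;
            }
        };
    }

    // Runs the data table write and the error table write against the source
    // and returns how many times the source was read.
    static int runSync(boolean persistSource) {
        sourceReads.set(0);
        Supplier<List<String>> source = SourcePersistSketch::readSource;
        if (persistSource) {
            source = persist(source);
        }

        // First consumer: valid records go to the data table.
        List<String> dataTable = new ArrayList<>();
        for (String r : source.get()) {
            if (r.startsWith("ok")) dataTable.add(r);
        }

        // Second consumer: failed records go to the error table.
        List<String> errorTable = new ArrayList<>();
        for (String r : source.get()) {
            if (r.startsWith("bad")) errorTable.add(r);
        }
        return sourceReads.get();
    }

    public static void main(String[] args) {
        System.out.println("source reads without persist: " + runSync(false));
        System.out.println("source reads with persist:    " + runSync(true));
    }
}
```

Without the wrapper, each of the two writes triggers its own source read. With it, the second write reuses the stored result, which is the saving this change targets.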

Ran a load test ingesting 15GB to compare three approaches. Union DAG Optimisation PR -> https://github.com/apache/hudi/pull/11843/files

1. Current version ➝ 1h 29min.

2. Union DAG Optimisation ➝ 1h 9min (about 22% less than the current version).

3. Source RDD Persist + Union DAG Optimisation ➝ 43min (about 52% less than the current version).

Risk level (write none, low, medium or high below)

Medium.

Documentation Update

  public static final ConfigProperty<Boolean> ERROR_TABLE_PERSIST_SOURCE_RDD = ConfigProperty
      .key("hoodie.errortable.source.rdd.persist")
      .defaultValue(false)
      .withDocumentation("Enabling this config, persists the sourceRDD to disk which helps in faster processing of data table + error table write DAG");
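Since the new config defaults to false, it has to be switched on explicitly alongside the error table. A minimal sketch, assuming the job's configuration is assembled as plain key/value properties; the class `ErrorTableProps` is hypothetical, and any other settings the error table needs are omitted here.

```java
import java.util.Properties;

// Hypothetical helper; only the two keys named in this PR are set.
class ErrorTableProps {

    static Properties build() {
        Properties props = new Properties();
        // Existing switch that turns on error table writes.
        props.setProperty("hoodie.errortable.enable", "true");
        // New config from this PR: persist the sourceRDD so it is not read twice.
        props.setProperty("hoodie.errortable.source.rdd.persist", "true");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```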

Contributor's checklist

  • [x] Read through contributor's guide
  • [x] Change Logs and Impact were stated clearly
  • [x] Adequate tests were added if applicable
  • [x] CI passed

vinishjail97 • Aug 27 '24 18:08