[HUDI-8126] Persist sourceRdd to optimise writeStatus DAG
Change Logs
Persist the sourceRDD in the fetchNext method to optimise the error table DAG. This avoids an expensive second read of the source when writing to the error table.
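The effect of persisting can be illustrated with a toy sketch that does not use Hudi or Spark at all. It assumes the source is modelled as a `Supplier` that counts how often it is read, and that the data table and error table passes each consume the input once; every class and method name here (`PersistSketch`, `CountingSource`, `PersistedSource`, `sourceReads`) is hypothetical and is not part of this PR.

```java
import java.util.List;
import java.util.function.Supplier;

/** Toy illustration only: plain Java stand-ins, not Hudi or Spark code. */
final class PersistSketch {

    /** Hypothetical expensive source; counts how many times it is fully read. */
    static final class CountingSource implements Supplier<List<String>> {
        int reads = 0;

        @Override
        public List<String> get() {
            reads++; // each call models one full scan of the upstream source
            return List.of("ok-1", "bad-2", "ok-3");
        }
    }

    /** Memoises the first read, loosely analogous to persisting an RDD. */
    static final class PersistedSource implements Supplier<List<String>> {
        private final Supplier<List<String>> delegate;
        private List<String> cached;

        PersistedSource(Supplier<List<String>> delegate) {
            this.delegate = delegate;
        }

        @Override
        public List<String> get() {
            if (cached == null) {
                cached = delegate.get(); // only the first consumer triggers a read
            }
            return cached;
        }
    }

    /** Runs a data-table pass and an error-table pass; returns the number of source reads. */
    static int sourceReads(boolean persist) {
        CountingSource source = new CountingSource();
        Supplier<List<String>> input = persist ? new PersistedSource(source) : source;

        // First consumer: records destined for the data table.
        long good = input.get().stream().filter(r -> r.startsWith("ok")).count();
        // Second consumer: records destined for the error table.
        long bad = input.get().stream().filter(r -> r.startsWith("bad")).count();

        if (good + bad != 3) {
            throw new IllegalStateException("every record should be routed exactly once");
        }
        return source.reads;
    }

    public static void main(String[] args) {
        System.out.println("reads without persist: " + sourceReads(false));
        System.out.println("reads with persist:    " + sourceReads(true));
    }
}
```

In this sketch the unpersisted path reads the source once per consumer, while the persisted path reads it once in total. In the actual change the persisted data is an RDD rather than an in-memory list, so this is only an analogy.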
Impact
Reduces the overall end-to-end sync latency when hoodie.errortable.enable is enabled, because the source is no longer read twice.
Ran a load test ingesting 15 GB to compare the three approaches below. Union DAG Optimisation PR: https://github.com/apache/hudi/pull/11843/files
1. Current version ➝ 1h 29min.
2. Union DAG Optimisation ➝ 1h 9min.
3. Source RDD Persist + Union DAG Optimisation ➝ 43min.
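For a quick reading of these timings, the sketch below converts them to minutes and computes the reduction relative to the current version (roughly 22% for the Union DAG optimisation alone and roughly 52% with source RDD persist added). The class and method names are hypothetical; only the three durations come from the load test above.

```java
/** Arithmetic on the load-test timings; names here are illustrative only. */
final class LoadTestSpeedup {

    /** Percentage reduction in run time relative to a baseline, both in minutes. */
    static double reductionPercent(int baselineMinutes, int newMinutes) {
        return 100.0 * (baselineMinutes - newMinutes) / baselineMinutes;
    }

    public static void main(String[] args) {
        int current = 89;          // 1h 29min
        int unionDag = 69;         // 1h 9min
        int persistPlusUnion = 43; // 43min

        System.out.printf("Union DAG Optimisation: %.1f%% faster%n",
                reductionPercent(current, unionDag));
        System.out.printf("Source RDD Persist + Union DAG Optimisation: %.1f%% faster%n",
                reductionPercent(current, persistPlusUnion));
    }
}
```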
Risk level (write none, low medium or high below)
Medium.
Documentation Update
public static final ConfigProperty<Boolean> ERROR_TABLE_PERSIST_SOURCE_RDD = ConfigProperty
    .key("hoodie.errortable.source.rdd.persist")
    .defaultValue(false)
    .withDocumentation("When enabled, persists the sourceRDD to disk, which speeds up processing of the data table + error table write DAG.");
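Since the new config defaults to false, it has to be switched on explicitly. A minimal sketch of doing so, assuming the job's settings are collected in a plain `java.util.Properties` object: the two keys are taken from this PR description, while the class name `ErrorTablePersistProps` is hypothetical, and how the properties reach the ingestion job (plus any other error table settings a real deployment needs) is not shown.

```java
import java.util.Properties;

/** Illustrative only: builds the two properties discussed in this PR. */
final class ErrorTablePersistProps {

    static Properties build() {
        Properties props = new Properties();
        // Error table writes must be on for the persist option to matter.
        props.setProperty("hoodie.errortable.enable", "true");
        // New config from this PR; defaults to false, so opt in explicitly.
        props.setProperty("hoodie.errortable.source.rdd.persist", "true");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```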
Contributor's checklist
- [x] Read through contributor's guide
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed