[Feature]: Event-Triggered Optimization of Iceberg Tables in Amoro

Open Jzjsnow opened this issue 3 months ago • 0 comments

Description

Currently, Amoro determines Iceberg table optimizations through periodic full-refresh evaluations at the table level. While this design ensures consistent refreshes, it introduces inefficiencies for large-scale Iceberg tables with continuously growing data volumes.

From log analysis on an Amoro system with 40,000+ tables, we observed:

  • Numerous redundant scans during refresh.
  • Low evaluation efficiency, as many scans did not lead to effective optimizations.
  • Resource wastage and system delays, limiting scalability and user experience.

To address these limitations, we propose introducing an event-triggered optimization mechanism that reduces unnecessary scans and enhances evaluation efficiency, while ensuring timely optimizations.

Use case/motivation

  • Deployments with tens of thousands of Iceberg tables suffer from frequent full-table refresh scans, leading to high resource usage and wasted computation.
  • Users need timely optimizations without the overhead of redundant scans.
  • For real-time scenarios, latency-sensitive workloads cannot tolerate inefficient refresh mechanisms.
  • The current design struggles to balance timeliness against resource utilization.

By moving from periodic full-refresh scans to event-driven triggers, Amoro can:

  1. Reduce full-table scans, avoiding redundant work.
  2. Increase the effective scan ratio, so that a larger share of scans leads to actual optimizations.
  3. Improve file merge efficiency, maximizing the effect of each scan.
  4. Ensure timely optimization, even while lowering scan frequency.

Describe the solution

To address pain points in Amoro's periodic optimization refresh mechanism, we propose an event-triggered optimization mechanism:

  • Use loaded table metadata and metrics to trigger scan evaluations, reducing redundant scans.
  • Replace periodic table loading with Iceberg commit-driven triggers.
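
The core idea of the two points above can be sketched as a gate keyed on snapshot IDs: an evaluation runs only when a commit reports a snapshot that has not been evaluated yet. This is a minimal illustration, not Amoro's implementation; `CommitTriggeredEvaluator`, `onCommit`, and `evaluationCount` are hypothetical names, and the actual scan evaluation is reduced to a counter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these names are Amoro classes.
class CommitTriggeredEvaluator {
    // Last snapshot ID evaluated for each table identifier.
    private final Map<String, Long> lastEvaluated = new ConcurrentHashMap<>();
    // Counts evaluations so the effect of the gate can be observed.
    private final AtomicInteger evaluations = new AtomicInteger();

    /** Handles a commit event; returns true only if a scan evaluation ran. */
    boolean onCommit(String tableId, long snapshotId) {
        Long previous = lastEvaluated.put(tableId, snapshotId);
        if (previous != null && previous == snapshotId) {
            return false; // same snapshot already evaluated: skip the scan
        }
        evaluations.incrementAndGet(); // placeholder for the real evaluation
        return true;
    }

    int evaluationCount() {
        return evaluations.get();
    }
}
```

With this gate, a table that receives no commits costs nothing, while under a periodic refresh it would be loaded and evaluated on every cycle.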

Subtasks

Metadata Metric-Driven Evaluation

  • [ ] #3775
  • [ ] Documentation updates.
  • [ ] (optional) Bind each computed pendingInput result to its snapshotId to avoid repeated scans while the table is in the pending state.
  • [ ] (optional) Cache storage for MSE calculation results.
  • [ ] (optional) Cache storage for pendingInput queue (e.g., when >100 partitions need optimization, enable queued optimization).
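
One possible shape for the optional snapshotId binding is a per-table cache entry that is reused while the snapshot is unchanged and recomputed otherwise. The sketch below is an assumption about how this could look, not a description of existing Amoro code; `PendingInputCache` and its `get` method are hypothetical, and the type parameter `T` stands in for whatever type holds a pendingInput result.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch; T stands in for the type of a pendingInput result.
class PendingInputCache<T> {
    // A cached result together with the snapshot it was computed from.
    private static final class Entry<V> {
        final long snapshotId;
        final V value;

        Entry(long snapshotId, V value) {
            this.snapshotId = snapshotId;
            this.value = value;
        }
    }

    private final Map<String, Entry<T>> entries = new ConcurrentHashMap<>();

    /** Returns the cached result for this snapshot, running the scan only on a miss. */
    T get(String tableId, long snapshotId, Supplier<T> scan) {
        Entry<T> entry = entries.compute(tableId, (id, old) ->
            (old != null && old.snapshotId == snapshotId)
                ? old                                        // snapshot unchanged: reuse
                : new Entry<T>(snapshotId, scan.get()));     // new snapshot: rescan
        return entry.value;
    }
}
```

Running the scan inside `compute` keeps the example short but holds the map bin while scanning; a production version would more likely compute outside the map and then publish the result.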

Iceberg Commit-Triggered Table Runtime Refresh

  • [ ] REST endpoint for reporting external table events: define interface and JSON body format.
  • [ ] Add an ExternalEventReportService to trigger TableRuntimeRefreshExecutor for refresh & evaluation based on commit operations.
  • [ ] Catalog query service API.
  • [ ] Documentation updates.
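
As a starting point for discussion, the event flow could look roughly like the sketch below. Everything here is hypothetical: the JSON body in the comment, the `TableCommitEvent` fields, and the `RefreshTrigger` interface (a stand-in for whatever hands work to TableRuntimeRefreshExecutor) are illustrative only, since the interface and body format are still to be defined. The filter on the commit operation is also an assumption: it forwards `append`, `overwrite`, and `delete` commits and skips `replace` commits, which is the operation Iceberg records for compaction-style rewrites, so that optimizer commits do not re-trigger evaluation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical payload; the real JSON body format is still to be defined.
// An illustrative body could be:
// {"catalog":"c1","database":"db","table":"t1","snapshotId":42,"operation":"append"}
final class TableCommitEvent {
    final String catalog;
    final String database;
    final String table;
    final long snapshotId;
    final String operation;

    TableCommitEvent(String catalog, String database, String table,
                     long snapshotId, String operation) {
        this.catalog = catalog;
        this.database = database;
        this.table = table;
        this.snapshotId = snapshotId;
        this.operation = operation;
    }

    String identifier() {
        return catalog + "." + database + "." + table;
    }
}

// Hypothetical stand-in for the component that schedules refresh and evaluation.
interface RefreshTrigger {
    void refresh(String tableIdentifier, long snapshotId);
}

class ExternalEventReportService {
    // Assumption: only data-changing commits warrant a refresh; "replace"
    // commits are skipped to avoid self-triggering after compaction.
    private static final Set<String> TRIGGERING_OPERATIONS =
        new HashSet<>(Arrays.asList("append", "overwrite", "delete"));

    private final RefreshTrigger trigger;

    ExternalEventReportService(RefreshTrigger trigger) {
        this.trigger = trigger;
    }

    /** Returns true if the event was forwarded to the refresh trigger. */
    boolean report(TableCommitEvent event) {
        if (!TRIGGERING_OPERATIONS.contains(event.operation)) {
            return false;
        }
        trigger.refresh(event.identifier(), event.snapshotId);
        return true;
    }
}
```

A REST handler would deserialize the request body into such an event and call `report`; keeping the service independent of the HTTP layer makes the trigger logic testable on its own.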

Related issues

No response

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

Code of Conduct

Jzjsnow, Sep 09 '25 07:09