Add native Apache Iceberg table support with CoralCatalog abstraction
What changes are proposed in this pull request, and why are they necessary?
Summary
This PR introduces native Apache Iceberg table support to Coral, enabling direct schema conversion from Iceberg to Calcite's RelDataType without lossy intermediate conversions through Hive's type system. The implementation preserves Iceberg-specific type semantics including timestamp precision and explicit nullability.
Key architectural decision: HiveMetastoreClient remains unchanged and
does NOT extend CoralCatalog. Integration classes use composition
(storing both instances) with runtime dispatch.
New Components
- Catalog Abstraction (
coral-common/src/main/java/com/linkedin/coral/common/catalog/) -
CoralCatalog: Format-agnostic catalog interface withgetTable(),getAllTables(),namespaceExists() -
CoralTable: Unified table metadata interface (name(),properties(),tableType()) -
HiveCoralTable/IcebergCoralTable: Implementations wrapping native Hive/Iceberg table objects - TableType`: Simple enum (TABLE or VIEW)
Iceberg Integration
-
IcebergTable: Calcite ScannableTable implementation for Iceberg tables -
IcebergTypeConverter: Converts Iceberg Schema → Calcite RelDataType with precision preservation -
IcebergHiveTableConverter: Backward compatibility bridge for UDF resolution (converts Iceberg → Hive table object)
Integration Pattern
-
HiveSchema,HiveDbSchema,ToRelConverter: Store bothCoralCatalogandHiveMetastoreClientinstances - Runtime dispatch: if
coralCatalog != nulluse unified path; else ifmsc != nulluse Hive-only path -
HiveMetastoreClientandHiveMscAdaptermarked @Deprecated (still functional, prefer CoralCatalog)
How Reviewers Should Read This Start here:
-
CoralCatalog.java- New abstraction layer interface -
CoralTable.java- Unified table metadata interface -
IcebergCoralTable.java- How Iceberg tables are wrapped -
IcebergTypeConverter.java- Core schema conversion logic
Then review integration:
-
HiveDbSchema.java- Dispatch logic based onCoralTabletype (Iceberg vs Hive) -
IcebergTable.java- Calcite integration -
ToRelConverter.java- Dual-path support (CoralCatalog vs HiveMetastoreClient) -
HiveMetastoreClient.java- Backward compatibility
Test:
-
IcebergTableConverterTest.java- End-to-end Iceberg conversion test
How was this patch tested?
New and existing tests pass integration tests - WIP
Let us not use Dataset as it is not a standard term. Table maybe more appropriate, but I understand you want to use it elsewhere, but we can find an alternative. I think Schema is also closer to calcite terminolorgy than Namespace. As much as possible we should use standard terms, or when in ambiguity, we should be closer to Calcite terminology.
Can we eliminate
if (coralCatalog != null) {
...
} else {
...
}
that we are currently using everywhere and use some adapter class instead?
There are quite a fiew overlapping wrappers:
CoralTable --> HiveCoralTable / IcebergCoralTable
HiveTable / HiveViewTable
IcebergTable
The layering is conceptually unclear. Can we simplify this and merge a few classes?
I have considered a few design options and this seems to make the most sense:
interface CoralTable extends ScannableTable
class HiveTable implements CoralTable
class IcebergTable implements CoralTable
I have considered a few design options and this seems to make the most sense:
interface CoralTable extends ScannableTable class HiveTable implements CoralTable class IcebergTable implements CoralTable
Discussed offline. The motivation of the above was to avoid duplicating implementation layers (i.e., having both HiveTable and HiveCoralTable, IcebergTable and IcebergCoralTable as in the current PR). The idea was to consolidate each table foramt's implementation into a single class that directly implements ScannableTable. However, this approach exposes RelDataType through the CoralTable API, which could make it harder to replace RelDataType in the future, especially CoralTable will be exposed to the engine connectors.
To properly decouple the two, we would need a standalone Coral type system that models schema and type metadata independently. That type system has now been introduced in #558, which can serve as the foundation for adopting an approach that makes CoralTable fully standalone and decoupled from ScannableTable.