coral icon indicating copy to clipboard operation
coral copied to clipboard

Add native Apache Iceberg table support with CoralCatalog abstraction

Open aastha25 opened this issue 2 months ago • 5 comments

What changes are proposed in this pull request, and why are they necessary?

Summary This PR introduces native Apache Iceberg table support to Coral, enabling direct schema conversion from Iceberg to Calcite's RelDataType without lossy intermediate conversions through Hive's type system. The implementation preserves Iceberg-specific type semantics including timestamp precision and explicit nullability. Key architectural decision: HiveMetastoreClient remains unchanged and does NOT extend CoralCatalog. Integration classes use composition (storing both instances) with runtime dispatch.

New Components

  • Catalog Abstraction (coral-common/src/main/java/com/linkedin/coral/common/catalog/)
  • CoralCatalog: Format-agnostic catalog interface with getTable(), getAllTables(), namespaceExists()
  • CoralTable: Unified table metadata interface (name(), properties(), tableType())
  • HiveCoralTable / IcebergCoralTable: Implementations wrapping native Hive/Iceberg table objects
  • TableType`: Simple enum (TABLE or VIEW)

Iceberg Integration

  • IcebergTable: Calcite ScannableTable implementation for Iceberg tables
  • IcebergTypeConverter: Converts Iceberg Schema → Calcite RelDataType with precision preservation
  • IcebergHiveTableConverter: Backward compatibility bridge for UDF resolution (converts Iceberg → Hive table object)

Integration Pattern

  • HiveSchema, HiveDbSchema, ToRelConverter: Store both CoralCatalog and HiveMetastoreClient instances
  • Runtime dispatch: if coralCatalog != null use unified path; else if msc != null use Hive-only path
  • HiveMetastoreClient and HiveMscAdapter marked @Deprecated (still functional, prefer CoralCatalog)

How Reviewers Should Read This Start here:

  1. CoralCatalog.java - New abstraction layer interface
  2. CoralTable.java - Unified table metadata interface
  3. IcebergCoralTable.java - How Iceberg tables are wrapped
  4. IcebergTypeConverter.java - Core schema conversion logic

Then review integration:

  1. HiveDbSchema.java - Dispatch logic based on CoralTable type (Iceberg vs Hive)
  2. IcebergTable.java - Calcite integration
  3. ToRelConverter.java - Dual-path support (CoralCatalog vs HiveMetastoreClient)
  4. HiveMetastoreClient.java - Backward compatibility

Test:

  1. IcebergTableConverterTest.java - End-to-end Iceberg conversion test

How was this patch tested?

New and existing tests pass integration tests - WIP

aastha25 avatar Oct 24 '25 21:10 aastha25

Let us not use Dataset as it is not a standard term. Table maybe more appropriate, but I understand you want to use it elsewhere, but we can find an alternative. I think Schema is also closer to calcite terminolorgy than Namespace. As much as possible we should use standard terms, or when in ambiguity, we should be closer to Calcite terminology.

wmoustafa avatar Oct 27 '25 06:10 wmoustafa

Can we eliminate

if (coralCatalog != null) {
  ...
} else {
  ...
}

that we are currently using everywhere and use some adapter class instead?

wmoustafa avatar Oct 31 '25 05:10 wmoustafa

There are quite a fiew overlapping wrappers:

CoralTable --> HiveCoralTable / IcebergCoralTable
HiveTable / HiveViewTable
IcebergTable

The layering is conceptually unclear. Can we simplify this and merge a few classes?

wmoustafa avatar Oct 31 '25 05:10 wmoustafa

I have considered a few design options and this seems to make the most sense:

interface CoralTable extends ScannableTable
class HiveTable implements CoralTable
class IcebergTable implements CoralTable

wmoustafa avatar Oct 31 '25 06:10 wmoustafa

I have considered a few design options and this seems to make the most sense:

interface CoralTable extends ScannableTable
class HiveTable implements CoralTable
class IcebergTable implements CoralTable

Discussed offline. The motivation of the above was to avoid duplicating implementation layers (i.e., having both HiveTable and HiveCoralTable, IcebergTable and IcebergCoralTable as in the current PR). The idea was to consolidate each table foramt's implementation into a single class that directly implements ScannableTable. However, this approach exposes RelDataType through the CoralTable API, which could make it harder to replace RelDataType in the future, especially CoralTable will be exposed to the engine connectors.

To properly decouple the two, we would need a standalone Coral type system that models schema and type metadata independently. That type system has now been introduced in #558, which can serve as the foundation for adopting an approach that makes CoralTable fully standalone and decoupled from ScannableTable.

wmoustafa avatar Nov 01 '25 09:11 wmoustafa