DATAPLT-1268 Add shortcut for Iceberg-backed Glue Table
Manual Configuration
Optimizer Configuration
The Table Optimizer API has a number of configuration options that are not exposed in CloudFormation.
CompactionConfiguration
Setting CompactionConfiguration can only be done via API calls (e.g., the CLI) after the resource has been constructed. This shortcut can enable compaction, but it cannot configure it. In many cases the default configuration may be sufficient. The following options require manual configuration after creation:
- strategy: the default is binpack. Note that using sort or z-order requires the table to have the sort order manually set via Spark SQL.
- minInputFiles: minimum number of files required to initiate a compaction, default is 100
- deleteFileThreshold: minimum number of deletes that must be present in a data file to make it eligible for compaction, default is 1
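A minimal sketch of the post-creation step in Python follows. It assumes boto3's Glue client and its update_table_optimizer call; the helper names, role ARN, catalog ID, database and table are hypothetical placeholders, and the defaults mirror the values listed above. The payload is built separately from the call so it can be inspected before anything is sent.

```python
from typing import Any, Dict


def compaction_config(role_arn: str,
                      strategy: str = "binpack",
                      min_input_files: int = 100,
                      delete_file_threshold: int = 1) -> Dict[str, Any]:
    """Build a TableOptimizerConfiguration payload for the compaction optimizer."""
    if strategy not in ("binpack", "sort", "z-order"):
        raise ValueError(f"unsupported strategy: {strategy}")
    return {
        "roleArn": role_arn,
        "enabled": True,
        "compactionConfiguration": {
            "icebergConfiguration": {
                "strategy": strategy,
                "minInputFiles": min_input_files,
                "deleteFileThreshold": delete_file_threshold,
            }
        },
    }


def apply_compaction_config(glue_client: Any, catalog_id: str, database: str,
                            table: str, config: Dict[str, Any]) -> Any:
    """Send the payload; glue_client is assumed to be boto3.client("glue")."""
    return glue_client.update_table_optimizer(
        CatalogId=catalog_id,
        DatabaseName=database,
        TableName=table,
        Type="compaction",
        TableOptimizerConfiguration=config,
    )


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   cfg = compaction_config("arn:aws:iam::123456789012:role/example", min_input_files=50)
#   apply_compaction_config(boto3.client("glue"), "123456789012", "example_db", "example_table", cfg)
```

Because the configuration was applied outside CloudFormation, a later stack update may need the same step repeated; that behaviour is worth confirming during testing.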
OrphanFileDeletionConfiguration
CloudFormation includes support for setting the OrphanFileRetentionPeriodInDays property, but the following must be set using the API/CLI:
- location: a sub-directory in which to look for files, default is the table location
- runRateInHours: interval in hours between orphan file deletion job runs, default is 24
RetentionConfiguration
CloudFormation includes support for setting the cleanExpiredFiles, numberOfSnapshotsToRetain and snapshotRetentionPeriodInDays properties, but the following must be set using the API/CLI:
- runRateInHours: interval in hours between retention job runs, default is 24
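The API-only settings for the retention and orphan file deletion optimizers could be applied along these lines. This is a sketch that assumes boto3's Glue update_table_optimizer call with Type set to "retention" or "orphan_file_deletion"; the helper names and the role ARN are hypothetical. Whether the CloudFormation-managed properties have to be re-sent in the same payload has not been verified here, so they are left out.

```python
from typing import Any, Dict, Optional


def _optimizer_config(role_arn: str, section: str,
                      iceberg: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap Iceberg settings in a TableOptimizerConfiguration payload."""
    return {
        "roleArn": role_arn,
        "enabled": True,
        section: {"icebergConfiguration": iceberg},
    }


def retention_config(role_arn: str, run_rate_in_hours: int = 24) -> Dict[str, Any]:
    """Payload for the retention optimizer (Type="retention")."""
    return _optimizer_config(role_arn, "retentionConfiguration",
                             {"runRateInHours": run_rate_in_hours})


def orphan_file_deletion_config(role_arn: str, run_rate_in_hours: int = 24,
                                location: Optional[str] = None) -> Dict[str, Any]:
    """Payload for the orphan file deletion optimizer (Type="orphan_file_deletion")."""
    iceberg: Dict[str, Any] = {"runRateInHours": run_rate_in_hours}
    if location is not None:
        # Omitting location keeps the default (the table location).
        iceberg["location"] = location
    return _optimizer_config(role_arn, "orphanFileDeletionConfiguration", iceberg)


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").update_table_optimizer(
#       CatalogId="123456789012", DatabaseName="example_db", TableName="example_table",
#       Type="retention",
#       TableOptimizerConfiguration=retention_config("arn:aws:iam::123456789012:role/example", 12),
#   )
```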
Sort Order
Sort order can only be set using Spark SQL. TODO: add details
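As a starting point for those details: Iceberg's Spark SQL extensions provide an ALTER TABLE ... WRITE ORDERED BY statement. The sketch below only builds that statement in Python; the catalog, database, table and column names are hypothetical, and it assumes a Spark session that already has the Iceberg SQL extensions and a Glue-backed catalog configured.

```python
from typing import Sequence


def write_ordered_by_sql(table: str, columns: Sequence[str]) -> str:
    """Build the Iceberg Spark SQL statement that sets a table's sort order."""
    if not columns:
        raise ValueError("at least one sort column is required")
    return f"ALTER TABLE {table} WRITE ORDERED BY {', '.join(columns)}"


# Hypothetical usage, given an existing SparkSession named spark:
#   spark.sql(write_ordered_by_sql("glue_catalog.example_db.example_table",
#                                  ["event_date", "id"]))
```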
Testing
TODO:
- use the shortcut to create some tables and use them
- make sure that example Spark SQL code works for setting order (and that the table keeps working)
- try making a table that uses bucketing (we don't need to do anything extra to support that, right? it's in partition definition? or?)