How to insert overwrite with a single commit
Query engine
Apache Hive
Question
Hello Iceberg Community,
Background: implementation of Iceberg compaction in Apache Hive.
Apache Hive currently supports major query-based Iceberg compaction, which compacts the whole table by internally executing the statement insert overwrite table <TableName> select * from <TableName>;
Iceberg insert overwrite (IOW) isn't supported on a table that has had partition or schema evolution, because it can lead to wrong query results. So at the commit stage, this compaction IOW command deletes all files in the table and then adds the new compacted files. That creates 2 snapshots, which can lead to a data correctness problem: if a user queries the table by the ID of the snapshot in which all files have been deleted, it gives the impression that the table contained no data at that point in time.
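To make the two-snapshot problem concrete, here is a minimal toy model in Python. It is only a sketch: `ToyTable`, `commit` and `files_at` are hypothetical names invented for illustration and are not part of the Iceberg or Hive APIs. It assumes each commit produces exactly one snapshot holding the set of live files.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy snapshot log (hypothetical, not the Iceberg API)."""

    # Snapshot 0 is the empty table at creation time.
    snapshots: list = field(default_factory=lambda: [frozenset()])

    def commit(self, deleted=frozenset(), added=frozenset()):
        """Apply deletes and adds atomically and return the new snapshot id."""
        current = self.snapshots[-1]
        self.snapshots.append((current - frozenset(deleted)) | frozenset(added))
        return len(self.snapshots) - 1

    def files_at(self, snapshot_id):
        """Time travel: the live files as of the given snapshot."""
        return self.snapshots[snapshot_id]


# Compaction as two commits: delete everything, then add the compacted file.
two_step = ToyTable()
s1 = two_step.commit(added={"a.parquet", "b.parquet"})
s2 = two_step.commit(deleted=two_step.files_at(s1))
s3 = two_step.commit(added={"compacted.parquet"})

# Compaction as one commit: delete and add in the same snapshot.
single = ToyTable()
t1 = single.commit(added={"a.parquet", "b.parquet"})
t2 = single.commit(deleted=single.files_at(t1), added={"compacted.parquet"})

# The intermediate snapshot s2 looks like an empty table to a time-travel query.
print("two-step, snapshot", s2, "->", sorted(two_step.files_at(s2)))
print("single,   snapshot", t2, "->", sorted(single.files_at(t2)))
```

In the two-commit variant, a time-travel read at `s2` sees no files at all, although the table logically never lost its data. In the single-commit variant no such snapshot exists.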
Another possibility that we considered is the RewriteFiles API, which can delete all data and delete files and add the new compacted files in one commit. With this approach, however, we need to build a list of all the existing data and delete files to pass to the RewriteFiles API, which can be a problem if a table has thousands of files.
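The difference between the two call shapes can be sketched as follows. This is a toy Python model, not Iceberg code: `rewrite_files` only mimics the shape of a RewriteFiles-style call (explicit files to delete plus files to add), and `replace_all` is a hypothetical operation of the kind asked for here. File names and counts are made up.

```python
def rewrite_files(current, to_delete, to_add):
    """RewriteFiles-style shape: the caller names every file to drop."""
    to_delete = set(to_delete)
    unknown = to_delete - set(current)
    if unknown:
        raise ValueError(f"not in table: {sorted(unknown)}")
    return (set(current) - to_delete) | set(to_add)


def replace_all(current, to_add):
    """Hypothetical shape: drop whatever is live without the caller listing it."""
    return set(to_add)


# A table with thousands of data files and some delete files (made-up names).
current = {f"data-{i}.parquet" for i in range(5000)}
current |= {f"delete-{i}.parquet" for i in range(500)}
compacted = {"compacted-0.parquet"}

# With the RewriteFiles-style call, this full listing must be built first.
listing = sorted(current)
after_rewrite = rewrite_files(current, listing, compacted)

# With the hypothetical call, no listing is needed on the caller side.
after_replace = replace_all(current, compacted)

print(len(listing), after_rewrite == after_replace)
```

Both calls end in the same table state in one step; the only difference is that the first one requires the caller to enumerate every existing data and delete file up front.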
Does Iceberg have an API that can perform IOW in a single commit, without listing all the existing data and delete files as RewriteFiles requires? If not, would you consider implementing such an API?
CC: @gaborkaszab
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'