qbeast-spark
Analyse the impact of Delete operation in Qbeast Index
This issue clarifies the status of the Delete operation in the Qbeast Spark library and the next steps on the roadmap.
DELETE is a basic data management operation supported in all Open Table Formats (Delta, Iceberg, and Hudi). It removes specific rows from a table and is usually implemented with one of two strategies:
- Merge On Read. The rows are marked as deleted and are discarded at read time.
- Copy on Write. The files that contain the affected records are removed, and their data is rewritten to new files without the deleted records.
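The difference between the two strategies can be illustrated with a toy, in-memory model. This is only a sketch: the "table" is a plain dictionary of file names to rows, and the function names and the `.rewritten` suffix are hypothetical, not part of Delta or Qbeast.

```python
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, int]
Table = Dict[str, List[Row]]  # file name -> rows stored in that file


def delete_merge_on_read(table: Table, predicate: Callable[[Row], bool]) -> Set[Tuple[str, int]]:
    """Return (file, position) markers for deleted rows; files stay untouched."""
    deleted = set()
    for name, rows in table.items():
        for pos, row in enumerate(rows):
            if predicate(row):
                deleted.add((name, pos))
    return deleted


def read_merge_on_read(table: Table, deleted: Set[Tuple[str, int]]) -> List[Row]:
    """Discard the marked rows at read time."""
    return [
        row
        for name, rows in table.items()
        for pos, row in enumerate(rows)
        if (name, pos) not in deleted
    ]


def delete_copy_on_write(table: Table, predicate: Callable[[Row], bool]) -> Table:
    """Drop every file that holds a matching row and rewrite its survivors."""
    new_table: Table = {}
    for name, rows in table.items():
        if not any(predicate(row) for row in rows):
            new_table[name] = rows  # file is not affected, keep it as-is
            continue
        kept = [row for row in rows if not predicate(row)]
        if kept:
            # Hypothetical name for the newly written file.
            new_table[name + ".rewritten"] = kept
    return new_table
```

With Merge On Read the original files (and whatever metadata is attached to them) survive; with Copy on Write the affected files are replaced by new ones, which is the case that matters for the index.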
Because Qbeast is interoperable with these formats, the operation can be executed through Delta's interface:
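```python
import delta
from pyspark.sql import functions as F

dt = delta.DeltaTable.forPath(spark, "tmp/qbeast-table")  # `spark` is an active SparkSession
dt.delete(F.col("age") > 75)  # remove every row with age above 75
```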
By default, Delta uses the Copy on Write mechanism: it removes the affected files and adds new ones. Removing a file means that its AddFile entry, together with the corresponding Qbeast metadata, is no longer available in the Snapshot. The newly written files do not contain the tags needed to rebuild the OTree either.
Or, in other words: the operation could potentially harm the index structure.
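One way to observe this is to replay the actions of the Delta transaction log and list the live files whose `add` action lacks the expected tags. The sketch below works on raw JSON log lines passed in commit order and ignores checkpoints. The function name is hypothetical, and the tag key `revision` is an assumption: replace it with the tag keys your Qbeast version actually writes.

```python
import json
from typing import Dict, Iterable, List, Sequence


def live_files_missing_tags(
    log_lines: Iterable[str],
    required_tags: Sequence[str] = ("revision",),  # assumed tag key, adjust as needed
) -> List[str]:
    """Replay add/remove actions and return live files without the required tags."""
    live: Dict[str, Dict[str, str]] = {}
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        if "add" in action:
            add = action["add"]
            # `tags` may be absent or null for files not written by Qbeast.
            live[add["path"]] = add.get("tags") or {}
        elif "remove" in action:
            live.pop(action["remove"]["path"], None)
    return sorted(
        path
        for path, tags in live.items()
        if any(tag not in tags for tag in required_tags)
    )
```

For example, feeding it the lines of every `_delta_log/*.json` file, in version order, after a delete should list the files rewritten by Delta, assuming they were written without Qbeast tags as described above.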
Things to do:
- [x] Add an entry in the Documentation that addresses the current limitations.
- [x] Analyze the impact of missing blocks.
- [x] Analyze the impact of missing cubes.
- [ ] Propose a mechanism to maintain a correct structure even if some files are missing OR develop a mechanism to ensure deletes maintain the index in a correct shape.