Issue #294: Optimization of Unindexed Files [Staging Area]
Description
Adds #294
Type of change
New Feature. The Unindexed Files of a Qbeast Table were only optimizable through the StagingDataManager component. After reviewing the structure and use cases, we noticed that the Staging Area has lost its original purpose (see issue #438) and that we should treat Indexed and Unindexed Files separately from the Append execution.
For that reason, we extend the optimization interface to enable processing of Unindexed Files as well.
API
```scala
import io.qbeast.spark.QbeastTable

val qbeastTable = QbeastTable.forPath(spark, "/path")
qbeastTable.optimize(revisionID = 0L, fraction = <fraction_to_optimize>)
```
- `revisionID = 0L` is the Revision ID reserved for Unindexed Files.
- `fraction`: any number from 0.0 to 1.0 indicating how much of the unindexed data we want to optimize. The default is 1.0. If the table was recently converted to Qbeast and contains a lot of legacy data, we suggest reducing the fraction and doing the operation in batches.
WARNING: each time we execute `optimize()` for the Unindexed Files, the bytes to optimize are computed from the current state of the table. Files that have already been indexed are not considered in subsequent iterations.
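For example, a recently converted table with a large amount of legacy data could be optimized in several small passes. The loop below is a minimal sketch; the number of passes and the 0.25 fraction are illustrative values, not part of the API:

```scala
import io.qbeast.spark.QbeastTable

val qbeastTable = QbeastTable.forPath(spark, "/path")

// Optimize the Unindexed Files (revisionID = 0L) in several small passes.
// Each call recomputes the unindexed bytes from the current state, so every
// pass works on roughly 25% of whatever is still unindexed at that point.
(1 to 4).foreach { _ =>
  qbeastTable.optimize(revisionID = 0L, fraction = 0.25)
}
```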
Implementation
- Load the `QbeastSnapshot` of the Table.
- Read the list of Unindexed Files from the `QbeastSnapshot`.
- Select files until the `fraction * totalBytes` threshold is reached (see the sketch after this list).
- Apply indexing and roll-up to the data.
- Write the data in new files.
- In the same transaction (this step should be done internally by qbeast-spark, since we have more control over Add Files, Delete Files, and transaction open/close):
  - Mark the old files as Deleted.
  - Add the new file entries.
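The file-selection step could look roughly like the sketch below. The `UnindexedFile` case class and the `selectFilesToOptimize` helper are hypothetical names used only for illustration; the actual implementation works over the entries exposed by the `QbeastSnapshot`:

```scala
// Hypothetical shape of an unindexed file entry; the real implementation
// reads these from the QbeastSnapshot.
case class UnindexedFile(path: String, sizeInBytes: Long)

// Greedily select files until fraction * totalBytes is covered.
def selectFilesToOptimize(files: Seq[UnindexedFile], fraction: Double): Seq[UnindexedFile] = {
  val totalBytes = files.map(_.sizeInBytes).sum
  val threshold  = (fraction * totalBytes).toLong

  // Cumulative bytes *before* each file: keep a file while the running total
  // is still below the threshold.
  val cumulativeBefore = files.scanLeft(0L)(_ + _.sizeInBytes)
  files.zip(cumulativeBefore).takeWhile { case (_, bytes) => bytes < threshold }.map(_._1)
}
```

With `fraction = 1.0` the threshold equals the total unindexed bytes, so every file is selected; smaller fractions stop the selection once enough bytes have been accumulated.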
Checklist:
Here is the list of things you should do before submitting this pull request:
- [x] New feature / bug fix has been committed following the Contribution guide.
- [x] Add logging to the code following the Contribution guide.
- [x] Add comments to the code (make it easier for the community!).
- [ ] Change the documentation.
- [x] Add tests.
- [x] Your branch is updated to the main branch (dependent changes have been merged).
How Has This Been Tested? (Optional)
This has been tested locally with `QbeastOptimizationIntegrationTest`.
I've added five cases:
- Optimization of a Converted Table. (All data is unindexed).
- Optimization of a Hybrid Table after Append. (Some data is unindexed after an external append).
- Optimization of a Hybrid Table after Delete. (Some data is deleted and consequently unindexed).
- Optimization of a Hybrid Table after Update. (Some data is updated and consequently unindexed).
- Optimization of a fraction of a Hybrid Table. (Do not optimize all the Unindexed Files at once).
Test Configuration:
- Spark Version: 3.5.0
- Hadoop Version: 3.3.4
- Cluster or local? Local