Issue #294: Optimization of Unindexed Files [Staging Area]
Description
Adds #294
Type of change
New Feature. The Unindexed Files of a Qbeast Table were only optimizable through the StagingDataManager component. After reviewing the structure and use cases, we noticed that the Staging Area has lost its original purpose (see issue #438) and that we should treat Indexed and Unindexed Files separately from the Append execution.
For that reason, we extend the optimization interface to enable processing of Unindexed Files as well.
API
```scala
import io.qbeast.spark.QbeastTable

val qbeastTable = QbeastTable.forPath(spark, "/path")
qbeastTable.optimize(revisionID = 0L, fraction = <fraction_to_optimize>)
```
- `revisionID = 0L` is the Revision ID reserved for Unindexed Files.
- `fraction`: any number from 0.0 to 1.0 indicating how much of the unindexed data we want to optimize. The default is 1.0. If the table was recently converted to Qbeast and contains a lot of legacy data, we suggest reducing the fraction and doing the operation in batches.
WARNING: each time we execute `optimize()` for the Unindexed Files, the bytes to optimize are computed from the current state of the table. Files that have already been indexed are not considered in subsequent iterations.
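For example, a recently converted table with a large amount of legacy data could be optimized in several small passes. The loop below is a minimal sketch; the number of passes and the 0.25 fraction are illustrative values, not part of the API:

```scala
import io.qbeast.spark.QbeastTable

val qbeastTable = QbeastTable.forPath(spark, "/path")

// Optimize the Unindexed Files (revisionID = 0L) in several small passes.
// Each call recomputes the unindexed bytes from the current state, so every
// pass works on roughly 25% of whatever is still unindexed at that point.
(1 to 4).foreach { _ =>
  qbeastTable.optimize(revisionID = 0L, fraction = 0.25)
}
```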
Implementation
- Load the `QbeastSnapshot` of the Table.
- Read the list of Unindexed Files from the `QbeastSnapshot`.
- Select files until the `fraction * totalBytes` threshold is reached (see the sketch after this list).
- Apply indexing and roll-up to the data.
- Write the data in new files.
- In the same transaction (this step should be done internally by qbeast-spark, since we have more control over Add Files, Delete Files, and transaction open/close):
  - Mark the old files as Deleted.
  - Add the new file entries.
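The file-selection step could look roughly like the sketch below. The `UnindexedFile` case class and the `selectFilesToOptimize` helper are hypothetical names used only for illustration; the actual implementation works over the entries exposed by the `QbeastSnapshot`:

```scala
// Hypothetical shape of an unindexed file entry; the real implementation
// reads these from the QbeastSnapshot.
case class UnindexedFile(path: String, sizeInBytes: Long)

// Greedily select files until fraction * totalBytes is covered.
def selectFilesToOptimize(files: Seq[UnindexedFile], fraction: Double): Seq[UnindexedFile] = {
  val totalBytes = files.map(_.sizeInBytes).sum
  val threshold  = (fraction * totalBytes).toLong

  // Cumulative bytes *before* each file: keep a file while the running total
  // is still below the threshold.
  val cumulativeBefore = files.scanLeft(0L)(_ + _.sizeInBytes)
  files.zip(cumulativeBefore).takeWhile { case (_, bytes) => bytes < threshold }.map(_._1)
}
```

With `fraction = 1.0` the threshold equals the total unindexed bytes, so every file is selected; smaller fractions stop the selection once enough bytes have been accumulated.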
Checklist:
Here is the list of things you should do before submitting this pull request:
- [x] New feature / bug fix has been committed following the Contribution guide.
- [x] Add logging to the code following the Contribution guide.
- [x] Add comments to the code (make it easier for the community!).
- [ ] Change the documentation.
- [x] Add tests.
- [x] Your branch is updated to the main branch (dependent changes have been merged).
How Has This Been Tested? (Optional)
This has been tested locally with `QbeastOptimizationIntegrationTest`.
I've added five cases:
- Optimization of a Converted Table. (All data is unindexed).
- Optimization of a Hybrid Table after Append. (Some data is unindexed after an external append).
- Optimization of a Hybrid Table after Delete. (Some data is deleted and consequently unindexed).
- Optimization of a Hybrid Table after Update. (Some data is updated and consequently unindexed).
- Optimization of a fraction of a Hybrid Table. (Do not optimize all the Unindexed Files at once).
Test Configuration:
- Spark Version: 3.5.0
- Hadoop Version: 3.3.4
- Cluster or local? Local