qbeast-spark

Optimization of the Unindexed Files [Staging Area]

Open osopardo1 opened this issue 10 months ago • 5 comments

Qbeast Spark supports reading files that were not indexed with Qbeast Metadata. Several situations can leave a table in a hybrid state:

  • Different set of writers. One writes with Qbeast, the others with Delta. All writers can commit new files to the transaction log. Those files written as Delta will not contain any Qbeast Metadata.
  • Old Table in Delta or Parquet converted to Qbeast. When we execute the Convert To Qbeast command, we are just adding a single metadata commit to the table, without rewriting or analyzing any of the existing files.
  • Deletes and Updates. If a Table receives an Update or Delete operation and uses the default Copy on Write strategy, the operation will create new files that are not indexed.

The current behavior is to ignore the non-indexed files when reading and writing, which disables part of the Sampling capabilities and reduces the precision when estimating the index. In addition, optimization does not select the files in this "staging area" for any rearrangement operation.
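To make the "staging area" concrete, here is a minimal sketch of how the files of a hybrid table could be split into indexed and non-indexed subsets. It is not the qbeast-spark implementation: `AddFile` is a simplified stand-in for a Delta log `add` action, and the `revision` tag key, `split_staging` and `staging_fraction` are hypothetical names used only for illustration. The assumption is that an indexed file carries Qbeast Metadata in its tags and a staging file does not.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical tag key: the real Qbeast Metadata keys may be named differently.
REVISION_TAG = "revision"


@dataclass
class AddFile:
    """Simplified stand-in for a Delta log `add` action (illustration only)."""
    path: str
    size: int
    tags: Optional[Dict[str, str]] = None  # files written by plain Delta may have no tags


def is_indexed(f: AddFile) -> bool:
    # Assumption: a file counts as indexed when it carries the Qbeast tag.
    return REVISION_TAG in (f.tags or {})


def split_staging(files: List[AddFile]) -> Tuple[List[AddFile], List[AddFile]]:
    """Return (indexed, staging), where staging holds the non-indexed files."""
    indexed = [f for f in files if is_indexed(f)]
    staging = [f for f in files if not is_indexed(f)]
    return indexed, staging


def staging_fraction(files: List[AddFile]) -> float:
    """Fraction of the table's bytes that sits in the staging area."""
    total = sum(f.size for f in files)
    if total == 0:
        return 0.0
    _, staging = split_staging(files)
    return sum(f.size for f in staging) / total
```

A metric like `staging_fraction` could be one input when deciding whether the staging area is large enough to be worth optimizing, since the bytes it covers are the ones currently skipped by the index.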

This issue is meant to record and analyze which approach is best to follow when optimizing the Non-Indexed files.

osopardo1 • Mar 26 '24 10:03