qbeast-spark Add AutoIndexing

Qbeast Format allows indexing multiple columns to improve the file layout. Right now, the user must explicitly specify the set of columns they want to index through the columnsToIndex parameter.

The idea is to stop requiring this parameter in the configuration and to choose the most valuable columns through a Correlation Matrix Analysis. We will order the columns from lowest to highest average correlation and then keep the top n columns as the columnsToIndex.
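The ordering described above can be sketched in plain Python. This is only an illustration, not the code merged into qbeast-spark: the function names are hypothetical, the use of the absolute Pearson correlation is an assumption, and "top n" is read here as the n columns with the lowest average correlation.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equally sized numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; treat it as 0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def average_abs_correlation(columns: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Average absolute correlation of each column against all the others."""
    names = list(columns)
    averages = {}
    for name in names:
        others = [other for other in names if other != name]
        if not others:
            averages[name] = 0.0
            continue
        total = sum(abs(pearson(columns[name], columns[other])) for other in others)
        averages[name] = total / len(others)
    return averages


def choose_columns_to_index(columns: Dict[str, Sequence[float]], max_columns: int) -> List[str]:
    """Order columns from lowest to highest average correlation and keep the first n.

    Assumption: the least correlated columns are the ones worth indexing.
    Ties are broken by column name so the result is deterministic.
    """
    averages = average_abs_correlation(columns)
    ordered = sorted(averages, key=lambda name: (averages[name], name))
    return ordered[:max_columns]


if __name__ == "__main__":
    sample = {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
        "c": [5, 1, 4, 2, 3],
    }
    print(choose_columns_to_index(sample, max_columns=2))
```

In the sample, "a" and "b" carry the same information, so only one of them is kept alongside the weakly correlated "c".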

There are two cases in which we should integrate the code for calling the AutoIndexer:

If no columnsToIndex is specified. By default, use a correlation matrix to choose the columns that contain between 50% and 70% of the variance.
If the option spark.qbeast.index.autoIndexerEnabled (the name is only a suggestion and can be changed) is set to true.

In either of those cases, the pipeline in IndexedTable.save should work as follows:

Check whether columnsToIndex is empty or spark.qbeast.index.autoIndexerEnabled is set to true.
If so, call the SparkAutoIndexer to choose the columns to index. The number of columns to index should be set through either an option or a default parameter (spark.qbeast.index.maxColumnsToIndex). [TBD]
Create a new Revision with the output columns from the AutoIndexer. No other workflow is touched.
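The decision flow above could look roughly like the following sketch. It is written in Python for brevity rather than in the project's Scala, and everything except the two option names quoted in this issue is hypothetical: the function names, the `auto_indexer` callback, and the default of 3 columns (the issue leaves that value TBD).

```python
from typing import Callable, List, Mapping, Optional, Sequence

# Option names as proposed in this issue; they may change.
AUTO_INDEXER_ENABLED = "spark.qbeast.index.autoIndexerEnabled"
MAX_COLUMNS_TO_INDEX = "spark.qbeast.index.maxColumnsToIndex"

# Hypothetical default, the issue does not fix a value.
DEFAULT_MAX_COLUMNS = 3


def should_auto_index(columns_to_index: Sequence[str], conf: Mapping[str, str]) -> bool:
    """True when no columns were given or the auto indexer option is set to true."""
    enabled = conf.get(AUTO_INDEXER_ENABLED, "false").strip().lower() == "true"
    return not columns_to_index or enabled


def resolve_columns_to_index(
    columns_to_index: Sequence[str],
    conf: Mapping[str, str],
    auto_indexer: Callable[[int], List[str]],
    max_columns_option: Optional[int] = None,
) -> List[str]:
    """Return the columns that the new Revision should be created with."""
    if not should_auto_index(columns_to_index, conf):
        # Explicit columns and auto indexing disabled: nothing changes.
        return list(columns_to_index)
    if max_columns_option is not None:
        # An explicit write option wins over the configuration value.
        max_columns = max_columns_option
    else:
        max_columns = int(conf.get(MAX_COLUMNS_TO_INDEX, DEFAULT_MAX_COLUMNS))
    return auto_indexer(max_columns)


if __name__ == "__main__":
    # Stand-in for the real auto indexer: returns the first n of a fixed ranking.
    ranking = ["price", "user_id", "category", "brand"]
    fake_auto_indexer = lambda n: ranking[:n]

    print(resolve_columns_to_index(["event_time"], {}, fake_auto_indexer))
    print(resolve_columns_to_index([], {MAX_COLUMNS_TO_INDEX: "2"}, fake_auto_indexer))
```

Keeping the choice in a single function means the rest of the save path only sees a resolved list of columns, which matches the note that no other workflow is touched.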

We should:

[x] Determine the variance percentage required to choose a column.
[x] Add the Correlation Matrix code to qbeast-spark.
[x] Integrate with existing API.
[x] Add tests.

Dec 04 '23 08:12 osopardo1

FYI: we don't use the full PCA. We just partly analyze which columns contain higher variance, without having to create new mapping columns in the dataframe.

Dec 04 '23 10:12 osopardo1

I will add the corresponding skeleton to call the PCA method, and tomorrow we can work on integrating @SrTangente's code.

Dec 04 '23 16:12 osopardo1

WIP, we use a correlation matrix to order the columns from lowest to highest average correlation and then keep the top n columns.

Dec 11 '23 08:12 SrTangente

WIP, we use a correlation matrix to order the columns from lowest to highest average correlation and then keep the top n columns.

Description updated, thanks!

Dec 11 '23 14:12 osopardo1

This feature is merged into 1.0.0-main. It will be added to main when the release is made.

Jan 08 '24 13:01 osopardo1

Merged in https://github.com/Qbeast-io/qbeast-spark/pull/284

Mar 27 '24 13:03 osopardo1