Add AutoIndexing
Qbeast Format allows indexing multiple columns to improve the file layout. Right now, the user must explicitly specify the set of columns to index through the `columnsToIndex` parameter.
The idea is to stop requiring this parameter in the configuration and instead choose the most valuable columns through a Correlation Matrix Analysis. We will order the columns from lowest to highest average correlation and then take the top n columns as the `columnsToIndex`.
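The ranking described above can be sketched in plain Python. This is only an illustration of the idea, not the qbeast-spark code: the function names `pearson` and `choose_columns_to_index` are hypothetical, the input is assumed to be a small in-memory dict of numeric columns, and using the absolute value of the correlation is an assumption the description does not state.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: correlation is undefined for a constant column,
        # so 0.0 is used here as a placeholder.
        return 0.0
    return cov / sqrt(var_x * var_y)


def choose_columns_to_index(data, max_columns):
    """Rank columns by average absolute correlation with the other
    columns (lowest first) and keep the first `max_columns`.

    data: dict mapping column name -> list of numeric values.
    """
    names = list(data)
    avg_corr = {}
    for name in names:
        others = [abs(pearson(data[name], data[o])) for o in names if o != name]
        avg_corr[name] = sum(others) / len(others) if others else 0.0
    ranked = sorted(names, key=lambda c: avg_corr[c])
    return ranked[:max_columns]


sample = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
    "c": [5, 1, 4, 2, 3],   # weakly correlated with both
}
selected = choose_columns_to_index(sample, 2)
```

In this sample, `a` and `b` carry the same information, so the weakly correlated `c` ranks first. On a real table the correlation matrix would be computed by Spark rather than in driver-side Python.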
There are two cases in which we should integrate the code for calling the AutoIndexer:
- If no `columnsToIndex` is specified. Use a correlation matrix as the default strategy to choose the columns that contain between 50% and 70% of the variance.
- If the option `spark.qbeast.index.autoIndexerEnabled` (we can change the name, this is just a suggestion) is set to true.
In either of those cases, the pipeline in `IndexedTable.save` should work as follows:
- Check if `columnsToIndex` is empty / `spark.qbeast.index.autoIndexerEnabled` is set.
- If the result is positive, call `SparkAutoIndexer` code to choose the columns to index. The number of columns indexed should be specified with either an option or a default parameter (`spark.qbeast.index.maxColumnsToIndex`). [TBD]
- Create a new `Revision` with the output columns from the AutoIndexer. No other workflow is touched.
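The decision step of this pipeline can be sketched as follows. This is a language-neutral illustration in Python, not the `IndexedTable.save` code: `resolve_columns_to_index`, the flat `options` dict, the `auto_indexer` callable, and the default of 3 columns are all assumptions made for the example, and the option names are the tentative ones proposed in this issue.

```python
# Hypothetical default; the issue leaves the real value as [TBD].
DEFAULT_MAX_COLUMNS = 3


def resolve_columns_to_index(options, column_names, auto_indexer):
    """Return the columns that the new Revision should index.

    options:      dict of write options / config values (assumed flat).
    column_names: all candidate columns of the DataFrame.
    auto_indexer: callable (column_names, max_columns) -> list of columns,
                  standing in for the SparkAutoIndexer.
    """
    raw = options.get("columnsToIndex", "")
    explicit = [c.strip() for c in raw.split(",") if c.strip()]
    auto_enabled = (
        str(options.get("spark.qbeast.index.autoIndexerEnabled", "false")).lower()
        == "true"
    )

    # Explicit columns are used only when the AutoIndexer is not requested.
    if explicit and not auto_enabled:
        return explicit

    max_columns = int(
        options.get("spark.qbeast.index.maxColumnsToIndex", DEFAULT_MAX_COLUMNS)
    )
    return auto_indexer(column_names, max_columns)


def first_n(columns, n):
    """Stub auto-indexer for the example: keeps the first n columns."""
    return columns[:n]


cols = ["a", "b", "c", "d"]
explicit_case = resolve_columns_to_index({"columnsToIndex": "a,b"}, cols, first_n)
default_case = resolve_columns_to_index({}, cols, first_n)
```

Passing the auto-indexer in as a callable keeps the decision logic separate from the column-selection strategy, which matches the note that no other workflow is touched.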
We should:
- [x] Determine which is the variance percentage desired to choose a column.
- [x] Add the Correlation Matrix code to qbeast-spark.
- [x] Integrate with existing API.
- [x] Add tests.
FYI: we don't use the full PCA. We only analyze which columns contain higher variance, without having to create new mapping columns in the DataFrame.
I will add the corresponding skeleton to call the PCA method, and tomorrow we can work on integrating @SrTangente's code.
WIP: we use a correlation matrix to order the columns from lowest to highest average correlation and then take the top n columns.
Description updated, thanks!
This feature is merged into 1.0.0-main. It will be added to main when the release is made.
Merged in https://github.com/Qbeast-io/qbeast-spark/pull/284