qbeast-spark

Unclear behaviour of SparkColumnsToIndexSelector when DataFrame is empty

Open osopardo1 opened this issue 10 months ago • 1 comments

What went wrong?

When auto indexing is enabled, we call SparkColumnsToIndexSelector to choose the best columns for grouping the data.

This selection is based on statistics and correlations of the data itself, but if no data is provided, the current default behavior is to select the first N columns of the schema.
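
To make the fallback concrete, here is a minimal sketch of the behaviour described above. It is not the actual qbeast-spark code: `select_columns_to_index`, `rank_by_statistics`, and `max_columns` are hypothetical names, and the data is modelled as a plain list of rows rather than a Spark DataFrame.

```python
from typing import Callable, List, Sequence


def select_columns_to_index(
    schema_columns: Sequence[str],
    rows: Sequence[dict],
    max_columns: int,
    rank_by_statistics: Callable[[Sequence[str], Sequence[dict]], List[str]],
) -> List[str]:
    """Hypothetical model of the selector's current behaviour."""
    if not rows:
        # No data means no statistics or correlations to compute,
        # so the selection falls back to schema order.
        return list(schema_columns[:max_columns])
    # With data, a ranking function (assumed, supplied by the caller)
    # orders the columns and the top ones are kept.
    return rank_by_statistics(schema_columns, rows)[:max_columns]
```

With an empty input the result depends only on column order in the schema, which is why the choice can be arbitrary.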

We should decide whether that makes sense and define the minimum number of columns to index.

osopardo1, Mar 27 '24 10:03

After some discussion, we agreed that, if the DataFrame is empty, it makes little sense to use AutoIndexing right away. The code should wait until some data is written before activating the feature.
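
A rough sketch of what that deferral could look like, assuming the caller knows the row count of the DataFrame being written. `columns_to_index_or_defer` and `auto_select` are hypothetical names for illustration, not part of the qbeast-spark API.

```python
from typing import Callable, List, Optional, Sequence


def columns_to_index_or_defer(
    schema_columns: Sequence[str],
    row_count: int,
    auto_select: Callable[[Sequence[str]], List[str]],
) -> Optional[List[str]]:
    """Return the auto-selected columns, or None to signal 'not yet'."""
    if row_count == 0:
        # Empty DataFrame: skip auto indexing for now and let a later,
        # non-empty write trigger the selection.
        return None
    return auto_select(schema_columns)
```

Returning `None` instead of the first N columns leaves the decision open until there is data to base it on; how the caller handles the deferred case (for example, requiring explicit columns for the first write) is still to be defined.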

osopardo1, Apr 09 '24 06:04