qbeast-spark
Document Non-Deterministic Source Queries and Data Changing Sources
As a first solution for https://github.com/Qbeast-io/qbeast-spark/issues/466, we need to require users to add `columnStats` when indexing tables with any of the following characteristics:
- Underlying data source changes constantly.
- DataFrame contains non-deterministic columns to index.
- DataFrame contains non-deterministic predicates.
There are several ways to make the process succeed:
- Add `columnStats` if you are using a default/linear transformation. Using `columnStats` would infer the data's min/max values before the DataFrame analysis, which can produce inconsistent results when the DataFrame is loaded twice for indexing in any of the above use cases.
- For versions packaged after main, you can change the transformation type for the indexed columns to `quantiles`, which is more flexible than the default/linear transformation (not bounded by min/max, safe to write).
- Materialize the DataFrame before writing to Qbeast, either in memory if it is a small piece of data, or in the file system.
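
For the FAQ, the first and third options could be sketched in PySpark roughly as follows. This is a minimal sketch, not the project's reference usage: the helper names (`column_stats_json`, `qbeast_options`, `write_with_column_stats`, `materialize_then_write`), the example columns, and the staging path are hypothetical, and the `<column>_min` / `<column>_max` key layout for `columnStats` is assumed here and should be checked against the qbeast-spark version in use.

```python
import json


def column_stats_json(bounds):
    """Build a columnStats JSON string from {column: (min, max)}.

    Assumption: qbeast-spark reads bounds from keys named
    "<column>_min" and "<column>_max".
    """
    stats = {}
    for column, (lower, upper) in bounds.items():
        if lower > upper:
            raise ValueError(f"min is greater than max for column {column!r}")
        stats[f"{column}_min"] = lower
        stats[f"{column}_max"] = upper
    return json.dumps(stats)


def qbeast_options(bounds):
    """Writer options that index the given columns with explicit bounds."""
    return {
        "columnsToIndex": ",".join(bounds),
        "columnStats": column_stats_json(bounds),
    }


def write_with_column_stats(df, path, bounds):
    """Option 1: pass known min/max values so they are not derived from the data."""
    df.write.format("qbeast").options(**qbeast_options(bounds)).save(path)


def materialize_then_write(spark, df, staging_path, path, columns):
    """Option 3: freeze the data on the file system, then index the stable copy."""
    # Writing to Parquet evaluates non-deterministic columns and predicates once.
    df.write.mode("overwrite").parquet(staging_path)
    stable_df = spark.read.parquet(staging_path)
    (
        stable_df.write.format("qbeast")
        .option("columnsToIndex", ",".join(columns))
        .save(path)
    )
```

With an existing session, `write_with_column_stats(df, "/tmp/qbeast/table", {"price": (0, 100)})` would index `price` with fixed bounds. Staging to Parquet is used for the materialization helper because cached partitions can be evicted and recomputed, which matters when the source is non-deterministic.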
This procedure should be documented in some sort of FAQ.