
Document Non-Deterministic Source Queries and Data Changing Sources

Open osopardo1 opened this issue 10 months ago • 0 comments

As a first solution for https://github.com/Qbeast-io/qbeast-spark/issues/466, we need to require users to provide columnStats when indexing tables with any of the following characteristics:

  • Underlying data source changes constantly.
  • DataFrame contains non-deterministic columns to index.
  • DataFrame contains non-deterministic predicates.
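
To illustrate why these cases are a problem, here is a minimal plain-Python sketch (no Spark or Qbeast involved). `ChangingSource` is a hypothetical stand-in for a DataFrame that returns different rows each time it is evaluated, and the linear normalization formula is an assumption made for illustration, not taken from the Qbeast code base.

```python
class ChangingSource:
    """Hypothetical stand-in for a source whose contents differ on every load."""

    def __init__(self):
        self.loads = 0

    def load(self):
        # Each evaluation shifts the values, like a table that keeps changing
        # or a column built from a non-deterministic expression.
        self.loads += 1
        base = 10 * self.loads
        return [float(base + i) for i in range(5)]


source = ChangingSource()

# First load: the analysis step computes min/max for the column.
analysis_rows = source.load()
lo, hi = min(analysis_rows), max(analysis_rows)

# Second load: the write step re-evaluates the source and sees other rows.
write_rows = source.load()

# Assumed linear transformation: map each value into [0, 1] using lo/hi.
normalized = [(v - lo) / (hi - lo) for v in write_rows]

# Values from the second load fall outside the bounds found in the first.
escaped = [n for n in normalized if not 0.0 <= n <= 1.0]
print(f"bounds=({lo}, {hi}), values outside [0, 1]: {len(escaped)}")
```

Because the bounds come from one evaluation and the rows from another, every value written here falls outside the expected range.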

Any of the following approaches lets the write succeed:

  1. Add columnStats if you are using the default/linear transformation. Providing columnStats supplies each column's min/max values up front, so they do not have to be inferred during the DataFrame analysis. That inference is the step that can produce inconsistent results in any of the above use cases, because the DataFrame is loaded twice for indexing.
  2. For versions packaged after main, you can change the transformation type of the indexed columns to quantiles, which is more flexible than the default/linear transformation: it is not bounded by min/max, so it is safe to write.
  3. Materialize the DataFrame before writing to Qbeast, either in memory if the data is small, or in the file system.
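
As a rough sketch of the first option, the helper below builds a columnStats value from bounds you already know. It assumes the option takes a JSON string with `<column>_min` / `<column>_max` keys, which is how I recall the Qbeast documentation describing it; check this against the version you run. `build_column_stats`, the column names, and the output path are hypothetical.

```python
import json


def build_column_stats(bounds):
    """Turn {column: (min, max)} into a JSON string for the columnStats option.

    The "<column>_min" / "<column>_max" key naming is an assumption to verify
    against the qbeast-spark version in use.
    """
    stats = {}
    for column, (lo, hi) in bounds.items():
        if lo > hi:
            raise ValueError(f"min is greater than max for column {column!r}")
        stats[f"{column}_min"] = lo
        stats[f"{column}_max"] = hi
    return json.dumps(stats)


# Hypothetical columns with bounds known ahead of time.
column_stats = build_column_stats({"price": (0.0, 500.0), "user_id": (1, 100000)})
print(column_stats)

# Intended use with PySpark (not executed here; `df` and the path are placeholders):
# (df.write.format("qbeast")
#    .option("columnsToIndex", "price,user_id")
#    .option("columnStats", column_stats)
#    .save("/tmp/qbeast/example_table"))
```

The bounds should be wide enough to cover any value the source can produce at write time. For the third option, the equivalent step is to persist the DataFrame, or write it to an intermediate location and read it back, before the Qbeast write.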

These options should be documented, for example in an FAQ.

osopardo1 • Jan 28 '25 07:01