qbeast-spark
Using different hash seed for each revision
What went wrong?
I ran into an issue that cost me a lot of time. I was trying to create a new table from a sample of a larger one, but when I indexed this dataset with Qbeast, all the index metrics were off, with cubes far larger than they should be. After a while I realized the problem: because the table was built from a sample, it only contained the data points whose hash was below a given threshold. Building a new index on the same columns therefore produced a very unbalanced distribution of random weights and a deformed index.
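The effect is easy to see with a toy simulation (a hedged sketch, not Qbeast's actual implementation): if each record's weight is a deterministic, seeded hash of its indexed columns, then a sample taken by keeping weights below a threshold will reproduce exactly those sub-threshold weights when re-hashed with the same seed, while a different seed restores a uniform spread.

```python
# Toy sketch of deterministic hash weights (illustrative, not Qbeast's code).
import hashlib

def weight(record: str, seed: int = 0) -> float:
    """Map a record to a deterministic weight in [0, 1) via a seeded hash."""
    digest = hashlib.sha256(f"{seed}:{record}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

records = [f"row-{i}" for i in range(10_000)]

# A 1% sample keeps only the records whose weight falls below the threshold.
sample = [r for r in records if weight(r) < 0.01]

# Re-indexing the sample with the SAME seed reproduces the same weights:
# every one is below 0.01, so the distribution is badly skewed.
max_weight_same_seed = max(weight(r) for r in sample)

# With a different seed the weights are spread over [0, 1) again.
max_weight_new_seed = max(weight(r, seed=42) for r in sample)

print(max_weight_same_seed, max_weight_new_seed)
```

Here the skew is total: under the original seed, no sampled record can ever receive a weight above the sampling threshold.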
How to reproduce?
```scala
spark.sql("SELECT * FROM table_qbeast TABLESAMPLE (1 PERCENT)")
  .write.format("qbeast")
  .option("columnsToIndex", "a,b")
  .saveAsTable("table_qbeast_2")

val qt = QbeastTable.forPath(spark, "path/to/table_qbeast_2")
println(qt.getIndexMetrics())
```
Possible solution
A possible way to avoid this problem would be to pick a random hash seed for each new revision, so that every new write generates a fresh set of random weights. However, we would have to consider how this affects migrating data from an old revision to a new one.
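The proposal can be sketched as follows (names like `Revision` and its fields are illustrative, not Qbeast's API): each revision stores its own randomly drawn seed, so weights assigned during a new write are independent of any earlier revision's weights.

```python
# Hedged sketch of a per-revision hash seed (illustrative names, not Qbeast's API).
import hashlib
import random

class Revision:
    def __init__(self, revision_id: int):
        self.revision_id = revision_id
        # A fresh random seed per revision decorrelates its weights
        # from those assigned by every previous revision.
        self.seed = random.getrandbits(64)

    def weight(self, record: str) -> float:
        digest = hashlib.sha256(f"{self.seed}:{record}".encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

old, new = Revision(1), Revision(2)
record = "row-123"
print(old.weight(record), new.weight(record))
```

The migration cost this implies: data moved from an old revision to a new one must have its weights recomputed under the new seed, since weights from different seeds are not comparable.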