qbeast-spark
Using different hash seed for each revision
What went wrong?
I ran into an issue that cost me a lot of time. I was trying to create a new table from a sample of a larger one, but when I indexed this dataset with Qbeast, all the index metrics were off, with cubes far larger than they should be. After a while I realized the problem: because the table was built from a sample, it only contained the data points whose hash was below a given threshold. Building a new index on the same columns therefore produced a very unbalanced distribution of random weights and a deformed index.
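The effect is easy to see with a toy simulation (a hedged sketch, not Qbeast's actual implementation): if each record's weight is a deterministic, seeded hash of its indexed columns, then a sample taken by keeping weights below a threshold will reproduce exactly those sub-threshold weights when re-hashed with the same seed, while a different seed restores a uniform spread.

```python
# Toy sketch of deterministic hash weights (illustrative, not Qbeast's code).
import hashlib

def weight(record: str, seed: int = 0) -> float:
    """Map a record to a deterministic weight in [0, 1) via a seeded hash."""
    digest = hashlib.sha256(f"{seed}:{record}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

records = [f"row-{i}" for i in range(10_000)]

# A 1% sample keeps only the records whose weight falls below the threshold.
sample = [r for r in records if weight(r) < 0.01]

# Re-indexing the sample with the SAME seed reproduces the same weights:
# every one is below 0.01, so the distribution is badly skewed.
max_weight_same_seed = max(weight(r) for r in sample)

# With a different seed the weights are spread over [0, 1) again.
max_weight_new_seed = max(weight(r, seed=42) for r in sample)

print(max_weight_same_seed, max_weight_new_seed)
```

Here the skew is total: under the original seed, no sampled record can ever receive a weight above the sampling threshold.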
How to reproduce?
```scala
spark.sql("SELECT * FROM table_qbeast TABLESAMPLE (1 PERCENT)")
  .write.format("qbeast")
  .option("columnsToIndex", "a,b")
  .saveAsTable("table_qbeast_2")

val qt = QbeastTable.forPath(spark, "path/to/table_qbeast_2")
println(qt.getIndexMetrics())
```
Possible solution
A possible way to avoid this problem would be to pick a random hash seed for each new revision, so that every new write generates a fresh set of random weights. However, we would have to consider how this affects migrating data from an old revision to a new one.
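The proposal can be sketched as follows (names like `Revision` and its fields are illustrative, not Qbeast's API): each revision stores its own randomly drawn seed, so weights assigned during a new write are independent of any earlier revision's weights.

```python
# Hedged sketch of a per-revision hash seed (illustrative names, not Qbeast's API).
import hashlib
import random

class Revision:
    def __init__(self, revision_id: int):
        self.revision_id = revision_id
        # A fresh random seed per revision decorrelates its weights
        # from those assigned by every previous revision.
        self.seed = random.getrandbits(64)

    def weight(self, record: str) -> float:
        digest = hashlib.sha256(f"{self.seed}:{record}".encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

old, new = Revision(1), Revision(2)
record = "row-123"
print(old.weight(record), new.weight(record))
```

The migration cost this implies: data moved from an old revision to a new one must have its weights recomputed under the new seed, since weights from different seeds are not comparable.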