qbeast-spark
Add min-max column information
Right now we add block information such as cube, weight and state to the delta commit log:
```scala
val tags = Map(
  cubeTag -> cube,
  weightMinTag -> minWeight.toString,
  weightMaxTag -> maxWeight.toString,
  stateTag -> state,
  spaceTag -> JsonUtils.toJson(cubeTransformation.transformations),
  indexedColsTag -> columnsToIndex.mkString(","),
  elementCountTag -> rowCount.toString)
```
For data skipping to be optimal, we may need to collect information on other columns of interest (e.g., the indexed columns). The Qbeast reading protocol can benefit from these stats in order to skip blocks that are not necessary for the query.
Just to update and clarify this issue: the information stored could be something like a minValue and maxValue per column. The approach could be similar to the one used in #30, updating the values while writing rows into blocks:
https://github.com/Qbeast-io/qbeast-spark/blob/b72168450a085f856fd8125f58e700944bf78508/src/main/scala/io/qbeast/spark/sql/qbeast/BlockStats.scala#L30-L31
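A minimal sketch of what that accumulation could look like, assuming a hypothetical per-block stats holder updated row by row (the names `ColumnMinMax`, `BlockColumnStats` and the tag keys are illustrative, not the actual BlockStats API):

```scala
// Hypothetical sketch: per-column min/max accumulated while writing a block.
case class ColumnMinMax(min: Double, max: Double) {
  def update(value: Double): ColumnMinMax =
    ColumnMinMax(math.min(min, value), math.max(max, value))
}

case class BlockColumnStats(stats: Map[String, ColumnMinMax]) {
  // Called for each written row; `row` maps indexed column name -> value.
  def update(row: Map[String, Double]): BlockColumnStats =
    BlockColumnStats(row.foldLeft(stats) { case (acc, (col, value)) =>
      acc.updated(col, acc.get(col).map(_.update(value)).getOrElse(ColumnMinMax(value, value)))
    })

  // Serialized into the commit log tags, e.g. "minValue_<col>" / "maxValue_<col>".
  def toTags: Map[String, String] =
    stats.flatMap { case (col, mm) =>
      Map(s"minValue_$col" -> mm.min.toString, s"maxValue_$col" -> mm.max.toString)
    }
}
```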
UPDATE
With the release of Delta Lake v1.2.0, data skipping using column statistics is supported. That means statistical information about the columns is gathered in order to perform a finer-grained data skipping.
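For context, a hedged sketch of how this looks on the Delta side, assuming the standard `delta.dataSkippingNumIndexedCols` table property controls how many leading columns get statistics; the table name and columns below are made-up examples:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-stats-example").getOrCreate()

// Illustrative table: collect statistics only on the first 3 columns.
spark.sql("""
  CREATE TABLE IF NOT EXISTS events (user_id LONG, ts TIMESTAMP, payload STRING)
  USING delta
  TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '3')
""")

// Delta stores per-file minValues/maxValues/nullCount in the transaction log
// and uses them to prune files when a query has predicates on those columns:
spark.sql("SELECT count(*) FROM events WHERE user_id = 42").show()
```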
Aside from https://github.com/Qbeast-io/qbeast-spark/issues/98, this is another major improvement that is relevant to the qbeast-spark project. In this case, we should:
- Upgrade to the newest version of Delta.
- Solve compatibility problems.
- Understand which statistics are gathered and which are missing.
- Understand how those statistics are used for file/data skipping.
- Possibility: use generated columns for the weight min/max. Would it make sense? What are the limitations? (See the sketch after this list.)
- Implement new functionalities if needed.
- Add tests
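On the generated-columns idea from the list above, a hedged sketch of how Delta's column builder could be used, assuming an active SparkSession `spark`; the table layout and the idea of deriving a coarse weight bucket are assumptions for illustration, not an agreed design:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.types._

// Hypothetical: expose the qbeast weight as a regular column and derive a coarse
// bucket from it with a generated column, so Delta can collect min/max stats on it.
DeltaTable.createIfNotExists(spark)
  .tableName("qbeast_blocks_demo")
  .addColumn("id", LongType)
  .addColumn("weight", IntegerType)
  .addColumn(
    DeltaTable.columnBuilder("weight_bucket")
      .dataType(IntegerType)
      .generatedAlwaysAs("CAST(weight / 1000000 AS INT)")
      .build())
  .execute()
```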
Merged in #139