qbeast-spark
Add min-max column information
Right now we add block information such as cube, weight and state to the delta commit log:
```scala
val tags = Map(
  cubeTag -> cube,
  weightMinTag -> minWeight.toString,
  weightMaxTag -> maxWeight.toString,
  stateTag -> state,
  spaceTag -> JsonUtils.toJson(cubeTransformation.transformations),
  indexedColsTag -> columnsToIndex.mkString(","),
  elementCountTag -> rowCount.toString)
```
For data skipping to be optimal, we may need to collect information on other columns of interest (e.g., the indexed columns). The Qbeast reading protocol can benefit from these stats in order to skip blocks that are not necessary for the query.
Just to update and clarify this issue: the information stored could be something like a minValue and maxValue per column. The approach could be similar to the one used in #30, updating the values while writing rows into blocks:
https://github.com/Qbeast-io/qbeast-spark/blob/b72168450a085f856fd8125f58e700944bf78508/src/main/scala/io/qbeast/spark/sql/qbeast/BlockStats.scala#L30-L31
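A minimal sketch of what that accumulation could look like, assuming a hypothetical per-block stats holder updated row by row (the names `ColumnMinMax`, `BlockColumnStats` and the tag keys are illustrative, not the actual BlockStats API):

```scala
// Hypothetical sketch: per-column min/max accumulated while writing a block.
case class ColumnMinMax(min: Double, max: Double) {
  def update(value: Double): ColumnMinMax =
    ColumnMinMax(math.min(min, value), math.max(max, value))
}

case class BlockColumnStats(stats: Map[String, ColumnMinMax]) {
  // Called for each written row; `row` maps indexed column name -> value.
  def update(row: Map[String, Double]): BlockColumnStats =
    BlockColumnStats(row.foldLeft(stats) { case (acc, (col, value)) =>
      acc.updated(col, acc.get(col).map(_.update(value)).getOrElse(ColumnMinMax(value, value)))
    })

  // Serialized into the commit log tags, e.g. "minValue_<col>" / "maxValue_<col>".
  def toTags: Map[String, String] =
    stats.flatMap { case (col, mm) =>
      Map(s"minValue_$col" -> mm.min.toString, s"maxValue_$col" -> mm.max.toString)
    }
}
```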
UPDATE
With the release of Delta Lake v1.2.0, data skipping using column statistics is supported. That means statistical information about the columns is gathered in order to perform a finer-grained data skipping.
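For context, a hedged sketch of how this looks on the Delta side, assuming the standard `delta.dataSkippingNumIndexedCols` table property controls how many leading columns get statistics; the table name and columns below are made-up examples:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-stats-example").getOrCreate()

// Illustrative table: collect statistics only on the first 3 columns.
spark.sql("""
  CREATE TABLE IF NOT EXISTS events (user_id LONG, ts TIMESTAMP, payload STRING)
  USING delta
  TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '3')
""")

// Delta stores per-file minValues/maxValues/nullCount in the transaction log
// and uses them to prune files when a query has predicates on those columns:
spark.sql("SELECT count(*) FROM events WHERE user_id = 42").show()
```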
Aside from https://github.com/Qbeast-io/qbeast-spark/issues/98, this is another major improvement that is relevant to the qbeast-spark project. In this case, we should:
- Upgrade to the newest version of Delta.
- Solve compatibility problems.
- Understand which statistics are gathered and which are missing.
- Understand how those statistics are used for file/data skipping.
- Possibility: use generated columns for the weight min/max. Would it make sense? What are the limitations? (See the sketch after this list.)
- Implement new functionalities if needed.
- Add tests
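On the generated-columns idea from the list above, a hedged sketch of how Delta's column builder could be used, assuming an active SparkSession `spark`; the table layout and the idea of deriving a coarse weight bucket are assumptions for illustration, not an agreed design:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.types._

// Hypothetical: expose the qbeast weight as a regular column and derive a coarse
// bucket from it with a generated column, so Delta can collect min/max stats on it.
DeltaTable.createIfNotExists(spark)
  .tableName("qbeast_blocks_demo")
  .addColumn("id", LongType)
  .addColumn("weight", IntegerType)
  .addColumn(
    DeltaTable.columnBuilder("weight_bucket")
      .dataType(IntegerType)
      .generatedAlwaysAs("CAST(weight / 1000000 AS INT)")
      .build())
  .execute()
```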
Merged in #139