qbeast-spark

Add min-max column information

Open osopardo1 opened this issue 4 years ago • 2 comments

Right now we add block information on several metrics, such as cube, weight and state, to the Delta commit log:

val tags = Map(
  cubeTag -> cube,
  weightMinTag -> minWeight.toString,
  weightMaxTag -> maxWeight.toString,
  stateTag -> state,
  spaceTag -> JsonUtils.toJson(cubeTransformation.transformations),
  indexedColsTag -> columnsToIndex.mkString(","),
  elementCountTag -> rowCount.toString)

For data skipping to be optimal, we may need to collect information on other columns of interest (for example, the indexed columns). The Qbeast reading protocol can benefit from these stats to skip blocks that are not needed for the query.
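To make the skipping idea concrete, here is a minimal sketch of how per-block min/max values could be used to prune blocks for a range filter. All names here (`ColumnRange`, `BlockMeta`, `MinMaxSkipping`) are hypothetical and are not part of qbeast-spark; values are assumed to be numeric for simplicity.

```scala
// Hypothetical types for illustration only; not the qbeast-spark API.
final case class ColumnRange(min: Double, max: Double)

// Per-block metadata: file path plus optional min/max per column.
final case class BlockMeta(path: String, stats: Map[String, ColumnRange])

object MinMaxSkipping {

  /** True if the block may contain rows with lower <= column <= upper. */
  def mayMatch(block: BlockMeta, column: String, lower: Double, upper: Double): Boolean =
    block.stats.get(column) match {
      // A block can be skipped only when its range and the query range are disjoint.
      case Some(ColumnRange(min, max)) => max >= lower && min <= upper
      // Without stats for the column we must read the block to stay correct.
      case None => true
    }

  /** Keeps only the blocks that cannot be ruled out by their min/max values. */
  def prune(blocks: Seq[BlockMeta], column: String, lower: Double, upper: Double): Seq[BlockMeta] =
    blocks.filter(mayMatch(_, column, lower, upper))
}
```

Note that a block with no stats for the filtered column is always kept, so missing stats cost performance but never correctness.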

osopardo1, Aug 17 '21 09:08

Just to update and clarify this issue: the information stored could be something like minValue and maxValue per column. The solution could follow the same approach we used in #30, updating the values as rows are written into blocks: https://github.com/Qbeast-io/qbeast-spark/blob/b72168450a085f856fd8125f58e700944bf78508/src/main/scala/io/qbeast/spark/sql/qbeast/BlockStats.scala#L30-L31
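As a rough sketch of that approach, the running min/max per column could be folded in row by row while a block is written, and then rendered as string tags. The names below (`ColumnMinMax`, `BlockColumnStats`, the `min.<column>` / `max.<column>` tag keys) are assumptions made for illustration; they do not reflect the actual `BlockStats` code, and only numeric columns are handled.

```scala
// Hypothetical sketch; names and tag keys are illustrative, not the BlockStats API.
final case class ColumnMinMax(min: Double, max: Double) {
  /** Returns the stats widened to include `value`. */
  def update(value: Double): ColumnMinMax =
    ColumnMinMax(math.min(min, value), math.max(max, value))
}

final case class BlockColumnStats(columns: Map[String, ColumnMinMax] = Map.empty) {

  /** Folds one row (column name -> numeric value) into the running stats. */
  def update(row: Map[String, Double]): BlockColumnStats =
    BlockColumnStats(row.foldLeft(columns) { case (acc, (name, value)) =>
      // The first value seen for a column initialises both bounds.
      val next = acc.get(name).map(_.update(value)).getOrElse(ColumnMinMax(value, value))
      acc.updated(name, next)
    })

  /** Renders the stats as string tags, similar in spirit to the existing block tags. */
  def toTags: Map[String, String] =
    columns.flatMap { case (name, s) =>
      Seq(s"min.$name" -> s.min.toString, s"max.$name" -> s.max.toString)
    }
}
```

Usage would look like `rows.foldLeft(BlockColumnStats())(_.update(_)).toTags`, with the resulting entries merged into the tags map when the block is closed.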

eavilaes, Oct 28 '21 14:10

UPDATE

With the release of Delta v1.2.0, Delta Lake includes support for data skipping using column statistics. That means statistical information about the columns is gathered to enable finer-grained data skipping.

Aside from https://github.com/Qbeast-io/qbeast-spark/issues/98, this is another major improvement that is relevant to the qbeast-spark project. In this case, we should:

Upgrade to the newest version of Delta.

  1. Solve compatibility problems.
  2. Understand which statistics are gathered and which are missing.
  3. Understand how those statistics are used for file/data skipping.
  4. Possibility: use generated columns to address the weight min/max. Would it make sense? What are the limitations?
  5. Implement new functionalities if needed.
  6. Add tests.

osopardo1, Apr 25 '22 09:04

Merged in #139

osopardo1, Jan 19 '23 15:01