qbeast-spark

Make Blocks addressable from the file reader

Open osopardo1 opened this issue 9 months ago • 0 comments

From v0.6.0 onwards, a Table is composed of files that contain multiple blocks, each of which can belong to the same cube or to different cubes. This is part of the Multiblock format, which allows Qbeast to balance the file layout without losing indexing benefits.

Currently, blocks help us locate a particular cube within a file, but a single block is not addressable or retrievable from the Spark reader. Although we use Delta File Skipping to discard data based on min/max statistics, we do not support filtering at this finer, block-level granularity when Sampling is applied.

This change requires some work regarding #175. DataSource V2 is more extensible and allows us to implement our own reader. In this case, the reader should be designed to skip entire groups of rows based on the block number.
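To make the idea concrete, here is a rough sketch of the selection step such a reader might perform. It is only an illustration and not the qbeast-spark implementation: the `Block` class, its field names, the one-block-per-row-group mapping, and the weights normalized to [0, 1] are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """Hypothetical per-block metadata; the field names are illustrative."""
    cube_id: str
    row_group: int      # assumption: each block maps to exactly one row group
    min_weight: float   # assumption: weights normalized to [0, 1]
    max_weight: float


def row_groups_for_sample(blocks: List[Block], fraction: float) -> List[int]:
    """Return the row groups a reader would need for a given sample fraction.

    A block is kept when its minimum weight does not exceed the fraction;
    every other block (and its group of rows) can be skipped entirely.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return sorted(b.row_group for b in blocks if b.min_weight <= fraction)


# One file holding three blocks from different cubes.
file_blocks = [
    Block("root", 0, 0.0, 0.1),
    Block("A", 1, 0.1, 0.4),
    Block("AA", 2, 0.4, 1.0),
]

# A 20% sample only needs the first two row groups; the third is skipped.
selected = row_groups_for_sample(file_blocks, 0.2)
```

With a block-addressable reader, only the row groups in `selected` would be decoded, instead of reading the whole file and filtering afterwards.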

PS: This is something that @alexeiakimov tried in previous issues, but other priorities took precedence.

TODOs:

  • [ ] Analyze how to make blocks addressable within a Parquet file
  • [ ] Implement DataSource V2 for Qbeast
  • [ ] Make a PoC
  • [ ] Develop the feature and test

osopardo1, Apr 25 '24 06:04