
Metadata time in queries with Qbeast Datasource is higher than expected

Open osopardo1 opened this issue 10 months ago • 0 comments

While investigating simple queries in the Spark UI, we detected that the Metadata time for the Qbeast datasource is higher than expected.

Here's a comparison of a small (10-element) dataset read with Parquet, Delta, and Qbeast:

Parquet

image

Delta

image

Qbeast

image

While Delta and Parquet spent only 2 ms on Metadata time, Qbeast spent 593 ms. This is for a small dataset, and the situation could get worse, especially in high-append scenarios.

I've checked the execution plan and the configuration, and there does not seem to be much difference aside from the file index used.

  • For Parquet, an InMemoryFileIndex is initialized.
  • For Delta, a PreparedDeltaFileIndex is initialized.
  • For Qbeast a DefaultFileIndex is initialized.
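
For anyone who wants to repeat the check, here is a minimal sketch of pulling the file index name out of a plan string. It assumes the scan node prints a `Location: <IndexClass>...` fragment, as Spark file scans usually do; the helper name `file_index_names` and the sample plan strings (including their paths) are made up for illustration and are not taken from the actual runs above.

```python
import re
from typing import List

# Matches the "Location: <IndexClass>" fragment printed for file scans.
# Assumption: the index class name directly follows "Location:".
_LOCATION_RE = re.compile(r"Location:\s*([A-Za-z_]\w*)")


def file_index_names(plan_text: str) -> List[str]:
    """Return the file index class names found in a plan string, in order."""
    return _LOCATION_RE.findall(plan_text)


# Illustrative plan fragments only; the paths and columns are hypothetical.
SAMPLE_PLANS = {
    "parquet": (
        "FileScan parquet [id#0] Batched: true, "
        "Location: InMemoryFileIndex(1 paths)[file:/tmp/example/parquet], "
        "ReadSchema: struct<id:int>"
    ),
    "delta": (
        "FileScan parquet [id#0] Batched: true, "
        "Location: PreparedDeltaFileIndex(1 paths)[file:/tmp/example/delta], "
        "ReadSchema: struct<id:int>"
    ),
    "qbeast": (
        "FileScan parquet [id#0] Batched: true, "
        "Location: DefaultFileIndex(1 paths)[file:/tmp/example/qbeast], "
        "ReadSchema: struct<id:int>"
    ),
}

if __name__ == "__main__":
    for fmt, plan in SAMPLE_PLANS.items():
        print(fmt, file_index_names(plan))
```

With a live session, the plan text would come from the output of `df.explain()` for each format instead of the hard-coded samples.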

Further investigation is needed. Will keep the conversation going on this issue.

osopardo1, Apr 23 '24 09:04