qbeast-spark

Make files without Metadata readable with Qbeast

Open osopardo1 opened this issue 3 years ago • 4 comments

To be more compatible with underlying Table Formats and set up an easier conversion to Qbeast, we should be able to process files that do not have any Qbeast Metadata on them.

For example

This is a File with Qbeast Metadata:

{
  "add": {
    "path": "d54ba0cd-c315-4388-9bce-fe573f5d0a64.parquet",
    ...
    "tags": {
      "state": "FLOODED",
      "cube": "gw",
      "revision": "1",
      "elementCount": "10836",
      "minWeight": "-1253864150",
      "maxWeight": "1254740128"
    }
  }
}

And this is a file without Qbeast Metadata:

{
  "add": {
    "path": "d54ba0cd-c315-4388-9bce-fe573f5d0a64.parquet",
    ...
    "tags": ""
  }
}

One solution could be the following:

When reading the Delta Log and encountering a file without tags, we assign it the following synthetic metadata:

val rootTags = Map(
  "maxWeight" -> Weight.MaxValue.value.toString,
  "minWeight" -> Weight.MinValue.value.toString,
  "cube" -> "",
  "state" -> State.FLOODED,
  "revision" -> lastRevisionID.toString,
  "elementCount" -> "0")

This means we place all the unknown files in the root cube of the last revision, with a weight range of [MinValue, MaxValue] ([0.0, 1.0]).
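As a minimal sketch of the fallback described above, the read path could keep a file's own tags when they exist and substitute the synthetic root-cube tags otherwise. `StagingTags`, `tagsOrSynthetic`, and the `Int` weight bounds are hypothetical stand-ins for illustration; the real `Weight` and `State` types live in qbeast-spark and are not reproduced here.

```scala
// Hypothetical stand-ins for illustration; not the qbeast-spark API.
object StagingTags {
  // Assumption: weights are stored as Int bounds, as the tag values in the example suggest.
  val MinWeight: Int = Int.MinValue
  val MaxWeight: Int = Int.MaxValue

  /** Returns the file's own tags when present, otherwise synthetic root-cube tags. */
  def tagsOrSynthetic(
      tags: Option[Map[String, String]],
      lastRevisionID: Long): Map[String, String] =
    tags
      .filter(_.nonEmpty) // missing and empty tags are both treated as "no Qbeast metadata"
      .getOrElse(
        Map(
          "maxWeight" -> MaxWeight.toString,
          "minWeight" -> MinWeight.toString,
          "cube" -> "", // empty string stands for the root cube
          "state" -> "FLOODED",
          "revision" -> lastRevisionID.toString,
          "elementCount" -> "0"))
}
```

Files that already carry Qbeast tags pass through unchanged, so only the unknown files land in the root cube of the last revision.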

Questions/design decisions:

  • What happens with elementCount? Is it necessary to know the value? If so, how can we compute it without wasting time?
  • Is it fair to use the last revision as a placeholder for this data? Would it be better to choose a non-existing empty revision with ID 0?
  • When optimizing/compacting, should we process those files and convert them to Qbeast?

osopardo1 avatar Jul 22 '22 12:07 osopardo1

Regarding elementCount

  1. If the DeltaTable file has Stats, then the value can be obtained as Stats.num_records.
  2. The number of elements in the file is used by the sharing protocol to limit the number of records the client can download. In more detail, the sharing server adds file links to the query result while the sum of elementCount is less than a specified limit.
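The limiting behaviour in point 2 could look roughly like the sketch below. `SharingLimit`, `FileLink`, and `selectFiles` are hypothetical names for illustration, not the sharing server's actual code.

```scala
// Hypothetical sketch; names and types are assumptions, not the sharing server's API.
object SharingLimit {
  final case class FileLink(path: String, elementCount: Long)

  /** Adds files in order while the running sum of elementCount is still below the limit. */
  def selectFiles(files: Seq[FileLink], limit: Long): Seq[FileLink] = {
    // Running total before each file is added: 0, c1, c1 + c2, ...
    val totalsBefore = files.scanLeft(0L)(_ + _.elementCount)
    files
      .zip(totalsBefore)
      .takeWhile { case (_, total) => total < limit }
      .map(_._1)
  }
}
```

Under this sketch, files whose elementCount is "0" never advance the running sum, so a synthetic count of zero would let every such file through regardless of the limit.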

alexeiakimov avatar Aug 29 '22 21:08 alexeiakimov

Maybe I am wrong, but the last revision can have (min, max) ranges of values (later used by the linear transformation) that are smaller than the corresponding value ranges of the records in the file. As I remember, if the given data does not fit the latest revision space while indexing, a new revision is created.

alexeiakimov avatar Aug 29 '22 21:08 alexeiakimov

Let me formulate the last item a different way: can we treat files without Qbeast metadata as indexed? Possibly they are indexed badly, but if they do not violate any invariant, then it is safe to add them to the index as if they were indexed.

alexeiakimov avatar Aug 29 '22 21:08 alexeiakimov

  1. On element count: unfortunately, we cannot assume that the DeltaTable has stats, but it is a workaround for the cases where it does. If no Stats.num_records is written, we could compute a count() for the file, which would have a performance cost. Another possible solution is to investigate whether the Parquet files have metadata we could read to avoid the computation.
  2. We want to be able to Convert to Qbeast without the overhead of indexing, and also let the user run other Lakehouse operations of the underlying formats without losing information. Yes, the goal of the issue is to treat them as indexed (badly, as you said), and slowly index them correctly (as the index grows). There can be two cases:
    1. A Revision already exists. In this case, the user has done an operation in Delta that affected the DeltaLog, and now they cannot read it correctly. If we put those files in the last revision, we need to ensure they are in the [min, max] range. But doing this computation at read time is too expensive (if we don't have any metadata like Stats). That's why putting them in the last revision without knowledge of the space could violate the constraint.
    2. A Revision does not exist. This is the case in which we convert the table from scratch to Qbeast. Here we have more freedom to write the DeltaLog with extra metadata like min-max and element count. But this process belongs more to issue #102.
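The element-count fallback from point 1 can be sketched as a two-step lookup: try the cheap per-file stats first, and only run the expensive count when they are missing. `ElementCount`, `fromStats`, and `resolve` are hypothetical names; the sketch assumes the stats arrive as the JSON string Delta Lake may write for each file, with a `numRecords` field, and it uses a regular expression instead of a JSON parser to stay dependency-free.

```scala
// Hypothetical sketch; not the qbeast-spark implementation.
object ElementCount {
  // Matches the numRecords field inside the per-file stats JSON string.
  private val NumRecords = "\"numRecords\"\\s*:\\s*(\\d+)".r

  /** Extracts numRecords from the stats JSON, if stats exist and contain it. */
  def fromStats(stats: Option[String]): Option[Long] =
    stats
      .flatMap(s => NumRecords.findFirstMatchIn(s))
      .map(_.group(1).toLong)

  /** `expensiveCount` is passed by name, so it is only evaluated when stats are missing. */
  def resolve(stats: Option[String], expensiveCount: => Long): Long =
    fromStats(stats).getOrElse(expensiveCount)
}
```

The by-name parameter is where a count() over the file, or a read of the Parquet file metadata, would plug in.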

osopardo1 avatar Aug 30 '22 07:08 osopardo1

UPDATE

From the latest conversations, we agreed that:

  • A table that is fully written in Parquet or Delta would not be readable through Qbeast.
  • The user would need to execute Convert To Qbeast to trigger the creation of the first revision.
  • All files that aren't indexed (the "staging" ones) would be assigned to the root of the last available revision. If they don't belong to the revision space, we are going to read them anyway and filter the records in memory (or use the file-skipping techniques of the underlying format, in this case Delta Lake).

This issue is a dependency of #102

osopardo1 avatar Jan 20 '23 09:01 osopardo1

Fixed on #152

osopardo1 avatar Jan 27 '23 13:01 osopardo1