
Support Structured, Repeated Records

Open egalpin opened this issue 3 years ago • 7 comments

As of today, the JSON index is an inverted index and can't be further configured by the user via schema settings. It would be very powerful to support the ability to specify different data types and indexes for fields of JSON columns. Using the example from the JSON Docs[1]:

{
  "name": "adam",
  "age": 30,
  "country": "us",
  "addresses":
  [
    {
      "number" : 112,
      "street" : "main st",
      "country" : "us"
    },
    {
      "number" : 2,
      "street" : "second st",
      "country" : "us"
    },
    {
      "number" : 3,
      "street" : "third st",
      "country" : "ca"
    }
  ]
}

For example, being able to specify that number is an int and to build a range index on it would allow for efficient range queries. The street address number is a contrived example that isn't really practical, but it gets the point across.

Adding a range index could be done today using an ingestion transform. But adding a range index to JSON data becomes very powerful when combined with the idea that JSON context can be maintained when using JSON_MATCH.

Ex.

SELECT ... 
FROM mytable 
WHERE JSON_MATCH(person, '"$.addresses[*].number"<=2 AND "$.addresses[*].country"=''ca''')

This would not match the adam record above, because within the context of a single addresses element there is no entry for which both predicates are true. Extracting the values with an ingestion transform cannot reproduce this behaviour. Note that the range index is just one example; ideally all index types would be supported for this kind of semi-structured JSON data with a predictable schema.
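
For contrast, under the same proposed context-preserving semantics, a query whose predicates are both satisfied by a single address element would match adam, since the second address has number 2 and country us:

SELECT ...
FROM mytable
WHERE JSON_MATCH(person, '"$.addresses[*].number"<=2 AND "$.addresses[*].country"=''us''')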

[1] https://docs.pinot.apache.org/basics/indexing/json-index#chained-key-lookup

egalpin avatar Jan 06 '22 22:01 egalpin

Note this may be related to #7863

egalpin avatar Jan 06 '22 22:01 egalpin

To configure an index for a sub-field of a JSON field, we can first extract the sub-field as a separate field through an ingestion transform, then configure the index on that separate field.
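
For illustration, here is a minimal sketch of that approach against the example above; the table name mytable, the column name address_numbers, and the exact config layout are assumptions for this sketch, not taken from the thread. The address numbers are pulled out of the person JSON column with a jsonPathArray ingestion transform and a range index is configured on the extracted column (a single-value extraction with jsonPathLong would follow the same pattern):

{
  "tableName": "mytable",
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "address_numbers",
        "transformFunction": "jsonPathArray(person, '$.addresses[*].number')"
      }
    ]
  },
  "tableIndexConfig": {
    "rangeIndexColumns": ["address_numbers"]
  }
}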

The JSON index is designed for cases where the schema is not fixed across records, and it is not as efficient as a regular inverted index because the data types of the sub-fields are unknown. In your example, since we already know the data type of the sub-field, we should extract it as a separate field to get better performance and be able to configure all index types. We may add some helper config to make the sub-field extraction smoother (#7863), but IMO we don't need to support all index types within JSON_MATCH.

Jackie-Jiang avatar Apr 06 '22 21:04 Jackie-Jiang

@Jackie-Jiang extracting a field is definitely a good approach for some use cases. Perhaps JSON is a bit of a red herring; what I would really like to see supported is columns allowing repeated records that have a fixed schema. Effectively, a column which is itself a table, sometimes referred to as a nested data type.

One issue with the idea of extracting values to other columns is that context is lost. If I want to perform conjunctions where multiple fields within a single nested record must match the query predicates (which JSON_MATCH supports), that can't be done by extracting data from a single field into separate columns, as the sketch below illustrates.
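
To make the lost-context problem concrete, here is a hedged sketch that assumes the address fields were extracted into separate multi-value columns address_numbers and address_countries (names invented for this example). Each predicate matches some value in its own column independently, so adam would incorrectly satisfy the filter even though no single address has both number <= 2 and country ca:

SELECT ...
FROM mytable
WHERE address_numbers <= 2     -- satisfied by the "second st" address (number 2)
  AND address_countries = 'ca' -- satisfied by the "third st" address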

Thoughts?

egalpin avatar Apr 06 '22 23:04 egalpin

@egalpin I see the point. Essentially we want to support arbitrarily nested arrays/maps with a fixed schema, and allow indexes on top of them. This is a very interesting and challenging problem. @richardstartin might be able to share more thoughts?

Jackie-Jiang avatar Apr 08 '22 19:04 Jackie-Jiang

I was actually considering using IN_PARTITIONED_SUBQUERY as a way to implement rudimentary nested column support. I think this could potentially be implemented natively without the need for IN_PARTITIONED_SUBQUERY in the future as well.

Imagine a table named foo with data organized like so, based on the json index docs:

id    name    age   country  parent_id  addresses.number  addresses.street  addresses.country
1     "adam"  30    "us"     1          null              null              null
null  null    null  null     1          112               "main st"         "us"
null  null    null  null     1          2                 "second st"       "us"
null  null    null  null     1          3                 "third st"        "ca"

This would allow for rich filtering using all existing indexing techniques in Pinot, with “context” maintained. Ex: “Who has lived in us while having an address number greater than 3?” (adam would match). A range index could be used on addresses.number to make the filter as efficient as possible. (The age filter is only present to show a mixture of parent-level and nested filtering.)

SELECT name, age, country
FROM foo
WHERE
  IN_PARTITIONED_SUBQUERY(id, 'SELECT ID_SET(parent_id) FROM foo WHERE addresses.number > 3 AND addresses.country = ''us'' ') = 1
  AND age > 10

Whereas the similar query “Who has lived in ca while having an address number greater than 3?” would not match adam:

SELECT name, age, country
FROM foo
WHERE
  IN_PARTITIONED_SUBQUERY(id, 'SELECT ID_SET(parent_id) FROM foo WHERE addresses.number > 3 AND addresses.country = ''ca'' ') = 1
  AND age > 10

It would also support querying “across” multiple records for the same parent_id, i.e. “bypassing” context. Ex: “Has adam ever lived in ca, and also lived on main st at some point?”

SELECT name, age, country
FROM foo
WHERE
  IN_PARTITIONED_SUBQUERY(id, 'SELECT ID_SET(parent_id) FROM foo WHERE addresses.country = ''ca''') = 1
  AND IN_PARTITIONED_SUBQUERY(id, 'SELECT ID_SET(parent_id) FROM foo WHERE addresses.street = ''main st''') = 1
  AND age > 10

This pattern would also support efficient aggregation of values from the "nested" column (as compared to something like jsonextractscalar, which needs to deserialize every match to accomplish the same). Note the inversion of id and parent_id in the use of IN_PARTITIONED_SUBQUERY this time, since we want the result set to include the nested rows as well and not only the parent rows. This is also where it's important that the parent row has id == parent_id, so that nested elements can be grouped with their parent (if aggregated). Ex:

SELECT parent_id, name, SUM(addresses.number)
FROM foo
WHERE
  IN_PARTITIONED_SUBQUERY(parent_id, 'SELECT ID_SET(id) FROM foo WHERE name = ''adam'' AND age > 10') = 1
GROUP BY parent_id, name

Thoughts on this approach?

egalpin avatar Oct 12 '22 17:10 egalpin

It is very smart. There is some extra care that needs to be taken:

  • There should be only one main record for each id, or it might cross-match other records and give unexpected results
  • To resolve IN_PARTITIONED_SUBQUERY (IN_ID_SET after resolving the IdSet), we need to scan all the ids because the inverted index won't be applied for IN_ID_SET. When there are very few matching ids, this could be slower than sending 2 queries, where the second query is constructed using the ids from the first query in an IN clause (roughly as sketched below).
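
For illustration, a rough sketch of the two-query alternative against the hypothetical foo table above; the literal id in the second query's IN clause is filled in from the first query's result:

SELECT parent_id
FROM foo
WHERE addresses.number > 3 AND addresses.country = 'us'

-- second query, with the parent_ids returned above substituted into the IN clause (e.g. 1):

SELECT name, age, country
FROM foo
WHERE id IN (1) AND age > 10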

Ideally we can build this natively to handle these inefficiencies. The subquery could happen at segment level instead of partition level (IN_PARTITIONED_SUBQUERY) or table level (IN_SUBQUERY).

Jackie-Jiang avatar Oct 17 '22 21:10 Jackie-Jiang

Got it, thanks for the insights!

Do you imagine that, hypothetically, if this were implemented natively, the ability to write the nested records out-of-band could be maintained? I think this would be a very enticing feature, because it could potentially replace partial upserts for some cases (e.g. the use case where upsert is used to allow late-arriving data to be added to an existing row at some later time).
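
For reference, a rough sketch of how that case is handled with partial upsert today (the column name and merge strategy here are illustrative assumptions, not part of this proposal): the table declares PARTIAL upsert mode plus a per-column strategy, and the schema must declare primaryKeyColumns.

{
  "upsertConfig": {
    "mode": "PARTIAL",
    "partialUpsertStrategies": {
      "addresses": "OVERWRITE"
    }
  }
}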

Just to clarify what I mean, imagine that for a Realtime/streaming table, the following two messages are received 5 minutes apart:

{
  "id": 1,
  "name": "adam",
  "age": 30,
  "country": "us"
}

... 5 min later ...

{
  "parent_id": 1,
  "addresses":
  [
    {
      "number" : 112,
      "street" : "main st",
      "country" : "us"
    },
    {
      "number" : 2,
      "street" : "second st",
      "country" : "us"
    },
    {
      "number" : 3,
      "street" : "third st",
      "country" : "ca"
    }
  ]
}

This would only work if there were a primary key, as opposed to using the $docId. It seems challenging to pull off, but if it's at all possible it would be a unique and powerful feature.
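
To make the primary-key requirement concrete, here is a hedged sketch of the relevant schema fragment (field specs abbreviated; how the nested addresses columns would be declared is exactly the open question in this issue):

{
  "schemaName": "foo",
  "dimensionFieldSpecs": [
    { "name": "id", "dataType": "LONG" },
    { "name": "parent_id", "dataType": "LONG" },
    { "name": "name", "dataType": "STRING" },
    { "name": "age", "dataType": "INT" },
    { "name": "country", "dataType": "STRING" }
  ],
  "primaryKeyColumns": ["id"]
}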

egalpin avatar Oct 20 '22 19:10 egalpin

Good point! There are some performance tradeoffs depending on the scope at which the out-of-band records are maintained. If we maintain them at partition level, it will be very similar to IN_PARTITIONED_SUBQUERY, which requires all the data for a partition to be on the same server and requires merging the selected ids from the segments. If we maintain them at segment level, we will get the best performance, but we can only support OFFLINE batch ingestion.

Jackie-Jiang avatar Oct 24 '22 17:10 Jackie-Jiang