Kafka ingestion should support 1-to-0 or 1-to-N relationships between input data and the resulting stream
As a user, I do not want to be limited to 1-to-1 matching between my input ConsumerRecord and the output row in my table.
A simple example: I may want to examine the record in a shallow manner (e.g., key only) and filter out records whose keys do not match some criterion before doing fuller parsing. This would result in a 1-input-record-to-0-output-rows situation.
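As a rough sketch of that 1-to-0 case, the ingestion layer could accept a shallow predicate over the raw ConsumerRecord before any value parsing happens; the hook below (its name, its shape, and where it would plug in) is entirely hypothetical:

```java
import java.util.function.Predicate;

import org.apache.kafka.clients.consumer.ConsumerRecord;

public class KeyOnlyFilter {
    // Hypothetical pre-parse hook: evaluated against the raw record before the
    // value bytes are deserialized. A false result means zero output rows.
    public static final Predicate<ConsumerRecord<String, byte[]>> KEY_FILTER =
            record -> record.key() != null && record.key().startsWith("orders/");
}
```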
Another example: I may want to process JSON objects that contain parallel arrays, which I want to unroll into multiple output rows during ingestion.
Potentially similar to https://github.com/deephaven/deephaven-core/issues/2753.
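A minimal sketch of the 1-to-N unrolling case, assuming a hypothetical row-producing transform that the ingestion layer would call once per parsed record (the Message and Row types here are stand-ins, not existing API):

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelArrayUnroll {
    // Hypothetical parsed message containing parallel arrays of equal length.
    public record Message(String symbol, double[] prices, long[] sizes) {}

    // Hypothetical output row; in practice this would be whatever row-builder
    // abstraction the ingestion layer exposes.
    public record Row(String symbol, double price, long size) {}

    // One input message unrolls into one output row per array index (1-to-N).
    public static List<Row> unroll(Message msg) {
        List<Row> rows = new ArrayList<>(msg.prices().length);
        for (int i = 0; i < msg.prices().length; i++) {
            rows.add(new Row(msg.symbol(), msg.prices()[i], msg.sizes()[i]));
        }
        return rows;
    }
}
```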
I could also imagine situations where you want a single Kafka stream to produce outputs to different tables depending on context.
This is potentially a common use case with type-discriminated JSON:
```json
{
  "type": "quote",
  "symbol": "Foo",
  "bid": 1.01,
  "ask": 1.05
}
{
  "type": "trade",
  "symbol": "Bar",
  "price": 42.42
}
```
Avro and Protobuf both have union types where this pattern might be common, too.
Instead of producing a single table,

- `[type: String, symbol: String, bid: double, ask: double, price: double]`

you might want two tables:

- `[symbol: String, bid: double, ask: double]`
- `[symbol: String, price: double]`
You could imagine more complex cases where the field names overlap but have differing types, so you wouldn't be able to efficiently emulate the result with a single-table version.
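A sketch of what routing on the discriminator might look like; the RowSink abstraction and the one-sink-per-output-table shape are assumptions, not existing Deephaven API (Jackson is used here only for illustration):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TypeDiscriminatedRouter {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Hypothetical sink abstraction: one per output table.
    public interface RowSink { void accept(JsonNode payload); }

    private final RowSink quotes;
    private final RowSink trades;

    public TypeDiscriminatedRouter(RowSink quotes, RowSink trades) {
        this.quotes = quotes;
        this.trades = trades;
    }

    // Route each message to the table matching its "type" discriminator;
    // unknown types produce zero rows (the 1-to-0 case again).
    public void route(byte[] value) throws Exception {
        JsonNode node = MAPPER.readTree(value);
        switch (node.path("type").asText()) {
            case "quote" -> quotes.accept(node);
            case "trade" -> trades.accept(node);
            default -> { /* drop, or send to a dead-letter table */ }
        }
    }
}
```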
There are also cases where the customer wants "nested" tables.
A stylized example of a request we had for Solace was:

```json
{
  "type": "complex order",
  "id": 1234,
  "legs": [
    { "id": "1234_1", "cause_data": [ { "a": 1 }, { "a": 2 } ] },
    { "id": "1234_2", "cause_data": [ { "a": 3 } ] }
  ]
}
```
Instead of unrolling everything everywhere, they would want a table for orders, a table for legs, and a table for cause_data, with foreign-key references between the tables.
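A minimal sketch of the relational shape that implies for the example above, with hypothetical key-column names; each child row carries a foreign key back to its parent:

```java
public class NestedOrderTables {
    // orders: one row per top-level message, e.g. (1234, "complex order").
    public record OrderRow(long orderId, String type) {}

    // legs: one row per "legs" element, keyed back to its order,
    // e.g. (1234, "1234_1"), (1234, "1234_2").
    public record LegRow(long orderId, String legId) {}

    // cause_data: one row per element, keyed back to its leg,
    // e.g. ("1234_1", 1), ("1234_1", 2), ("1234_2", 3).
    public record CauseDataRow(String legId, int a) {}
}
```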
Yes, I've had discussions regarding exactly this idea with respect to repeated / nested Parquet types. I think there is a really interesting way we could consider building this, with X physical tables (where X is the number of unique nested schemas + 1, give or take depending on the implementation), and you get back one result table where the repeated elements are represented as a column of type Table (and, recursively, that column's rows may have further repeated elements represented as columns of type Table, and so on).
The logical results are similar to what you could get if you had the X physical tables and back-constructed joins with the keys; but when you are building this structure from "pre-joined" event data, it doesn't make sense to me to destructure it just to rejoin it with joins. I think Table#getSubTable is potentially the relevant construction when imagining building up this recursive Table implementation structure.
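Purely as a conceptual sketch of that column-of-type-Table idea (none of this is existing Deephaven API; Table here is a stand-in, not io.deephaven.engine.table.Table):

```java
public class RecursiveTableSketch {
    // Stand-in for the engine's Table interface.
    public interface Table {}

    // One parent row: the repeated "legs" elements live behind a Table-typed
    // cell rather than behind foreign keys into a separate result table. In
    // spirit, that cell would be a getSubTable-style view over just this
    // order's rows in the legs physical table; each of those rows could in
    // turn expose a cause_data sub-table, recursively.
    public record OrderRow(long orderId, String type, Table legs) {}
}
```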