ksql
ksql copied to clipboard
ksqlDB should support N-Way joins on the same table multiple times
Describe the bug N-Way Join for the same table does not work
To Reproduce Steps to reproduce the behavior, include:
- The version of KSQL:
Server v0.12.0
- Sample source data:
n/a
- Any SQL statements you ran
CREATE STREAM joined AS
SELECT *
FROM A
LEFT JOIN B b1 ON A.b1_id = b1.id
LEFT JOIN B b2 ON A.b2_id = b2.id;
Expected behavior Stream created
Actual behaviour
CLI output/Error messages:
Can not join 'B' to 'B': self joins are not yet supported.
Additional context
n/a
Hi @vrudenskyi , the behavior you're requesting is not possible in ksqlDB today as Kafka Streams (the underlying runtime) does not support it, and workarounds would require duplicating data in wasteful ways. Here are some related issues:
- Kafka Streams does not support reading the same topic multiple times: https://issues.apache.org/jira/browse/KAFKA-6687
- ksqlDB does not support self-joins for the same reason: https://github.com/confluentinc/ksql/issues/2030
(cc @mjsax in case something has changed since the filing of those issues)
I'm going to update this issue from bug
to enhancement
accordingly.
@vcrfxia - I think the one thing that's interesting is that this isn't technically a self-join since there is no loop (though the limitation in KS probably still applies here):
A
|
|--B
|
AB
|
|--B
|
ABB
We might want to change the error message to "n-way joins do not support reading the same topic multiple times"
As the statement is CSAS
it seems that A
is a stream and B
is a table. For this case, if should be possible to actually compile the query? It's not necessary to read B
twice.
KTable b = builder.table("bTopic");
KStream A = builder.stream("aTopic");
a.selectKey(b1_id).join(b).select(b2_id).join(b);
I guess our check if a stream or table is referenced twice is rather "primitive" (I guess count based?).
We might want to change the error message to "n-way joins do not support reading the same topic multiple times"
Sure, I've opened a PR to clarify the error message: https://github.com/confluentinc/ksql/pull/6527
(To help others find this issue, the latest (0.14) version of the message is N-way joins do not support multiple occurrences of the same source
)
I also tried duplicating the table definition to a new object against the same source topic, but ksqlDB is too clever for me:
Invalid topology: Topic TIPLOC_FLAT_KEYED has already been registered by another source.
I'd like to +1 making this possible - it's a common scenario in analytics to use the same lookup table multiple times, for example:
- Date reference (business period of multiple dates in a stream, e..g "ship date", "close date", etc)
- Location reference (decoding a location code to a place, country, etc for multiple locations e.g. "billing address", "delivery address", etc)
From the sounds of it this is actually possible today in Kafka Streams, but ksqlDB does not permit it - so hopefully it's a simple fix? 👼 /cc @MichaelDrogalis @derekjn
We are hitting this issue as @rmoff mentioned in our reporting environment. Are there any alternatives or update on this issue? Thanks
Hello @vcrfxia the upstream issue you referenced Kafka Streams is now Resolved. Can the fix also be implemented in ksqldb?
@vpapavas @jnh5y you started looking into https://github.com/confluentinc/ksql/issues/2030 right? Is this ticket (for n-way joins) also in scope for what you're looking at?
Hey @vcrfxia! N-way joins are not really in scope, nor are table-table self-joins. We might though get table-table self-joins as a side bonus if they are easy to support :)
Any updates on this ? Or any workaround for this as we are investigating usage of another tool due to this reason
Self-join optimization was done in KS in the meant-time: https://cwiki.apache.org/confluence/display/KAFKA/KIP-862%3A+Self-join+optimization+for+stream-stream+joins
But ksqlDB does not yet leverage it.
Self-join optimization was done in KS in the meant-time: https://cwiki.apache.org/confluence/display/KAFKA/KIP-862%3A+Self-join+optimization+for+stream-stream+joins
But ksqlDB does not yet leverage it.
Fine but we only use KsqlDb in order to map, structure and deliver our data trough KSQL.