ksql icon indicating copy to clipboard operation
ksql copied to clipboard

ksqlDB should support N-Way joins on the same table multiple times

Open vrudenskyi opened this issue 3 years ago • 9 comments

Describe the bug N-Way Join for the same table does not work

To Reproduce Steps to reproduce the behavior, include:

  1. The version of KSQL: Server v0.12.0
  2. Sample source data: n/a
  3. Any SQL statements you ran
CREATE STREAM joined AS 
  SELECT * 
  FROM A
    LEFT JOIN B b1 ON A.b1_id = b1.id
    LEFT JOIN B b2 ON A.b2_id = b2.id;

Expected behavior Stream created

Actual behaviour

CLI output/Error messages: Can not join 'B' to 'B': self joins are not yet supported.

Additional context n/a

vrudenskyi avatar Oct 21 '20 20:10 vrudenskyi

Hi @vrudenskyi , the behavior you're requesting is not possible in ksqlDB today as Kafka Streams (the underlying runtime) does not support it, and workarounds would require duplicating data in wasteful ways. Here are some related issues:

  • Kafka Streams does not support reading the same topic multiple times: https://issues.apache.org/jira/browse/KAFKA-6687
  • ksqlDB does not support self-joins for the same reason: https://github.com/confluentinc/ksql/issues/2030

(cc @mjsax in case something has changed since the filing of those issues)

I'm going to update this issue from bug to enhancement accordingly.

vcrfxia avatar Oct 22 '20 22:10 vcrfxia

@vcrfxia - I think the one thing that's interesting is that this isn't technically a self-join since there is no loop (though the limitation in KS probably still applies here):

  A
  |
  |--B
  |
  AB
  |
  |--B
  |
  ABB

We might want to change the error message to "n-way joins do not support reading the same topic multiple times"

agavra avatar Oct 23 '20 00:10 agavra

As the statement is CSAS it seems that A is a stream and B is a table. For this case, if should be possible to actually compile the query? It's not necessary to read B twice.

KTable b = builder.table("bTopic");
KStream A = builder.stream("aTopic");
a.selectKey(b1_id).join(b).select(b2_id).join(b);

I guess our check if a stream or table is referenced twice is rather "primitive" (I guess count based?).

mjsax avatar Oct 23 '20 01:10 mjsax

We might want to change the error message to "n-way joins do not support reading the same topic multiple times"

Sure, I've opened a PR to clarify the error message: https://github.com/confluentinc/ksql/pull/6527

vcrfxia avatar Oct 27 '20 06:10 vcrfxia

(To help others find this issue, the latest (0.14) version of the message is N-way joins do not support multiple occurrences of the same source)


I also tried duplicating the table definition to a new object against the same source topic, but ksqlDB is too clever for me:

Invalid topology: Topic TIPLOC_FLAT_KEYED has already been registered by another source.

I'd like to +1 making this possible - it's a common scenario in analytics to use the same lookup table multiple times, for example:

  • Date reference (business period of multiple dates in a stream, e..g "ship date", "close date", etc)
  • Location reference (decoding a location code to a place, country, etc for multiple locations e.g. "billing address", "delivery address", etc)

From the sounds of it this is actually possible today in Kafka Streams, but ksqlDB does not permit it - so hopefully it's a simple fix? 👼 /cc @MichaelDrogalis @derekjn

rmoff avatar Jan 05 '21 14:01 rmoff

We are hitting this issue as @rmoff mentioned in our reporting environment. Are there any alternatives or update on this issue? Thanks

dttouchdata avatar Jun 10 '22 01:06 dttouchdata

Hello @vcrfxia the upstream issue you referenced Kafka Streams is now Resolved. Can the fix also be implemented in ksqldb?

Ktl-XV avatar Aug 18 '22 21:08 Ktl-XV

@vpapavas @jnh5y you started looking into https://github.com/confluentinc/ksql/issues/2030 right? Is this ticket (for n-way joins) also in scope for what you're looking at?

vcrfxia avatar Aug 19 '22 16:08 vcrfxia

Hey @vcrfxia! N-way joins are not really in scope, nor are table-table self-joins. We might though get table-table self-joins as a side bonus if they are easy to support :)

vpapavas avatar Aug 25 '22 08:08 vpapavas

Any updates on this ? Or any workaround for this as we are investigating usage of another tool due to this reason

killermoon avatar Apr 18 '23 12:04 killermoon

Self-join optimization was done in KS in the meant-time: https://cwiki.apache.org/confluence/display/KAFKA/KIP-862%3A+Self-join+optimization+for+stream-stream+joins

But ksqlDB does not yet leverage it.

mjsax avatar Apr 18 '23 16:04 mjsax

Self-join optimization was done in KS in the meant-time: https://cwiki.apache.org/confluence/display/KAFKA/KIP-862%3A+Self-join+optimization+for+stream-stream+joins

But ksqlDB does not yet leverage it.

Fine but we only use KsqlDb in order to map, structure and deliver our data trough KSQL.

killermoon avatar Apr 18 '23 16:04 killermoon