Tombstone messages: (ksqlDB tutorial) How to find distinct values in a stream of events
For the tutorial at https://kafka-tutorials.confluent.io/finding-distinct-events/ksql.html: if we don't make use of the TIMESTAMP column, I see tombstone messages. What is the reason for this?
The CLICKS stream is modified to not include the TIMESTAMP column, and is created as below:
CREATE STREAM CLICKS (IP_ADDRESS STRING, URL STRING)
    WITH (KAFKA_TOPIC = 'CLICKS',
          FORMAT = 'JSON',
          PARTITIONS = 1);
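For reference, the lesson's version of the stream (as I read it on the tutorial page; the TIMESTAMP_FORMAT string is my transcription from there) declares a TIMESTAMP column and registers it as the message timestamp, which is the only part removed above:

```sql
-- Lesson's version: TIMESTAMP column drives the event time of each record
CREATE STREAM CLICKS (IP_ADDRESS STRING, URL STRING, TIMESTAMP STRING)
    WITH (KAFKA_TOPIC = 'CLICKS',
          FORMAT = 'JSON',
          TIMESTAMP = 'TIMESTAMP',
          TIMESTAMP_FORMAT = 'yyyy-MM-dd''T''HH:mm:ssXXX',
          PARTITIONS = 1);
```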
We then insert the values
INSERT INTO CLICKS (IP_ADDRESS, URL) VALUES ('10.0.0.1', 'https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html');
INSERT INTO CLICKS (IP_ADDRESS, URL) VALUES ('10.0.0.12', 'https://www.confluent.io/hub/confluentinc/kafka-connect-datagen');
INSERT INTO CLICKS (IP_ADDRESS, URL) VALUES ('10.0.0.13', 'https://www.confluent.io/hub/confluentinc/kafka-connect-datagen');
INSERT INTO CLICKS (IP_ADDRESS, URL) VALUES ('10.0.0.1', 'https://docs.confluent.io/current/tutorials/examples/kubernetes/gke-base/docs/index.html');
INSERT INTO CLICKS (IP_ADDRESS, URL) VALUES ('10.0.0.12', 'https://www.confluent.io/hub/confluentinc/kafka-connect-datagen');
INSERT INTO CLICKS (IP_ADDRESS, URL) VALUES ('10.0.0.13', 'https://www.confluent.io/hub/confluentinc/kafka-connect-datagen');
We can then query the stream; with HAVING COUNT(IP_ADDRESS) = 1, only clicks seen exactly once are emitted:
ksql> SELECT
   IP_ADDRESS,
   URL
FROM CLICKS 
GROUP BY IP_ADDRESS, URL
HAVING COUNT(IP_ADDRESS) = 1
EMIT CHANGES;
+----------------------------------------------------------------+----------------------------------------------------------------+
|IP_ADDRESS                                                      |URL                                                             |
+----------------------------------------------------------------+----------------------------------------------------------------+
|10.0.0.1                                                        |https://docs.confluent.io/current/tutorials/examples/kubernetes/|
|                                                                |gke-base/docs/index.html                                        |
|10.0.0.12                                                       |https://www.confluent.io/hub/confluentinc/kafka-connect-datagen |
|10.0.0.13                                                       |https://www.confluent.io/hub/confluentinc/kafka-connect-datagen |
|10.0.0.1                                                        |https://docs.confluent.io/current/tutorials/examples/kubernetes/|
|                                                                |gke-base/docs/index.html                                        |
|10.0.0.12                                                       |https://www.confluent.io/hub/confluentinc/kafka-connect-datagen |
|10.0.0.13                                                       |https://www.confluent.io/hub/confluentinc/kafka-connect-datagen |
We can also create the table as per the lesson and query the table to see the tombstone messages.
CREATE TABLE DETECTED_CLICKS AS
    SELECT
        IP_ADDRESS AS KEY1,
        URL AS KEY2,
        AS_VALUE(IP_ADDRESS) AS IP_ADDRESS,
        AS_VALUE(URL) AS URL
    FROM CLICKS 
    GROUP BY IP_ADDRESS, URL
    HAVING COUNT(IP_ADDRESS) = 1;
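For comparison, the lesson's version of this statement (again, as I read it on the tutorial page; the tumbling-window clause is my transcription from there) also windows the aggregation, which is the other part dropped above:

```sql
-- Lesson's version: same aggregation, but scoped to a tumbling window
CREATE TABLE DETECTED_CLICKS AS
    SELECT
        IP_ADDRESS AS KEY1,
        URL AS KEY2,
        AS_VALUE(IP_ADDRESS) AS IP_ADDRESS,
        AS_VALUE(URL) AS URL
    FROM CLICKS WINDOW TUMBLING (SIZE 2 MINUTES, RETENTION 1000 DAYS)
    GROUP BY IP_ADDRESS, URL
    HAVING COUNT(IP_ADDRESS) = 1;
```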
SELECT * FROM DETECTED_CLICKS EMIT CHANGES;
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|KEY1                           |KEY2                           |IP_ADDRESS                     |URL                            |
+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
|10.0.0.1                       |https://docs.confluent.io/curre|10.0.0.1                       |https://docs.confluent.io/curre|
|                               |nt/tutorials/examples/kubernete|                               |nt/tutorials/examples/kubernete|
|                               |s/gke-base/docs/index.html     |                               |s/gke-base/docs/index.html     |
|10.0.0.12                      |https://www.confluent.io/hub/co|10.0.0.12                      |https://www.confluent.io/hub/co|
|                               |nfluentinc/kafka-connect-datage|                               |nfluentinc/kafka-connect-datage|
|                               |n                              |                               |n                              |
|10.0.0.13                      |https://www.confluent.io/hub/co|10.0.0.13                      |https://www.confluent.io/hub/co|
|                               |nfluentinc/kafka-connect-datage|                               |nfluentinc/kafka-connect-datage|
|                               |n                              |                               |n                              |
|10.0.0.1                       |https://docs.confluent.io/curre|<TOMBSTONE>                    |<TOMBSTONE>                    |
|                               |nt/tutorials/examples/kubernete|                               |                               |
|                               |s/gke-base/docs/index.html     |                               |                               |
|10.0.0.12                      |https://www.confluent.io/hub/co|<TOMBSTONE>                    |<TOMBSTONE>                    |
|                               |nfluentinc/kafka-connect-datage|                               |                               |
|                               |n                              |                               |                               |
|10.0.0.13                      |https://www.confluent.io/hub/co|<TOMBSTONE>                    |<TOMBSTONE>                    |
|                               |nfluentinc/kafka-connect-datage|                               |                               |
|                               |n                              |                               |                               |
What is the magic behind the TIMESTAMP field that leads to the completely different behavior shown in the lesson's explanation?
Much appreciated.