kafka-connect-spooldir [Question] Recommended config for CSV without primary key

Open scheung38 opened this issue 4 years ago • 6 comments

What is the recommended source config for loading a CSV that has no unique primary key into Kafka? The data will be consumed by Postgres and other sinks.

Sep 14 '20 23:09 scheung38

@scheung38 Keys do not need to be unique. If you get two keys that are the same you will still get two records into Kafka. They will end up in the same partition.

Sep 14 '20 23:09 jcustenborder
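
To illustrate the point about duplicate keys, here is a minimal sketch of key-based partitioning. Kafka's default partitioner hashes the key bytes with murmur2; this sketch uses `zlib.crc32` as a stand-in, and the records and the partition count of 6 are hypothetical.

```python
import zlib

def pick_partition(key, num_partitions):
    """Map a key to a partition deterministically.

    crc32 is only a stand-in here; Kafka's default partitioner uses murmur2.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical records: two rows share the key "order-01".
records = [("order-01", "row A"), ("order-01", "row B"), ("order-02", "row C")]

partitions = {}
for key, value in records:
    # Every record is kept; equal keys simply land in the same partition.
    partitions.setdefault(pick_partition(key, 6), []).append(value)

print(partitions)
```

Both `order-01` rows are stored, and they share a partition, which preserves their relative order.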

For example:

[Image: Screenshot 2020-09-15 at 00.08.14]

Given this scenario, is there a recommended config? If there are no keys, won't building a KSQL KTable later require a primary key? And without a unique key, how can each row be identified when updates arrive later?

Sep 14 '20 23:09 scheung38

I'm not sure about the KSQL part; I'd have to look into that. I think it doesn't support compound keys, so you might need to use a single message transform to make just "order-01" your key. If you are looking for the aggregate view of order-01, you might need to use Kafka Streams instead and build a hierarchy.

Sep 14 '20 23:09 jcustenborder
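
One way to make a single field the record key is to chain the stock Kafka Connect transforms `ValueToKey` and `ExtractField$Key`. The sketch below builds the transform entries as a Python dict that could be merged into a connector config. The column name `order_id` and the aliases `copyKey` and `extractKey` are assumptions, since the thread does not show the CSV header.

```python
import json

# Hypothetical column name; replace with the CSV header that holds the order id.
KEY_FIELD = "order_id"

smt_config = {
    # Transforms run in the order listed.
    "transforms": "copyKey,extractKey",
    # Copy the chosen value field into the record key (as a struct).
    "transforms.copyKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.copyKey.fields": KEY_FIELD,
    # Unwrap the struct so the key is the bare field value.
    "transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extractKey.field": KEY_FIELD,
}

print(json.dumps(smt_config, indent=2))
```

These entries would sit alongside the connector's own settings (input path, topic, and so on), which are omitted here.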

Then could we hash each row on the fly to provide uniqueness? It might be better to write a UDF that hashes several columns, say C, D, E for hash1 and E, F, G for hash2.

A single field such as 'order-01' won't provide sufficient uniqueness; several fields are required.

Sep 14 '20 23:09 scheung38
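
The multi-column hashing idea can be sketched as follows. This is not a KSQL UDF, only the hashing logic such a UDF might wrap; the column letters come from the comment above, while the row values, the SHA-256 choice, and the separator are assumptions.

```python
import hashlib

def row_hash(row, columns):
    """Hash the selected columns of a row into a hex digest."""
    # Join with a unit separator so ("ab", "c") and ("a", "bc") hash differently.
    joined = "\x1f".join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical row; the real column contents are not shown in the thread.
row = {"C": "order-01", "D": "8", "E": "2020", "F": "widget", "G": "3"}

hash1 = row_hash(row, ["C", "D", "E"])
hash2 = row_hash(row, ["E", "F", "G"])
print(hash1, hash2)
```

A hash only distinguishes rows whose selected columns differ: two identical rows produce the same digest, so this yields a usable key only if the chosen columns are unique in combination.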

Maybe. Given that I don't fully understand this data, I can't tell you for sure. You could also build a single message transform that concatenates a few fields, e.g. order-01:8:2020. Personally, I would want to aggregate this into another topic that is keyed by order-01 with an array of the other content, something like key: order-01, value: [{},{},{}], where the values are the combination of those rows. That would give me all the data of an order in a single record.

Sep 14 '20 23:09 jcustenborder
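
Both suggestions can be sketched in plain Python, outside Kafka: a concatenated key in the style of `order-01:8:2020`, and an aggregate keyed by order whose value is a list of the remaining fields. The field names and rows are hypothetical, and in practice the aggregation would run in Kafka Streams or a similar engine rather than in memory.

```python
from collections import defaultdict

# Hypothetical rows; the real CSV columns are not shown in the thread.
rows = [
    {"order": "order-01", "month": "8", "year": "2020", "item": "widget"},
    {"order": "order-01", "month": "9", "year": "2020", "item": "gadget"},
    {"order": "order-02", "month": "8", "year": "2020", "item": "widget"},
]

def concat_key(row, fields, sep=":"):
    """Build a compound key by joining several field values."""
    return sep.join(row[f] for f in fields)

def aggregate_by(rows, key_field):
    """Group rows under key_field, keeping the other fields of each row."""
    grouped = defaultdict(list)
    for row in rows:
        rest = {k: v for k, v in row.items() if k != key_field}
        grouped[row[key_field]].append(rest)
    return dict(grouped)

keys = [concat_key(r, ["order", "month", "year"]) for r in rows]
by_order = aggregate_by(rows, "order")
print(keys)
print(by_order)
```

The concatenated key keeps one record per row, while the aggregate collapses each order into a single record.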

This is just a tiny sample; we will be getting a large amount of data. Other than the first two columns, the rest of the fields should be different enough to generate a unique hash as a primary key for each row. Why put it into a new topic? Could we not append the hash as a new column?

Sep 14 '20 23:09 scheung38
