kafka-connect-spooldir
[Question] Recommended config for CSV without primary key
What is the recommended source config for loading CSV into Kafka when the rows have no unique primary keys? The data will be consumed by Postgres and other sinks.
@scheung38 Keys do not need to be unique. If two records have the same key, you still get two records into Kafka; they simply end up in the same partition.
For example:
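A minimal producer sketch of that behavior (the topic name `orders` and the duplicated key `order-01` are illustrative placeholders, not anything the connector requires):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DuplicateKeyDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both sends succeed: Kafka does not enforce key uniqueness.
            // The default partitioner hashes the key, so both records
            // land in the same partition of the topic.
            producer.send(new ProducerRecord<>("orders", "order-01", "first row"));
            producer.send(new ProducerRecord<>("orders", "order-01", "second row"));
        }
    }
}
```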
So given this scenario, is there any recommended config? If there are no keys, won't creating a KSQL KTable later require a primary key? And how can each row be identified when updates occur later, if there is no unique key?
I'm not sure about the KSQL part; I'd have to look into that. I don't think it supports compound keys, so you might need to use a single message transform to make just "order-01" your key. If you are looking for the aggregate view of order-01, you might need to use Kafka Streams instead and build a hierarchy.
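A sketch of that single message transform approach using the built-in ValueToKey and ExtractField transforms (the paths and the `order_id` column name are placeholders, and this is an untested outline, not a verified recommendation):

```properties
name=csv-orders-source
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector
topic=orders
input.path=/data/input
finished.path=/data/finished
error.path=/data/error
# Java regex in a properties file, hence the doubled backslash.
input.file.pattern=^orders.*\\.csv$
csv.first.row.as.header=true
schema.generation.enabled=true
# Promote the order id column to the record key, then unwrap it
# from a one-field struct to a plain string.
transforms=toKey,extractKey
transforms.toKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.toKey.fields=order_id
transforms.extractKey.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractKey.field=order_id
```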
Then could we hash each row on the fly to provide uniqueness? It may be better to write a UDF that hashes based on several columns, say C, D, E for hash1 and E, F, G for hash2.
Hashing just one field, say 'order-01', won't provide sufficient uniqueness; it requires several fields.
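A sketch of such a multi-column hash in plain Java (the column values are illustrative, and whether a given set of columns is actually unique depends on the data):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat; // Java 17+

public class RowHash {
    // Hash an ordered set of columns into a stable hex key. A separator
    // is inserted so ("ab", "c") and ("a", "bc") hash differently.
    static String hashKey(String... columns) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        String joined = String.join("\u0001", columns);
        return HexFormat.of().formatHex(
                sha256.digest(joined.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hashKey("order-01", "8", "2020")); // e.g. columns C, D, E
    }
}
```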
Maybe. Given that I don't fully understand this data, I can't tell you for sure. You could also build a single message transform that concatenates a few fields; think order-01:8:2020, etc. I personally would want to aggregate this to another topic that is keyed by order-1 with an array of the other content: something like key: order-1, value: [{},{},{}], where the values are the combination of those rows. That would give me all the data of an order in a single record.
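A minimal Kafka Streams sketch of that aggregation (the topic names and the use of plain JSON strings are assumptions; a real pipeline would likely use a proper serde):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;

public class OrderAggregator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // "orders" is keyed by order id (e.g. order-01); each value is
        // one CSV row rendered as a JSON object.
        KStream<String, String> rows = builder.stream("orders");
        rows.groupByKey()
            .aggregate(
                () -> "[]",
                // Append each new row to the JSON array for its order,
                // yielding key: order-01, value: [{...},{...},...].
                (orderId, row, agg) -> agg.equals("[]")
                        ? "[" + row + "]"
                        : agg.substring(0, agg.length() - 1) + "," + row + "]",
                Materialized.with(Serdes.String(), Serdes.String()))
            .toStream()
            .to("orders-aggregated");

        new KafkaStreams(builder.build(), props).start();
    }
}
```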
This is just a tiny sample; we will get a large amount of data. Other than the first two columns, the rest of the fields should be different enough to generate a unique hash as the primary key for each row. Why put it into a new topic? Could we not append the hash as a new column?
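One way to append a hash column would be a custom single message transform, since the built-in InsertField transform only adds literal or metadata fields. A rough sketch (the class name, the `row_hash` field, and the hard-coded column list are all hypothetical, and it assumes structured values with a schema):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

// Hypothetical SMT: copies the value struct and appends a "row_hash"
// column computed from a fixed set of source columns.
public class AppendRowHash<R extends ConnectRecord<R>> implements Transformation<R> {
    private static final List<String> HASHED_COLUMNS = List.of("c", "d", "e");

    @Override
    public R apply(R record) {
        Struct value = (Struct) record.value();
        Schema schema = value.schema();

        // Rebuild the value schema with one extra string field.
        SchemaBuilder builder = SchemaBuilder.struct();
        for (Field field : schema.fields()) {
            builder.field(field.name(), field.schema());
        }
        builder.field("row_hash", Schema.STRING_SCHEMA);
        Schema newSchema = builder.build();

        // Copy the old fields and collect the ones to hash
        // (in schema order), then add the computed hash.
        Struct newValue = new Struct(newSchema);
        StringBuilder joined = new StringBuilder();
        for (Field field : schema.fields()) {
            newValue.put(field.name(), value.get(field));
            if (HASHED_COLUMNS.contains(field.name())) {
                joined.append(value.get(field)).append('\u0001');
            }
        }
        newValue.put("row_hash", sha256(joined.toString()));

        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), newSchema, newValue,
                record.timestamp());
    }

    private static String sha256(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(
                    digest.digest(input.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}
}
```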