kafka-connect-spooldir

enhancement request

Open georgelza opened this issue 5 years ago • 5 comments

It would be great to be able to specify the key of the messages (events) posted, or at least to specify values that can be pushed into the message headers; if the values are in the headers, an SMT can be used to enrich or key the message down the line.

For example:

- File name: where a directory is monitored, all messages from one file might need to be processed in order and identified as originating from a single file.
- Server name: there might be multiple servers where the same directory is monitored/shared, so it would be useful to key the files loaded per server.
- Fixed text: multiple systems might write logs or output to a common directory on a server. By configuring multiple spool dirs to monitor that common directory, each with a specific regular expression selecting which files to read/ingest, and then allowing a fixed text string to be used as the key (or pushed into the header), the source can be identified better.

My thinking would be to have the specified values pushed as tags into the header; an SMT could then move them into the message, either as the key or as part of the data, repacking a single line into a JSON structure with the header tags as values.

PS: I really need this, but I'm useless with Java, so I'm hoping someone agrees this would be a great enhancement.

G

georgelza avatar Feb 03 '20 18:02 georgelza

@georgelza That feature is available in the 2.0 branch. It's going to be released to the Confluent Hub soon. For now you could pull it down and build it with `mvn clean package`. For inserting the server name I would do that in a transformation. jcustenborder/kafka-connect-transformation-common#56
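For readers following along: the transformation route suggested here can be sketched with Apache Kafka's built-in `InsertField` SMT, which adds a static field to each record's value (this requires the value to be a struct or map, not a raw string). The connector class and property names below are illustrative, not an exact recipe; check the spooldir README for the real connector settings.

```json
{
  "name": "spooldir-host-a",
  "config": {
    "connector.class": "com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector",
    "topic": "shared-topic",
    "transforms": "addHost",
    "transforms.addHost.type": "org.apache.kafka.connect.transforms.InsertField$Value",
    "transforms.addHost.static.field": "source_host",
    "transforms.addHost.static.value": "host-a"
  }
}
```

Each Connect node would run its own copy of this connector with its own `static.value`, which is exactly the hard-coding concern raised later in this thread.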

jcustenborder avatar Feb 06 '20 16:02 jcustenborder

#116

jcustenborder avatar Feb 06 '20 16:02 jcustenborder

Glad to hear about 2.0 and the features. You say "that feature" — I assume you refer to the ability to add keys to events? The server name I will have to rethink. I'm aware of SMTs, but there was a specific reason why I was hoping the key could be key.value = {hostname}, and why I figured an SMT won't work. Thanks.

georgelza avatar Feb 06 '20 17:02 georgelza

SMTs are cool because you can do whatever you want with the record: you receive a record and emit a record. This connector will only key data based on the data in the files it processes; unfortunately I don't have a way to inject the server name. Question for you though: why would you want the data to be keyed by the hostname? Is it logging data specific to that host?

jcustenborder avatar Feb 06 '20 17:02 jcustenborder

Clustered environments with sources writing to a shared NFS, each host running a Connect node, all writing the stream to the same topic. I could, as you say, write an SMT for each connect job/definition that takes events/lines and injects the host name into the message/key before outputting to the common topic, but this means hard-coding a value into the SMT step, and if I need to add more source nodes it means more hard-coding, instead of being able to make the hostname a variable that goes into the source JSON input. As said, I had very good reasons, and one by one I disqualified the other ideas.
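Editor's note: the hard-coding concern above can potentially be addressed with Connect's config providers, which resolve placeholders in connector configs at runtime. An environment-variable provider (`EnvVarConfigProvider`) was added in newer Kafka releases (KIP-887, Kafka 3.5); older versions ship `FileConfigProvider`, which reads values from a per-host properties file instead. A hedged sketch, assuming a Kafka version with the env provider:

```properties
# Connect worker configuration on each host
config.providers=env
config.providers.env.class=org.apache.kafka.common.config.provider.EnvVarConfigProvider
```

The same connector JSON can then be deployed unchanged to every host, with the value resolved per node:

```json
"transforms.addHost.static.value": "${env:HOSTNAME}"
```

This keeps one connector definition while letting each Connect worker stamp its own hostname.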

You say "This connector will only key data based on the data in the files it processes." This implies I would not be able to have one connector watching a directory and key all the messages from a single file with the name of that file (this is really critical). If I can't have this feature, I'm going to have to write a file watcher and reader/publisher myself, nullifying the connector's value.

georgelza avatar Feb 06 '20 17:02 georgelza