kcat
Piping consumer -> producer `kafkacat` for binary data
Hi,
Sometimes (usually for test purposes) I want to quickly copy a certain number of messages from one topic to another. If the messages are text, I can use kafkacat to do something like
kafkacat -C -b localhost -t source -c 10 -e | kafkacat -P -b localhost -t dest
This doesn't seem to work when the messages are binary, though, as any appearance of \n in the message byte stream will be interpreted as a delimiter by the producer.
I tried changing the delimiter to something unlikely to be found in the byte stream (e.g., an equivalent of -D "__ENDMESSAGE__"). The consumer handles this nicely, but it looks like the producer just takes the first character passed with the -D option, not the full string.
My current workaround is to set a delimiter on the consumer end with -D, then pipe the output to awk to write it out into a set of temporary files (one per message), and then run the producer with the list of those files (see here for the resulting shell script). This requires the files to hit the disk, though, and isn't quite as elegant as it could be.
My question -- is there a better/easier way to do this? If not, are there obstacles to extending -D in producer mode to take a string as a delimiter, rather than a single character?
I have this issue too, and I've found a PR that is supposed to allow multi-byte delimiters on the consumer side (https://github.com/edenhill/kafkacat/pull/150). However, it hasn't been merged yet, and for me the fork segfaults.
Hi! Thanks for sharing the shell script @mjuric. I might have missed something, but for me it's not working:
kafkacat -C -e -b localhost -t "$SRC" -D "$DELIM"
I just wonder: if -D does not support multi-byte delimiters, how would that work?
It supports multi-byte delimiters in consumer mode, but not when producing. At least that was the state as of a year ago when I wrote the hack; I'm not sure whether something has changed in the meantime.
Maybe a more general solution to this problem is to allow loading/dumping data in base64 format. It is inefficient, but robust and easy to implement.
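Lacking built-in support, the base64 idea can be approximated in the shell today: encode each payload onto a single base64 line before producing, so a plain newline delimiter becomes safe again, and decode after consuming. A minimal sketch of the round trip (the kafkacat invocations are illustrative and assume a local broker; note also that shell variables cannot hold NUL bytes, so a real implementation would stream through pipes rather than variables):

```shell
# A payload containing a raw newline, which would normally be split
# into two messages by the producer's default delimiter.
orig=$(printf 'line1\nline2')

# base64-encode and strip wrapping so the payload is one delimiter-free line.
payload=$(printf '%s' "$orig" | base64 | tr -d '\n')
printf '%s\n' "$payload"

# Producing and consuming would then look like:
#   printf '%s\n' "$payload" | kafkacat -P -b localhost -t dest
#   kafkacat -C -b localhost -t dest -e | while IFS= read -r m; do
#       printf '%s' "$m" | base64 -d   # recover the original bytes
#   done

# Round trip back to the original bytes:
printf '%s' "$payload" | base64 -d
```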
The other approach I was hoping would work would be to use the JSON wrapper, but the producer doesn't seem to unwrap it. If I do:
kafkacat -b mybroker -C -t mytopic -o $START -c $COUNT -J
The binary gets encoded into newline delimited JSON messages. But if I then pipe that into a producer with:
kafkacat -b mybroker -C -t mytopic -o $START -c $COUNT -J | kafkacat -b mybroker -P -t mytopic2 -J
The messages on mytopic2 are still JSON encoded. If the producer honored the -J flag -- ignoring the topic, partition, offset, and timestamp, and producing based on the decoded key and payload -- you'd be able to handle binary data accordingly.
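For text-safe payloads, the unwrapping can be done externally with jq. A sketch (the "key" and "payload" field names are assumed from the -J envelope; this does not solve the binary case, since arbitrary bytes don't survive JSON string encoding, which is exactly why producer-side -J support would be needed):

```shell
# Unwrap a -J envelope into "key:payload" lines that the producer can
# re-split with its -K key delimiter. Sample envelope stands in for
# what `kafkacat -C ... -J` would emit, one JSON object per line.
sample='{"topic":"mytopic","partition":0,"offset":5,"key":"k1","payload":"hello"}'
printf '%s\n' "$sample" | jq -r '[.key, .payload] | join(":")'

# The full pipeline might then look like:
#   kafkacat -b mybroker -C -t mytopic -o $START -c $COUNT -J \
#     | jq -r '[.key, .payload] | join(":")' \
#     | kafkacat -b mybroker -P -t mytopic2 -K:
```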
Also, I'm not sure a multicharacter delimiter would be a huge help in my case, as I'm not sure I can identify a sequence of characters that could not appear in my messages.
Multi-byte delimiters have been implemented on both the consumer and producer sides as
kafkacat ..... -K "KEY_DELIM" -D "ITEM_DELIM"
see https://github.com/edenhill/kafkacat/blob/master/tests/0002-delim.sh
I have created a pull request for this. Vote for me. #295
Bump. It's now 2023 and this is still not possible? It should be a no-brainer.