
Piping consumer -> producer `kafkacat` for binary data

mjuric opened this issue 6 years ago · 9 comments

Hi! Sometimes (usually for test purposes) I want to quickly copy a certain number of messages from one topic to another. If the messages are text, I can use kafkacat to do something like:

kafkacat -C -b localhost -t source -c 10 -e | kafkacat -P -b localhost -t dest

This doesn't seem to work when the messages are binary, though, as any appearance of \n in the message byte stream will be interpreted as a delimiter by the producer.
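The failure mode can be shown without a broker at all, using awk as a stand-in for any line-delimited reader (which is how the producer reads its stdin by default):

```shell
# One payload containing an embedded '\n' arrives as two records when the
# stream is split on newlines, which is exactly what happens to binary
# messages in the naive consumer -> producer pipe.
printf 'one\nmessage' | awk 'END { print NR }'   # prints 2
```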

I tried changing the delimiter to something unlikely to be found in the bytestream (e.g., an equivalent of -D "__ENDMESSAGE__"). The consumer understands this nicely, but it looks like the producer just takes the first character passed with the -D option, not the full string.

My current workaround is to set a delimiter on the consumer end with -D, then pipe the output to awk to write it out as a set of temporary files (one per message), and then run the producer with the list of those files (see here for the resulting shell script). This requires the files to hit the disk, though, and isn't quite as elegant as it could be.
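A condensed sketch of that workaround (delimiter, directory, and topic names are examples; it assumes an awk such as GNU awk that treats a multi-character RS as a regular expression, and payloads free of NUL bytes, which awk cannot hold):

```shell
#!/bin/sh
# Split a delimited consumer stream into one temp file per message, then
# produce each file as a whole message.
DELIM="__ENDMESSAGE__"
OUTDIR=$(mktemp -d)

split_messages() {
  # Multi-character RS is treated as a regex, so each record is one full
  # message, embedded newlines included. Empty trailing records are dropped.
  awk -v RS="$DELIM" -v dir="$OUTDIR" '
    length($0) > 0 { printf "%s", $0 > (dir "/msg." NR) }'
}

# Real usage would look like:
#   kafkacat -C -e -b localhost -t src -D "$DELIM" | split_messages
#   kafkacat -P -b localhost -t dst "$OUTDIR"/msg.*
```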

My question -- is there a better/easier way to do this? If not, are there obstacles to extending -D in producer mode to take a string as a delimiter, rather than a single character?

mjuric avatar Jun 06 '18 00:06 mjuric

I have this issue too, and I've found a PR that is supposed to allow multi-byte delimiters on the consumer side (https://github.com/edenhill/kafkacat/pull/150). However, it hasn't been merged yet, and for me the fork segfaults.

ohemelaar avatar Jul 08 '19 07:07 ohemelaar

Hi! Thx for sharing the shell script, @mjuric. I might have missed something, but it's not working for me.

kafkacat -C -e -b localhost -t "$SRC" -D "$DELIM"

I just wonder: if -D does not support multi-byte delimiters, how would that work?

thomasklein avatar Aug 21 '19 12:08 thomasklein

It supports multi-byte delimiters in consumer mode, but not when producing. At least that was the state as of a year ago when I wrote the hack; I'm not sure whether something has changed in the meantime.

mjuric avatar Aug 21 '19 16:08 mjuric

Maybe a more general solution to this problem is to allow loading/dumping data in base64 format. It is inefficient, but robust and easy to implement.
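As a standalone sketch of why that would be robust (base64 is not a kafkacat feature as of this thread; `base64 -d` and the no-wrap flag `-w0` are GNU coreutils options):

```shell
# A payload with an embedded '\n' becomes a single delimiter-free ASCII
# line after encoding, and decodes back byte-for-byte. For payloads whose
# encoding exceeds 76 characters, add -w0 to disable GNU base64's wrapping.
encoded=$(printf 'line1\nline2' | base64)
printf '%s' "$encoded" | base64 -d   # restores the original bytes exactly
```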

igorcalabria avatar Jan 08 '20 17:01 igorcalabria

The other approach I was hoping would work would be to use the JSON wrapper, but the Producer doesn't seem to unwrap from it. If I do:

kafkacat -b mybroker -C -t mytopic -o $START -c $COUNT -J

The binary gets encoded into newline delimited JSON messages. But if I then pipe that into a producer with:

kafkacat -b mybroker -C -t mytopic -o $START -c $COUNT -J | kafkacat -b mybroker -P -t mytopic2 -J

The messages on mytopic2 are still JSON encoded. If the producer honored the -J flag (ignoring the topic, partition, offset, and timestamp, and producing from the decoded key and payload), you'd be able to round-trip binary data this way.
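For reference, a sketch of doing that unwrapping outside kafkacat today. The envelope field names `key` and `payload` are assumptions about the `-J` output shape, and, as noted in this thread, the result still inherits the delimiter problem for payloads that themselves contain tabs or newlines:

```shell
# Re-emit "key<TAB>payload" lines from JSON envelopes on stdin; a producer
# reading this would run with -K "$(printf '\t')" and the default '\n'
# message delimiter.
unwrap() {
  python3 -c '
import json, sys
for line in sys.stdin:
    msg = json.loads(line)
    key = msg.get("key") or ""
    payload = msg.get("payload") or ""
    sys.stdout.write(key + "\t" + payload + "\n")
'
}

# Hypothetical usage:
#   kafkacat -b mybroker -C -t mytopic -J | unwrap |
#     kafkacat -b mybroker -P -t mytopic2 -K "$(printf '\t')"
```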

AusIV avatar Oct 28 '20 16:10 AusIV

Also, I'm not sure a multicharacter delimiter would be a huge help in my case, as I'm not sure I can identify a sequence of characters that could not appear in my messages.

AusIV avatar Oct 28 '20 16:10 AusIV

It has been implemented on both the consumer and producer sides, as kafkacat ... -K "KEY_DELIM" -D "ITEM_DELIM"; see https://github.com/edenhill/kafkacat/blob/master/tests/0002-delim.sh

JakkuSakura avatar Feb 20 '21 17:02 JakkuSakura

I have created a pull request for this. Vote for me. #295

JakkuSakura avatar Feb 21 '21 06:02 JakkuSakura

Bump. It's now 2023 and this is still not possible? It should be a no-brainer.

MartinSoto avatar May 10 '23 10:05 MartinSoto