kafka-connect-storage-cloud
Added SafeByteArrayFormat to avoid using separator in S3SinkConnector
Background: We use protobuf as our message schema in Kafka, so messages are stored in Kafka as byte arrays. When we tried to back up our messages with S3SinkConnector, we found that ByteArrayFormat separates messages with a separator. Because a serialized protobuf payload can contain arbitrary bytes, we could not find a separator that is guaranteed never to appear inside a message.
Solution: We followed the suggestion in Google's protocol buffers documentation and created SafeByteArrayFormat. https://developers.google.com/protocol-buffers/docs/techniques?hl=en
SafeByteArrayFormat writes the size of each message before it writes the message itself. To read the messages back from a file, we read the first 4 bytes to get the length of the message, then read that many bytes for the message body, and repeat until the end of the file.
Limitation: A message whose size is larger than the maximum Int value cannot be represented by the 4-byte length prefix.
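The length-prefixed framing described above can be sketched roughly as follows. This is not the code from this PR: the class name `LengthPrefixedFraming` and its method names are hypothetical, and the big-endian byte order of the 4-byte prefix is an assumption, since the description does not state it.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of length-prefixed framing; not the PR's SafeByteArrayFormat.
public class LengthPrefixedFraming {

    // Writes a 4-byte length (big-endian, an assumption) followed by the message body.
    public static void writeRecord(OutputStream out, byte[] message) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(message.length);
        data.write(message);
        data.flush();
    }

    // Reads records until the stream ends cleanly on a record boundary.
    public static List<byte[]> readAll(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<byte[]> messages = new ArrayList<>();
        while (true) {
            int first = data.read();
            if (first < 0) {
                break; // end of stream exactly between two records
            }
            // Remaining 3 bytes of the prefix; a truncated prefix raises EOFException.
            int length = (first << 24)
                    | (data.readUnsignedByte() << 16)
                    | (data.readUnsignedByte() << 8)
                    | data.readUnsignedByte();
            if (length < 0) {
                throw new IOException("Corrupt length prefix: " + length);
            }
            byte[] body = new byte[length];
            data.readFully(body); // a truncated body raises EOFException
            messages.add(body);
        }
        return messages;
    }
}
```

Because the reader never scans the body for a delimiter, the payload may contain any byte values, including newlines or whatever separator ByteArrayFormat would have used.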
@confluentinc It looks like @smallcampus just signed our Contributor License Agreement. :+1:
Always at your service,
clabot
A signed int has 31 value bits, so it can represent a size of up to 2GB, thousands of times more than the default max message size.
Alternatively the size can be a varint to save a few bytes for most messages, with no maximum value limitation. The cost is of course a bit of code complexity.
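To give a sense of the extra complexity, here is a minimal sketch of a varint length prefix, assuming the same base-128 encoding that protobuf uses (7 payload bits per byte, least significant group first, high bit set on every byte except the last). The class `VarintLengthPrefix` and its methods are hypothetical names for illustration and are not part of this PR.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration of a base-128 varint length prefix.
public class VarintLengthPrefix {

    // Encodes a non-negative value using 1 to 9 bytes.
    public static void writeVarint(OutputStream out, long value) throws IOException {
        if (value < 0) {
            throw new IllegalArgumentException("Length must be non-negative: " + value);
        }
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write((int) value); // final byte, continuation bit clear
    }

    // Decodes a varint; fails if the stream ends mid-value or the value is too long.
    public static long readVarint(InputStream in) throws IOException {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            int b = in.read();
            if (b < 0) {
                throw new EOFException("Stream ended inside a varint");
            }
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("Varint is longer than 10 bytes");
    }
}
```

With this encoding, any message shorter than 128 bytes needs a single prefix byte, and one shorter than 16,384 bytes needs two, compared with a fixed four. Note that a Java reader is still bounded by the maximum array size when it allocates the body, so the "no maximum" property applies to the file format rather than to a simple in-memory reader.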
@smallcampus If this is still desired, can you please resolve any conflicts?