Add support of custom codecs in topics
@alex268 Could you please clarify a few points?
1. As I understand it, I can write a message to a topic using the codec ydb.topic.description.Codec.CUSTOM. Should we let users write custom encoders and decoders by implementing some interface?
2. Should the code allow an arbitrary number of encoders and decoders? For example, users create custom encoder1, decoder1, encoder2, and decoder2 in code. Topic 1 uses encoder1 and decoder1; topic 2 uses encoder2 and decoder2.
Right now it is not possible to use custom codecs in the Java SDK. The main reason is that the set of available codecs is limited by the enum ydb.topic.description.Codec, and readers/writers support only 4 codec values: raw, gzip, lzop and zstd. In the future we plan to add a codec registry to the topic client and replace the enum with an integer constant. You will then be able to register your own codec in the registry and use custom code for it.
@alex268 I also have some questions:
For example, I have two topics, orders and places. I want to use a custom LZ4 algorithm in orders and the Brotli algorithm in places.
I looked under the hood and found that the codec is stored in the _codec field of YdbTopic.StreamReadMessage.ReadResponse.Batch and takes the values 1-4 or 10000 (which means a custom codec).
For backward compatibility, these values will not change, will they?
And a new class, for example CodecRegistry, will have something like register(topic, interface_custom_codec), which binds interface_custom_codec to 1..N topics. For example: register(orders, LZ4codec), register(places, Brotlicodec).
Disclaimer: I have some thoughts on how to do this. I can open a draft pull request for you to look at. If you have other ideas on how to do it, just discard mine.
@ekuvardin, about the proto messages: codec values 1-10000 are reserved for native support in YDB. Yes, right now only 4 codecs are implemented, but their number may grow. Any value > 10000 is reserved for custom codecs. So, if you want to use your own custom codec, you have to encode your message and send it with codec 10001 or any other such value. When your reader receives this message, it gets the encoded body and the codec you specified, which is enough for decoding. I would like to add a CodecRegistry to TopicClient; that registry will contain a mapping (integer code) -> (encoder/decoder). We could then specify the codec code for every separate topic reader/writer.
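The mapping (integer code) -> (encoder/decoder) described above could be sketched roughly as follows. This is only an illustration, not the SDK's API: the Codec interface, the CodecRegistry shape and the toy ReverseCodec are all hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical codec abstraction; the real SDK interface may look different.
interface Codec {
    byte[] encode(byte[] raw);
    byte[] decode(byte[] encoded);
}

// Hypothetical registry: maps an integer codec code to its encoder/decoder.
final class CodecRegistry {
    private final Map<Integer, Codec> codecs = new HashMap<>();

    void register(int code, Codec codec) {
        codecs.put(code, codec);
    }

    Codec get(int code) {
        Codec codec = codecs.get(code);
        if (codec == null) {
            throw new IllegalArgumentException("No codec registered for code " + code);
        }
        return codec;
    }
}

// Toy codec used only for the demonstration: reverses the byte order.
final class ReverseCodec implements Codec {
    @Override
    public byte[] encode(byte[] raw) {
        return reverse(raw);
    }

    @Override
    public byte[] decode(byte[] encoded) {
        return reverse(encoded);
    }

    private static byte[] reverse(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[in.length - 1 - i];
        }
        return out;
    }
}
```

With such a registry, a writer would call registry.get(10001).encode(payload) and send the result tagged with code 10001, and a reader would call registry.get(code).decode(body) using the code that arrives with the message.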
@alex268 I am writing in English for future readers (I can switch to Russian).
Proposal
As a user, I want to easily specify a custom codec in Writer and Reader. After some weeks I finally understand what, to me, is a good interface. For example, I have an interface for a custom codec (the name of the interface can be changed):
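public interface TopicCodec {
    // Wraps the encoded input so that reads return decoded bytes.
    InputStream decode(ByteArrayInputStream byteArrayInputStream) throws IOException;
    // Wraps the output so that bytes written to the result are stored encoded.
    OutputStream encode(ByteArrayOutputStream byteArrayOutputStream) throws IOException;
}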
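As an illustration of how such an interface might be implemented, here is a minimal sketch backed by the GZIP streams from java.util.zip. GZIP only stands in for LZ4 or Brotli, which need third-party libraries. The names GzipTopicCodec and CodecRoundTrip are hypothetical, and the interface is repeated so that the block compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Interface from the proposal, repeated so the sketch compiles on its own.
interface TopicCodec {
    InputStream decode(ByteArrayInputStream input) throws IOException;
    OutputStream encode(ByteArrayOutputStream output) throws IOException;
}

// Hypothetical implementation that delegates to the JDK GZIP streams.
final class GzipTopicCodec implements TopicCodec {
    @Override
    public InputStream decode(ByteArrayInputStream input) throws IOException {
        return new GZIPInputStream(input);
    }

    @Override
    public OutputStream encode(ByteArrayOutputStream output) throws IOException {
        return new GZIPOutputStream(output);
    }
}

// Hypothetical helpers showing how a writer and a reader could drive the codec.
final class CodecRoundTrip {
    // Encodes the payload into a byte array using the given codec.
    static byte[] encode(TopicCodec codec, byte[] payload) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = codec.encode(sink)) {
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Decodes a byte array that was produced by encode() with the same codec.
    static byte[] decode(TopicCodec codec, byte[] encoded) {
        try (InputStream in = codec.decode(new ByteArrayInputStream(encoded))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The encoding stream has to be closed before the bytes are taken from the sink, because GZIP writes its trailer on close; the try-with-resources block in encode() takes care of that.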
Writer
WriterSettings already has the method setCodec. For me, it is good to have one place to specify both the codec ID and the custom codec. For example:
public Builder setCodec(int codec, TopicCodec topicCodec) {
    this.codec = codec;
    this.topicCodec = topicCodec;
    return this;
}
In the end, the code looks like this:
WriterSettings settings = WriterSettings.newBuilder()
        .setTopicPath(topicName)
        .setCodec(codecId, codec)
        .build();
SyncWriter writer = client.createSyncWriter(settings);
Reader
ReaderSettings readerSettings = ReaderSettings.newBuilder()
        .addTopic(TopicReadSettings.newBuilder().setPath(topicName).build())
        .setConsumerName(TEST_CONSUMER1)
        .setCodec(codecId, codec)
        .build();
SyncReader reader = client.createSyncReader(readerSettings);
Why it's a good solution
- WriterSettings and ReaderSettings are stable interfaces that users are already used to.
- Using Settings, I can have a separate codec for every topic.
- There is no global registry, and only the user is responsible for configuring the reader/writer correctly.
- There is one method for specifying both the codec ID and the custom codec. The user calls setCodec() for readers and writers, so it is easy to find where to specify a custom codec.
Tests
Ability to use a custom codec for write and read
- User creates a topic
- User creates a custom codec
- Write data using the custom codec
- Read data without errors using the custom codec
Ability to use different custom codecs for write and read in one client
- User creates topic1
- User creates topic2
- User creates custom codec1
- User creates custom codec2
- User writes data to topic1 using codec1 in the client
- User writes data to topic2 using codec2 in the client
- User reads data from topic1 using codec1
- User reads data from topic2 using codec2
Fail when we try to use a codec that can't decode the data
- User creates a topic
- User creates custom codec1
- User creates custom codec2
- Write data using custom codec1
- Read data using custom codec2. Decoding fails
Read succeeds even if we specify the wrong codec id
- User creates a topic
- User creates custom codec1
- Write data using custom codec1 and id 10013
- Read data using custom codec1 and id 19999. The test should pass because the encode and decode logic is correct
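The last two scenarios can be modelled without a running YDB. The sketch below is only a model of the proposal, in which the reader decodes with the codec taken from its own settings and does not compare codec ids. TestCodec, XorCodec, GzipCodec and InMemoryTopic are hypothetical names, and XOR and GZIP merely stand in for two incompatible custom codecs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical byte-array codec used to model the scenarios in memory.
interface TestCodec {
    byte[] encode(byte[] raw);
    byte[] decode(byte[] encoded);
}

// codec1: XOR with a fixed key; encoding and decoding are the same operation.
final class XorCodec implements TestCodec {
    private static final byte KEY = 0x5A;

    @Override
    public byte[] encode(byte[] raw) {
        return xor(raw);
    }

    @Override
    public byte[] decode(byte[] encoded) {
        return xor(encoded);
    }

    private static byte[] xor(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (byte) (in[i] ^ KEY);
        }
        return out;
    }
}

// codec2: GZIP; decoding data that is not in GZIP format throws.
final class GzipCodec implements TestCodec {
    @Override
    public byte[] encode(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    @Override
    public byte[] decode(byte[] encoded) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(encoded))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// In-memory stand-in for a topic: keeps the writer's codec id next to each encoded body.
final class InMemoryTopic {
    private final List<Integer> codecIds = new ArrayList<>();
    private final List<byte[]> bodies = new ArrayList<>();

    void write(int codecId, TestCodec codec, byte[] payload) {
        codecIds.add(codecId);
        bodies.add(codec.encode(payload));
    }

    // The reader's codec id is accepted but deliberately not compared with the stored id.
    byte[] read(int index, int readerCodecId, TestCodec codec) {
        return codec.decode(bodies.get(index));
    }
}
```

Writing with XorCodec under id 10013 and reading with XorCodec under id 19999 returns the original payload, while reading the same message with GzipCodec throws, which matches the expected outcomes of the two scenarios.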
I have made some drafts with tests in my own branch. I can open a test pull request to make the code changes easier to understand.
It looks good, but there are still some problems:
- ReaderSettings doesn't have a codec option because different messages can be encoded with different codecs. For example, one writer can write raw messages to the topic and another writer can write messages encoded with gzip to the same topic, and a single reader must read and decode all of them. So we have to use some kind of registry to keep all client codecs in one place.
- If we are already forced to keep a separate registry (I would like to add it to TopicClient), why must we pass both the codec code and the codec class? We can pass only the codec code, and the implementation will be taken from the registry.
There are also other benefits of the registry: we can provide the standard codecs in the SDK, and we can validate client custom codecs (the code has to be > 10000, there should be no duplicate codes, etc.).
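A rough sketch of these two points, one reader decoding messages written with different codecs and the registry validating custom codes, might look like this. It is an illustration under stated assumptions rather than the planned SDK design: ValidatingCodecRegistry is a hypothetical name, only decoders are modelled, and the numeric codes 1 = raw and 2 = gzip are assumed for the built-in codecs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.UnaryOperator;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical registry with built-in decoders and validation of custom codes.
final class ValidatingCodecRegistry {
    // Assumed numeric codes for the built-in codecs; codes up to 10000 are reserved for YDB.
    static final int RAW = 1;
    static final int GZIP = 2;
    static final int MAX_RESERVED = 10000;

    private final Map<Integer, UnaryOperator<byte[]>> decoders = new HashMap<>();

    ValidatingCodecRegistry() {
        decoders.put(RAW, UnaryOperator.identity());
        decoders.put(GZIP, ValidatingCodecRegistry::gunzip);
    }

    // Custom decoders must use a code above the reserved range and must not reuse a code.
    void registerCustom(int code, UnaryOperator<byte[]> decoder) {
        Objects.requireNonNull(decoder, "decoder");
        if (code <= MAX_RESERVED) {
            throw new IllegalArgumentException("Custom codec code must be > " + MAX_RESERVED);
        }
        if (decoders.putIfAbsent(code, decoder) != null) {
            throw new IllegalArgumentException("Codec code " + code + " is already registered");
        }
    }

    // A reader decodes each message with the codec code attached to that message.
    byte[] decode(int code, byte[] body) {
        UnaryOperator<byte[]> decoder = decoders.get(code);
        if (decoder == null) {
            throw new IllegalArgumentException("Unknown codec code " + code);
        }
        return decoder.apply(body);
    }

    // Helper that produces GZIP data, used here only to create a sample message.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    static byte[] gunzip(byte[] encoded) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(encoded))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the lookup happens per message, the same reader can handle a raw message and a gzip message from one topic, and a custom codec only needs to be registered once per client.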
- That's cool. I didn't know about this feature. Now I understand why the driver doesn't pass codecs to reader settings.
- Answer to question 2.
"If we are already forced to keep a separate registry"
Yes, you are correct. My proposal takes a different approach.
- What I don't like in the solution with CodecRegistry (this is more about the code than about the user experience):
CodecRegistry is local to TopicClient, but the encode and decode logic ends up in the Encoder class, so I have to pass CodecRegistry to the Encoder class. I have to:
- Pass CodecRegistry to TopicClient
- Pass CodecRegistry into every ReaderImpl/WriterImpl
- In WriterImpl pass it to PartitionSessionImpl for decode(Batch)
It's annoying :) The first time I tried to implement CodecRegistry I didn't like what I had done; that's why I started looking for alternatives.
I was thinking about how to replace the separate register and setCodec calls with a single call. That's why I thought of extending setCodec.
The version with CodecRegistry also looks good:
client.registerCodec(10103, customImpl);
WriterSettings settings = WriterSettings.newBuilder()
        .setTopicPath(topicName)
        .setCodec(codecId)
        .build();
- I think I now understand your position on CodecRegistry; I will try to implement it.