feat: add a size limit on a record batch

Open niebayes opened this issue 1 year ago • 9 comments

What problem does the new feature solve?

Currently, there is no size limit set on a record batch, which has become problematic due to the constraints imposed by the Tonic crate on message sizes for both receiving and sending. Tonic employs a default size limit of 4MB for received messages, defined as the DEFAULT_MAX_RECV_MESSAGE_SIZE constant, and a default maximum send message size of 2^64 - 1 bytes (effectively unlimited), defined as the DEFAULT_MAX_SEND_MESSAGE_SIZE constant.

To address this issue, it's crucial to set a size limit on a record batch to prevent it from exceeding 4MB.
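
The relationship between Tonic's default limit and a record batch cap can be sketched as a simple budget calculation. Note that the headroom factor and the helper function below are assumptions for illustration, not values or APIs from GreptimeDB; only the 4MB constant comes from Tonic's documented default:

```rust
// Tonic's default receive limit (mirrors tonic's DEFAULT_MAX_RECV_MESSAGE_SIZE).
const DEFAULT_MAX_RECV_MESSAGE_SIZE: usize = 4 * 1024 * 1024; // 4 MiB

/// Hypothetical helper: derive a conservative cap for a single record
/// batch, leaving headroom for protobuf framing and other FlightData
/// overhead. The 0.75 headroom factor is an illustrative assumption.
fn record_batch_size_limit(grpc_limit: usize) -> usize {
    (grpc_limit as f64 * 0.75) as usize
}

fn main() {
    let limit = record_batch_size_limit(DEFAULT_MAX_RECV_MESSAGE_SIZE);
    // The cap must stay strictly below the transport limit.
    assert!(limit < DEFAULT_MAX_RECV_MESSAGE_SIZE);
    println!("record batch cap: {limit} bytes");
}
```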

What does the feature do?

Introduce a size limit on a record batch.

Implementation challenges

  • Set a default maximum size for a record batch, which could be either a fixed constant or a configurable option.
  • Utilize this size limit across the codebase to prevent any operations that could lead to the expansion of a record batch beyond the established threshold.
  • Develop a set of unit tests to verify the functionality of the size limit.

niebayes avatar Aug 22 '23 03:08 niebayes

Since the unit of transmission is a FlightData message, which may contain one or more record batches, the size limit should be considerably more conservative than the default maximum set by Tonic.

niebayes avatar Aug 22 '23 03:08 niebayes

I'm interested in this feature.

First, a question: is this feature meant to protect the server by limiting the gRPC request/response size?

If so, I can see two ways to achieve this:

  • Limit http2 frame size (transport level)
  • Limit grpc service encoding/decoding message size (service level)

TheWaWaR avatar Sep 19 '23 12:09 TheWaWaR

@MichaelScofield PTAL

niebayes avatar Sep 20 '23 03:09 niebayes

@TheWaWaR Thanks for your interest. It's not about security; it's about respecting Tonic's configured request/response size limits.

I think the service level limitation is enough.

MichaelScofield avatar Sep 20 '23 04:09 MichaelScofield

Searching the codebase, I found that max_decoding_message_size is already applied to gRPC clients. So, if I understand correctly, the remaining tasks are:

  • Make grpc client max_encoding_message_size configurable
  • Make grpc server max_decoding_message_size/max_encoding_message_size configurable
  • Add tests for both client/server to prove the configuration works

TheWaWaR avatar Sep 20 '23 07:09 TheWaWaR

@TheWaWaR You are right.

MichaelScofield avatar Sep 20 '23 09:09 MichaelScofield

@TheWaWaR Thanks, I've merged your PR. However, to close this issue, I think it's important to set a limit on RecordBatches, too. The RecordBatches are popped from DataFusion, so it requires additional effort to investigate how to restrict their size. Are you willing to continue with that?

MichaelScofield avatar Sep 22 '23 02:09 MichaelScofield

@MichaelScofield

Are you willing to continue to do that?

Yes, why not.

So, I have some new questions:

First, how do we define the size of a RecordBatches? There is no Serialize/Deserialize implementation for RecordBatches, although HttpRecordsOutput has one. And serialization can target multiple formats such as JSON/msgpack/bincode, so this question needs a proper answer first.

Second, what's the intention behind limiting the size of a RecordBatches? Currently, RecordBatches is only ever transferred over gRPC, so why isn't limiting the gRPC message size enough? Are there other places where RecordBatches is used and its size should be limited (other than gRPC)? Or, to ask the other way around: in what cases does an unlimited RecordBatches size become a problem?

TheWaWaR avatar Sep 22 '23 02:09 TheWaWaR

@TheWaWaR Sorry for the late reply. The size of a RecordBatch is indeed hard to estimate. The best guess might be to sum up its vectors' memory_size.
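
The estimation idea above could be sketched roughly as follows. The `Column` and `RecordBatch` types here are stand-ins, not GreptimeDB's actual types; only the `memory_size` method name is taken from the discussion:

```rust
/// Stand-in for a typed column vector that can report its buffer size.
struct Column {
    data: Vec<u8>,
}

impl Column {
    fn memory_size(&self) -> usize {
        self.data.len()
    }
}

/// Stand-in for a record batch holding several column vectors.
struct RecordBatch {
    columns: Vec<Column>,
}

impl RecordBatch {
    /// Best-effort estimate: the sum of each column's memory size.
    /// This ignores serialization overhead, so it is only a lower
    /// bound on the on-the-wire size.
    fn estimated_size(&self) -> usize {
        self.columns.iter().map(|c| c.memory_size()).sum()
    }
}

fn main() {
    let batch = RecordBatch {
        columns: vec![
            Column { data: vec![0; 1024] },
            Column { data: vec![0; 2048] },
        ],
    };
    assert_eq!(batch.estimated_size(), 3072);
}
```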

As to the second question: our gRPC interface does have a size limit now. However, if we feed a huge RecordBatch to the gRPC interface, will it segment the input internally to fit the size limit? A quick search suggests that Tonic does not have this feature. So it would be a problem if the underlying query engine pops a RecordBatch larger than Tonic's size limit: the request would simply fail.

That said, limiting the size of a RecordBatch (or segmenting it) is hard. I think a more proper place to do it is in FlightRecordBatchStream: when the encoded flight data exceeds the gRPC size limit, segment it. This way we avoid calculating the size of a RecordBatch, and since FlightData is a plain protobuf message, segmenting it should be easier.
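
A minimal sketch of the segmentation arithmetic, assuming rows are roughly uniform in size (the function name and heuristic are illustrative, not a GreptimeDB API):

```rust
/// Given a batch's estimated byte size and row count, compute how many
/// rows fit under the transport limit per chunk. Assumes roughly
/// uniform row sizes, which is a simplifying assumption; a real
/// implementation would segment the encoded FlightData instead.
fn rows_per_chunk(total_bytes: usize, num_rows: usize, limit: usize) -> usize {
    if num_rows == 0 || total_bytes <= limit {
        // Fits in one chunk (or the batch is empty).
        return num_rows.max(1);
    }
    // Round up so we never underestimate a row's size.
    let bytes_per_row = total_bytes.div_ceil(num_rows);
    (limit / bytes_per_row).max(1)
}

fn main() {
    // A 10 MiB batch of 10,000 rows against a 4 MiB limit.
    let n = rows_per_chunk(10 * 1024 * 1024, 10_000, 4 * 1024 * 1024);
    assert!(n >= 1 && n < 10_000);
    println!("rows per chunk: {n}");
}
```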

MichaelScofield avatar Sep 26 '23 05:09 MichaelScofield