
Serde improvements

Open sopel39 opened this issue 2 years ago • 1 comments

DictionaryBlockEncoding:

  • we don't need to serialize ids as integers. We can use short or byte if the dictionary has few enough positions for every id to fit
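A minimal sketch of the idea, not Trino's actual `DictionaryBlockEncoding`: the class `NarrowIdCodec`, its methods, and the layout (a 1-byte width tag, a 4-byte count, then the ids) are all assumptions made for illustration. Ids are treated as unsigned, so a dictionary with up to 256 positions fits in a byte and one with up to 65536 positions fits in a short.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not part of Trino.
final class NarrowIdCodec
{
    private NarrowIdCodec() {}

    // Smallest width in bytes (1, 2 or 4) that can hold every id in [0, dictionarySize).
    static int idWidth(int dictionarySize)
    {
        if (dictionarySize <= 1 << 8) {
            return 1;
        }
        if (dictionarySize <= 1 << 16) {
            return 2;
        }
        return 4;
    }

    // Assumed layout: 1-byte width tag, 4-byte id count, then the ids at that width.
    static byte[] writeIds(int[] ids, int dictionarySize)
    {
        int width = idWidth(dictionarySize);
        ByteBuffer out = ByteBuffer.allocate(1 + Integer.BYTES + ids.length * width);
        out.put((byte) width);
        out.putInt(ids.length);
        for (int id : ids) {
            if (width == 1) {
                out.put((byte) id);
            }
            else if (width == 2) {
                out.putShort((short) id);
            }
            else {
                out.putInt(id);
            }
        }
        return out.array();
    }

    // Reads the ids back into an int[], widening narrow ids as unsigned values.
    static int[] readIds(byte[] payload)
    {
        ByteBuffer in = ByteBuffer.wrap(payload);
        int width = in.get();
        int[] ids = new int[in.getInt()];
        for (int i = 0; i < ids.length; i++) {
            if (width == 1) {
                ids[i] = in.get() & 0xFF; // mask undoes sign extension
            }
            else if (width == 2) {
                ids[i] = in.getShort() & 0xFFFF;
            }
            else {
                ids[i] = in.getInt();
            }
        }
        return ids;
    }
}
```

In this sketch the narrow values are widened into a fresh `int[]` while reading, so the serialized bytes themselves stay densely packed.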

VariableWidthBlockEncoding:

  • we don't need to serialize offsets as integers. We can use short if rawSlice is short enough.
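The same approach could apply to offsets. The sketch below is hypothetical and does not reflect how Trino's `VariableWidthBlockEncoding` is actually laid out: it assumes an `offsets` array with one more entry than there are positions, non-decreasing values, and a last entry equal to the length of `rawSlice`. The class name `NarrowOffsetCodec` and the flag byte are likewise assumptions.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not part of Trino.
final class NarrowOffsetCodec
{
    private NarrowOffsetCodec() {}

    // Assumed layout: 1-byte flag (1 = shorts, 0 = ints), 4-byte count, then the offsets.
    static byte[] writeOffsets(int[] offsets)
    {
        // Offsets are assumed non-decreasing, so the last one is the largest
        // and equals the length of the raw data.
        int rawLength = offsets[offsets.length - 1];
        boolean useShort = rawLength <= 0xFFFF;
        int width = useShort ? Short.BYTES : Integer.BYTES;
        ByteBuffer out = ByteBuffer.allocate(1 + Integer.BYTES + offsets.length * width);
        out.put((byte) (useShort ? 1 : 0));
        out.putInt(offsets.length);
        for (int offset : offsets) {
            if (useShort) {
                out.putShort((short) offset);
            }
            else {
                out.putInt(offset);
            }
        }
        return out.array();
    }

    // Reads the offsets back, widening shorts as unsigned values.
    static int[] readOffsets(byte[] payload)
    {
        ByteBuffer in = ByteBuffer.wrap(payload);
        boolean useShort = in.get() == 1;
        int[] offsets = new int[in.getInt()];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = useShort ? in.getShort() & 0xFFFF : in.getInt();
        }
        return offsets;
    }
}
```

With these assumptions, a block whose raw data is at most 65535 bytes long spends 2 bytes per offset instead of 4.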

sopel39 avatar Sep 21 '22 08:09 sopel39

cc @lukasz-stec

sopel39 avatar Sep 21 '22 08:09 sopel39

Hi, @sopel39.

I have some questions about this issue. Please bear with me if they are naive.

  1. (DictionaryBlockEncoding) In ORC and Parquet, the spec defines the elements that make up ids as unsigned integers. Will there be a problem if they are changed to short or byte?
  2. (DictionaryBlockEncoding) Even if ids are changed to a short or byte type, wouldn't deserialization performance decrease, because 2-byte padding must be inserted in the middle of the slice of short/byte elements during deserialization?

I am interested in this issue and want to understand the exact context, which is why I am asking.

leeyh0216 avatar Jan 07 '23 04:01 leeyh0216

(DictionaryBlockEncoding) In ORC and Parquet, the spec defines the elements that make up ids as unsigned integers. Will there be a problem if they are changed to short or byte?

This issue is unrelated to ORC or Parquet; it is about how Trino serializes its own blocks.

(DictionaryBlockEncoding) Even if ids are changed to a short or byte type, wouldn't deserialization performance decrease, because 2-byte padding must be inserted in the middle of the slice of short/byte elements during deserialization?

It's more about reducing the size of the payload. A smaller payload means less processing along the way, which is a win even if CPU usage stays the same.

sopel39 avatar Jan 11 '23 10:01 sopel39