
Use a new custom struct for efficient encoding in the write procedure

Open ShiKaiWi opened this issue 2 years ago • 0 comments

Describe This Problem

In the current write procedure:

  • RowGroup is the input type of the write method of the Table trait;
  • Above Table.write, two sources are converted to RowGroup:
    • RemoteEngineService.write_batch receives the raw bytes of arrow record batches and converts the record batches to RowGroup;
    • StorageService.write receives the raw bytes of a custom protobuf struct and converts the protobuf struct to RowGroup;
  • Below Table.write, the RowGroup is encoded into raw bytes for both the wal logs and the memtable rows. The wal log payload has no special requirement on the encoding method, whereas the memtable requires the RowGroup to be encoded row by row so that all rows stay in primary key order;

Proposal

The description above shows that the write procedure involves too many conversions, which leads to high CPU utilization; this has been observed in the production environment.

Maybe we can use a single struct for the whole write procedure to avoid the extra conversions. For the wal and the memtable, I guess we can let the wal log payload share the same encoded bytes used by the memtable. Such a struct must be designed for writing; that is to say, there is no need to include complex schema information.
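
To make the idea concrete, here is a minimal sketch of what such a write-only struct could look like. Everything in it is hypothetical and not part of the existing code base: the name EncodedRows, the u64 primary key, and the layout (big-endian key followed by the value bytes) are assumptions chosen only to show rows being encoded once, kept in primary key order, and shared between the wal and the memtable without a copy.

```rust
use std::sync::Arc;

/// Hypothetical write-only batch (not an existing HoraeDB type):
/// rows are encoded once and the same buffer serves the wal and the memtable.
#[derive(Clone, Debug)]
pub struct EncodedRows {
    /// Concatenated row bytes, shared by reference counting instead of copying.
    buf: Arc<[u8]>,
    /// Start offset of every row in `buf`, followed by the final end offset.
    offsets: Vec<usize>,
}

impl EncodedRows {
    /// Encode `(primary key, value)` pairs row by row, sorted by key.
    /// The key is written big-endian so that byte order equals numeric order.
    pub fn encode(mut rows: Vec<(u64, Vec<u8>)>) -> Self {
        rows.sort_by_key(|(key, _)| *key);
        let mut buf = Vec::new();
        let mut offsets = vec![0];
        for (key, value) in &rows {
            buf.extend_from_slice(&key.to_be_bytes());
            buf.extend_from_slice(value);
            offsets.push(buf.len());
        }
        Self { buf: buf.into(), offsets }
    }

    /// Payload for the wal: a new handle to the same bytes, no re-encoding.
    pub fn wal_payload(&self) -> Arc<[u8]> {
        Arc::clone(&self.buf)
    }

    /// Rows for the memtable: slices into the shared buffer, in key order.
    pub fn rows(&self) -> impl Iterator<Item = &[u8]> + '_ {
        self.offsets.windows(2).map(move |w| &self.buf[w[0]..w[1]])
    }
}
```

In this sketch the struct carries no schema at all; a real design would still need some way to handle variable-width keys, nulls, and schema versions, which is left open here.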

Additional Context

The encoding and decoding of arrow ipc perform very well, and I guess it should be a benchmark for the new struct designed for the write procedure.

ShiKaiWi, May 31 '23 02:05