horaedb
Use a new custom struct for efficient encoding in the write procedure
Describe This Problem
In the current write procedure:
- `RowGroup` is used for the `write` method of the `Table` trait;
- Above `Table.write`, there are two sources converted to `RowGroup`:
  - `RemoteEngineService.write_batch` receives the raw bytes of arrow record batches and converts the record batches to `RowGroup`;
  - `StorageService.write` receives the raw bytes of a custom protobuf struct and converts the protobuf struct to `RowGroup`;
- Under `Table.write`, the `RowGroup` will be encoded into raw bytes for wal logs and memtable rows. The wal log payload doesn't have any special requirement for the encoding method, while the memtable rows require that the `RowGroup` be encoded in rows to keep all rows in primary key order.
Proposal
The description above shows that the write procedure involves too many conversions, which leads to high CPU utilization. This has been observed in the production environment.
Maybe we can use a single struct for the whole write procedure to avoid the extra conversions. For the wal and memtable, I guess we can let the wal log payload share the same encoded bytes used by the memtable. Such a struct must be designed for writing; that is to say, there is no need to include complex schema information.
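To make the idea concrete, here is a minimal sketch of what such a write-only struct could look like. Everything in it is hypothetical and not part of the existing code base: the names `EncodedRows` and `MemTable`, the length-prefixed row layout, and the assumption that primary keys arrive already encoded in a byte-comparable form (for example big-endian integers). Rows are encoded once; the wal payload is the buffer itself, and the toy memtable stores the same row bytes keyed by primary key.

```rust
use std::collections::BTreeMap;

/// Hypothetical write-only batch: rows are encoded exactly once.
/// Assumed row layout: [key_len: u32 LE][key][value_len: u32 LE][value].
#[derive(Debug, Default, Clone, PartialEq)]
pub struct EncodedRows {
    buf: Vec<u8>,
    /// Byte range (start, end) of each row inside `buf`.
    offsets: Vec<(usize, usize)>,
}

impl EncodedRows {
    /// Append one row. `key` is assumed to be byte-comparable already.
    pub fn push_row(&mut self, key: &[u8], value: &[u8]) {
        let start = self.buf.len();
        self.buf.extend_from_slice(&(key.len() as u32).to_le_bytes());
        self.buf.extend_from_slice(key);
        self.buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
        self.buf.extend_from_slice(value);
        self.offsets.push((start, self.buf.len()));
    }

    /// The wal payload is the encoded buffer as is, with no re-encoding.
    pub fn wal_payload(&self) -> &[u8] {
        &self.buf
    }

    /// Rebuild the batch from a wal payload (e.g. during replay).
    /// Returns `None` if the payload is truncated.
    pub fn from_wal_payload(payload: &[u8]) -> Option<Self> {
        let mut offsets = Vec::new();
        let mut pos = 0;
        while pos < payload.len() {
            let start = pos;
            // Skip the key field, then the value field.
            for _ in 0..2 {
                let b = payload.get(pos..pos + 4)?;
                let len = u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize;
                pos += 4 + len;
            }
            if pos > payload.len() {
                return None;
            }
            offsets.push((start, pos));
        }
        Some(Self { buf: payload.to_vec(), offsets })
    }

    /// Iterate over `(primary key bytes, full encoded row bytes)`.
    pub fn rows(&self) -> impl Iterator<Item = (&[u8], &[u8])> + '_ {
        self.offsets.iter().map(move |&(start, end)| {
            let row = &self.buf[start..end];
            let key_len = u32::from_le_bytes([row[0], row[1], row[2], row[3]]) as usize;
            (&row[4..4 + key_len], row)
        })
    }
}

/// Toy memtable: an ordered map from key bytes to encoded row bytes.
#[derive(Default)]
pub struct MemTable {
    rows: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl MemTable {
    /// Insert the already-encoded rows. The bytes are copied, not re-encoded.
    pub fn insert_batch(&mut self, batch: &EncodedRows) {
        for (key, row) in batch.rows() {
            self.rows.insert(key.to_vec(), row.to_vec());
        }
    }

    /// Keys in primary key (byte) order.
    pub fn keys(&self) -> Vec<Vec<u8>> {
        self.rows.keys().cloned().collect()
    }
}
```

In this sketch the ordering requirement of the memtable is met by the ordered map, while the wal only ever sees the opaque buffer. A real design would still need to decide how the byte-comparable key is produced and whether the memtable can reference the buffer instead of copying from it.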
Additional Context
The encoding and decoding of arrow ipc perform very well, and I guess they should serve as a benchmark for the new struct designed for the write procedure.