
Columnar memtable & Organize data in a columnar way in the write process

chunshao90 opened this issue · 2 comments

Describe This Problem

Organize data in a columnar way in the write process (see #950, columnar memtable) to resolve the following problems:

  1. Currently, for a memtable with many columns, null fields still take up space when rows containing many nulls are written.
  2. Currently, the skiplist memtable stores data in a row-oriented way, and its data compression is poor.

Proposal

  1. Implement a columnar memtable.
  2. Organize data in a columnar way in the write process.

We can refer to influxdb_iox.
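To make the proposal concrete, here is a minimal sketch of a columnar memtable in Rust. The type names (`ColumnBlock`, `ColumnarMemTable`) are illustrative, not the actual HoraeDB types, and the schema is simplified to `i64` columns. Each column stores only its non-null values plus a validity bitmap, so a null cell costs one flag instead of a full value slot, which addresses problem 1 above:

```rust
/// One column of the memtable: densely packed non-null values plus a
/// per-row validity flag (false = null).
struct ColumnBlock {
    values: Vec<i64>,
    validity: Vec<bool>,
}

impl ColumnBlock {
    fn new() -> Self {
        Self { values: Vec::new(), validity: Vec::new() }
    }

    /// Append one cell; nulls only consume a validity flag.
    fn push(&mut self, cell: Option<i64>) {
        match cell {
            Some(v) => {
                self.values.push(v);
                self.validity.push(true);
            }
            None => self.validity.push(false),
        }
    }

    /// Read a cell back by logical row index.
    fn get(&self, row: usize) -> Option<i64> {
        if !self.validity[row] {
            return None;
        }
        // Rank query: the dense index is the number of non-null rows
        // preceding `row`.
        let idx = self.validity[..row].iter().filter(|v| **v).count();
        Some(self.values[idx])
    }
}

/// A memtable holding one ColumnBlock per schema column.
struct ColumnarMemTable {
    columns: Vec<ColumnBlock>,
}

impl ColumnarMemTable {
    fn new(num_columns: usize) -> Self {
        Self {
            columns: (0..num_columns).map(|_| ColumnBlock::new()).collect(),
        }
    }

    /// Append one logical row; `row[i]` goes into column i.
    fn insert_row(&mut self, row: &[Option<i64>]) {
        for (col, cell) in self.columns.iter_mut().zip(row) {
            col.push(*cell);
        }
    }
}
```

A real implementation would use Arrow-style typed builders and a packed bitmap rather than `Vec<bool>`, and would need a rank index to avoid the linear scan in `get`, but the storage layout is the point: a column with mostly nulls stays almost empty.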

Additional Context

I have done a proof of concept (PoC).

row_group vs column_block vs influxdb_iox column

| | grpc write decode | partition table split | wal encode | total |
| --- | --- | --- | --- | --- |
| row group | 177.456918ms | 9.328081ms | 87.680753ms | 274.45ms |
| column block | 229.374211ms | 40.403491ms | 22.222504ms | 291.99ms |
| iox column | 181.631366ms | 37.563124ms | 15.374956ms + 56.706159ms (pb encode) | 291.26ms |

code branch: chunshao90/improve-encoding-in-write-procedure

cargo test --release --workspace test_write_entry_to_row_group_wal -- --nocapture

cargo test --release --workspace test_write_entry_to_column_block_wal -- --nocapture

cargo test --release --workspace test_write_entry_to_column_data_wal_pb -- --nocapture

columnar memtable vs skiplist memtable

normal table

| memtable | rows/s (1W = 10,000) |
| --- | --- |
| row_group + skiplist memtable | 14.99W |
| row_group + columnar memtable | 15.39W |
| column + columnar memtable | 14.91W |

partition table

| memtable | rows/s (1W = 10,000) |
| --- | --- |
| row_group + skiplist memtable | 10.0W |
| row_group + columnar memtable | 9.8W |
| column + columnar memtable | 5.6W |

column + columnar memtable branch: chunshao90/impl-write-with-column, commit: 0ec16ebe7843ea1740d22292612a604aedddd441
row_group + columnar memtable branch: chunshao90/impl-row-group-columnar-memtable, commit: 22d943cea5e99b048f8b4b218bbe23c018d3ce50

tsbs cmd:
/tsbs_load_ceresdb --ceresdb-addr=127.0.0.1:8831 --file ./data.out --batch-size 100 --workers 100 --primary-keys hostname,region,datacenter,rack,os,arch,team,service,service_version,service_environment,timestamp

result:
skiplist memtable
time,per. metric/s,metric total,overall metric/s,per. row/s,row total,overall row/s
1688546946,1585641.79,1.585700E+07,1585641.79,158564.18,1.585700E+06,158564.18
1688546956,1592642.95,3.178300E+07,1589142.26,159264.30,3.178300E+06,158914.23
1688546966,1505925.05,4.684300E+07,1561402.36,150592.50,4.684300E+06,156140.24
1688546976,1393255.67,6.077500E+07,1519367.57,139325.57,6.077500E+06,151936.76

Summary:
loaded 72000000 metrics in 48.030sec with 100 workers (mean rate 1499063.87 metrics/sec)
loaded 7200000 rows in 48.030sec with 100 workers (mean rate 149906.39 rows/sec)


row_group columnar memtable
time,per. metric/s,metric total,overall metric/s,per. row/s,row total,overall row/s
1688546571,1612907.18,1.613100E+07,1612907.18,161290.72,1.613100E+06,161290.72
1688546581,1674263.27,3.287300E+07,1643582.81,167426.33,3.287300E+06,164358.28
1688546591,1617414.15,4.904600E+07,1634860.57,161741.42,4.904600E+06,163486.06
1688546601,1318711.13,6.223300E+07,1555823.93,131871.11,6.223300E+06,155582.39

Summary:
loaded 72000000 metrics in 46.755sec with 100 workers (mean rate 1539946.13 metrics/sec)
loaded 7200000 rows in 46.755sec with 100 workers (mean rate 153994.61 rows/sec)


column + columnar memtable
time,per. metric/s,metric total,overall metric/s,per. row/s,row total,overall row/s
1688547630,1602775.15,1.602800E+07,1602775.15,160277.51,1.602800E+06,160277.51
1688547640,1573511.39,3.176300E+07,1588143.43,157351.14,3.176300E+06,158814.34
1688547650,1484345.30,4.661500E+07,1553530.87,148434.53,4.661500E+06,155353.09
1688547660,1392702.56,6.053400E+07,1513347.04,139270.26,6.053400E+06,151334.70

Summary:
loaded 72000000 metrics in 48.277sec with 100 workers (mean rate 1491384.57 metrics/sec)
loaded 7200000 rows in 48.277sec with 100 workers (mean rate 149138.46 rows/sec)

Summary:
loaded 72000000 metrics in 50.053sec with 100 workers (mean rate 1438472.55 metrics/sec)
loaded 7200000 rows in 50.053sec with 100 workers (mean rate 143847.25 rows/sec)

partition table:
skiplist memtable
time,per. metric/s,metric total,overall metric/s,per. row/s,row total,overall row/s
1688544758,1047144.86,1.047500E+07,1047144.86,104714.49,1.047500E+06,104714.49
1688544768,982702.45,2.029900E+07,1014934.08,98270.25,2.029900E+06,101493.41
1688544778,1028836.24,3.059000E+07,1019568.87,102883.62,3.059000E+06,101956.89
1688544788,1036873.44,4.095600E+07,1023893.85,103687.34,4.095600E+06,102389.38
1688544798,971114.41,5.066700E+07,1013338.14,97111.44,5.066700E+06,101333.81
1688544808,896769.76,5.963500E+07,993909.56,89676.98,5.963500E+06,99390.96
1688544818,1034585.41,6.998100E+07,999720.43,103458.54,6.998100E+06,99972.04

Summary:
loaded 72000000 metrics in 71.941sec with 100 workers (mean rate 1000821.83 metrics/sec)
loaded 7200000 rows in 71.941sec with 100 workers (mean rate 100082.18 rows/sec)

row_group columnar memtable
time,per. metric/s,metric total,overall metric/s,per. row/s,row total,overall row/s
1688544157,1026895.59,1.027100E+07,1026895.59,102689.56,1.027100E+06,102689.56
1688544167,1027156.72,2.054300E+07,1027026.15,102715.67,2.054300E+06,102702.61
1688544177,897363.91,2.951500E+07,983814.14,89736.39,2.951500E+06,98381.41
1688544187,890427.96,3.841900E+07,960468.49,89042.80,3.841900E+06,96046.85
1688544197,977477.31,4.819400E+07,963870.30,97747.73,4.819400E+06,96387.03
1688544207,1005044.19,5.824400E+07,970732.30,100504.42,5.824400E+06,97073.23
1688544217,1031550.07,6.856100E+07,979421.63,103155.01,6.856100E+06,97942.16

Summary:
loaded 72000000 metrics in 73.337sec with 100 workers (mean rate 981764.83 metrics/sec)
loaded 7200000 rows in 73.337sec with 100 workers (mean rate 98176.48 rows/sec)

influxdb iox
time,per. metric/s,metric total,overall metric/s,per. row/s,row total,overall row/s
1688545755,673633.18,6.738000E+06,673633.18,67363.32,6.738000E+05,67363.32
1688545765,432773.08,1.106500E+07,553228.21,43277.31,1.106500E+06,55322.82
1688545775,557813.75,1.664300E+07,554756.66,55781.38,1.664300E+06,55475.67
1688545785,636098.43,2.300600E+07,575096.66,63609.84,2.300600E+06,57509.67
1688545795,492526.40,2.793100E+07,558584.54,49252.64,2.793100E+06,55858.45
1688545805,590577.06,3.383500E+07,563915.01,59057.71,3.383500E+06,56391.50
1688545815,345059.84,3.728700E+07,532639.18,34505.98,3.728700E+06,53263.92
1688545825,358519.76,4.087100E+07,510881.77,35851.98,4.087100E+06,51088.18
1688545835,504602.80,4.591800E+07,510183.99,50460.28,4.591800E+06,51018.40
1688545845,756105.06,5.347700E+07,534769.40,75610.51,5.347700E+06,53476.94

Flamegraphs (attached as images):

  * normal_table_main
  * normal_table_row_group_columnar_memtable
  * normal_table_column_columnar_memtable
  * partition_table_main
  * partition_table_row_group_columnar_memtable
  * partition_table_column_columnar_memtable

— chunshao90, Jul 03 '23

From your experiments, there is no obvious advantage (it even seems worse) to choosing a columnar layout in the write procedure.

— ShiKaiWi, Jul 04 '23

I think a columnar layout in the write procedure will help a lot. The flamegraph shows that Message::decode and write_table_request_to_insert_plan consume a lot of CPU.

If we use a columnar layout in the write procedure:

  1. We can simplify the protobuf message WriteTableRequest by using a few Arrow columns, which simplifies the PB decode procedure.
  2. There is no need to convert WriteTableRequest to a columnar layout in write_table_request_to_insert_plan.
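The conversion cost mentioned in point 2 can be sketched as follows. These are illustrative types, not HoraeDB's actual `WriteTableRequest` or plan structures: a row-oriented request forces the write path to visit every row and transpose it into per-column vectors before an insert plan can be built, which is exactly the work that disappears if the request already carries columns.

```rust
/// A row-oriented write request (simplified): each inner Vec is one row,
/// with one Option<i64> cell per column.
struct RowWriteRequest {
    rows: Vec<Vec<Option<i64>>>,
}

/// A columnar write request: each inner Vec is one column across all rows.
/// If the wire format looked like this, no transpose would be needed.
struct ColumnarWriteRequest {
    columns: Vec<Vec<Option<i64>>>,
}

/// The transpose that write_table_request_to_insert_plan effectively has
/// to perform today for a row-oriented request: O(rows * columns) cell
/// moves on every write.
fn rows_to_columns(req: &RowWriteRequest, num_cols: usize) -> ColumnarWriteRequest {
    let mut columns = vec![Vec::with_capacity(req.rows.len()); num_cols];
    for row in &req.rows {
        for (c, cell) in row.iter().enumerate() {
            columns[c].push(*cell);
        }
    }
    ColumnarWriteRequest { columns }
}
```

With a columnar request the server could hand the decoded columns to the memtable directly, so both the PB decode cost and this transpose drop out of the hot path.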

— zouxiang1993, Jul 05 '23