Description

Currently, data within one SST file (stored in Parquet format) is ordered by the timestamp column by default, with each tag/field stored as its own column.

Timestamp	Device ID	Status Code	Tag 1	Tag 2
12:01	A	0	v1	v1
12:01	B	0	v2	v2
12:02	A	0	v1	v1
12:02	B	1	v2	v2
12:03	A	0	v1	v1
12:03	B	0	v2	v2
…

This design is good for OLAP queries, as it only scans the relevant columns, and CeresDB can take advantage of this ordering to filter out unnecessary files, reducing IO further.

But for time-series use cases like IoT or DevOps, this may not be the best format. Those queries typically group their results first by series id (or device id), then by timestamp. This ordering does not match the SST ordering, so many random IOs are incurred.

A general approach is to store the data twice: one copy ordered by timestamp first, and the other ordered by series id first.

Obviously this isn't very cost-effective, and it requires some replication algorithm to keep the two copies in sync, which is very error-prone. It would be best to solve this ordering issue within one format.

Proposal

This issue proposes one potential hybrid format (OLAP and time-series):

Device ID	Timestamp	Status Code	Tag 1	Tag 2	minTime	maxTime
A	[12:01,12:02,12:03]	[0,0,0]	v1	v1	12:01	12:03
B	[12:01,12:02,12:03]	[0,1,0]	v2	v2	12:01	12:03
…

In the above schema, instead of storing timestamps row by row, we put all timestamps of one device id into a single array. The corresponding field values are stored as arrays too, so the two can be mapped by position. The table is ordered by device ID.
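As a rough illustration of the layout change (not CeresDB's actual writer), the sketch below groups the row-oriented sample into one record per device. The function name `to_hybrid` and the dict keys are hypothetical, and it assumes tag columns are constant within a device so they can be collapsed to a single value.

```python
from collections import defaultdict

# Row-oriented input mirroring the first table:
# (timestamp, device_id, status_code, tag1, tag2)
rows = [
    ("12:01", "A", 0, "v1", "v1"),
    ("12:01", "B", 0, "v2", "v2"),
    ("12:02", "A", 0, "v1", "v1"),
    ("12:02", "B", 1, "v2", "v2"),
    ("12:03", "A", 0, "v1", "v1"),
    ("12:03", "B", 0, "v2", "v2"),
]


def to_hybrid(rows):
    """Group rows by device ID and collapse timestamps/fields into arrays."""
    grouped = defaultdict(list)
    for ts, device, status, tag1, tag2 in rows:
        grouped[device].append((ts, status, tag1, tag2))

    hybrid = []
    for device in sorted(grouped):        # the table is ordered by device ID
        points = sorted(grouped[device])  # order points by timestamp
        timestamps = [p[0] for p in points]
        hybrid.append({
            "device_id": device,
            "timestamp": timestamps,
            "status_code": [p[1] for p in points],
            # Assumption: tags are constant per device, so keep one value.
            "tag1": points[0][2],
            "tag2": points[0][3],
            "min_time": timestamps[0],
            "max_time": timestamps[-1],
        })
    return hybrid


hybrid_rows = to_hybrid(rows)
for record in hybrid_rows:
    print(record)
```

The timestamp and status code arrays have the same length, so index `i` in one corresponds to index `i` in the other.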

In this way, we avoid random IO when querying one specific device, since its data is stored together. The format is also beneficial for OLAP queries, since the reader can use minTime/maxTime to skip unnecessary chunks.
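A minimal sketch of how a reader might use minTime/maxTime to skip chunks is shown below. It is an illustration under assumptions, not the actual read path: `overlaps` and `scan` are hypothetical names, and device "C" is an invented extra row added only so that one chunk falls outside the query range.

```python
# Hybrid rows as in the table above; device "C" is a hypothetical extra row.
# Zero-padded "HH:MM" strings compare lexicographically in time order.
chunks = [
    {"device_id": "A", "timestamp": ["12:01", "12:02", "12:03"],
     "status_code": [0, 0, 0], "min_time": "12:01", "max_time": "12:03"},
    {"device_id": "B", "timestamp": ["12:01", "12:02", "12:03"],
     "status_code": [0, 1, 0], "min_time": "12:01", "max_time": "12:03"},
    {"device_id": "C", "timestamp": ["12:10", "12:11"],
     "status_code": [0, 0], "min_time": "12:10", "max_time": "12:11"},
]


def overlaps(chunk, start, end):
    """True if [min_time, max_time] intersects the inclusive query range."""
    return chunk["min_time"] <= end and chunk["max_time"] >= start


def scan(chunks, start, end):
    """Return matching points and the number of chunks skipped via min/max."""
    points, skipped = [], 0
    for chunk in chunks:
        if not overlaps(chunk, start, end):
            skipped += 1  # no need to read this chunk's arrays at all
            continue
        for ts, status in zip(chunk["timestamp"], chunk["status_code"]):
            if start <= ts <= end:
                points.append((chunk["device_id"], ts, status))
    return points, skipped


points, skipped = scan(chunks, "12:02", "12:03")
print(points, skipped)
```

Only chunks whose time range intersects the query are opened; the per-point check inside the arrays is still needed because a chunk can overlap the range only partially.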

Additional context

Some references

Building columnar compression in a row-oriented database

Jun 30 '22 16:06 jiacai2050

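I ran a benchmark in my local environment. The hybrid format reads roughly twice as fast as the old one (511ms to 615ms versus 1270ms to 1304ms for the same 10367743 rows).

The tables below summarize the read cost for each format (each one is read ten times).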

Hybrid

cost	row nums
615ms	10367743
576ms	10367743
585ms	10367743
511ms	10367743
558ms	10367743
569ms	10367743
568ms	10367743
555ms	10367743
557ms	10367743
584ms	10367743

Old

cost	row nums
1304ms	10367743
1283ms	10367743
1276ms	10367743
1286ms	10367743
1275ms	10367743
1272ms	10367743
1273ms	10367743
1275ms	10367743
1275ms	10367743
1270ms	10367743
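To put a single number on the two tables, the short script below averages the ten runs of each format and computes the ratio. The timings are copied from the tables above; the variable names are only for illustration.

```python
from statistics import mean

# Read cost in milliseconds, copied from the two tables above.
hybrid_ms = [615, 576, 585, 511, 558, 569, 568, 555, 557, 584]
old_ms = [1304, 1283, 1276, 1286, 1275, 1272, 1273, 1275, 1275, 1270]

hybrid_mean = mean(hybrid_ms)
old_mean = mean(old_ms)
speedup = old_mean / hybrid_mean

print(f"hybrid mean: {hybrid_mean:.1f}ms")
print(f"old mean:    {old_mean:.1f}ms")
print(f"speedup:     {speedup:.2f}x")
```

The means come out to 567.8ms for the hybrid format and 1278.9ms for the old one, a speedup of about 2.25x on this data set.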

How it was tested

My test environment is:

Linux 5.17.7-arch1-1 SMP PREEMPT Thu, 12 May 2022 18:55:54 +0000 x86_64 GNU/Linux
6c16g (6 cores, 16 GB memory)
commit: https://github.com/jiacai2050/ceresdb/tree/d9577d5d417a811d37ff54239b81b44eff1f499c
- bench-hybrid.rs

Data is generated using tsbs with the config below:

data-source:
  simulator:
    debug: 0
    initial-scale: "0"
    log-interval: 10s
    max-data-points: "0"
    max-metric-count: "1"
    scale: "50000"
    seed: 100
    timestamp-start: "2022-07-02T00:00:00Z"
    timestamp-end: "2022-07-02T01:00:00Z"
    use-case: devops-generic
  type: SIMULATOR

This means the generated data source contains one metric over one hour, with a point interval of 10s and 50k series in total.

Data sample

{
  "arch": "x86",
  "region": "ap-southeast-1",
  "service_environment": "test",
  "team": "SF",
  "value": 473.0,
  "service_version": "0",
  "datacenter": "ap-southeast-1b",
  "timestamp": 1656720000000,
  "os": "Ubuntu16.04LTS",
  "hostname": "host_3349",
  "rack": "80",
  "service": "6",
  "tsid": 1123006250071095
}

Next step

Rebase onto upstream master, then apply this hybrid format to string columns (currently only fixed-length columns have been tested).

Jul 18 '22 08:07 jiacai2050

Checklist

[x] Write #185
[x] Read https://github.com/CeresDB/ceresdb/pull/208
[x] Add table option for storage format https://github.com/CeresDB/ceresdb/pull/218
[ ] Docs https://github.com/CeresDB/ceresdb/pull/222

Aug 17 '22 02:08 jiacai2050

There are some more things that need to be done for good performance. I'm leaving them here as a note for myself, and I hope others who are interested can get involved.

Write

Support variable-length type for ListArray
Support table without tsid, only a row id is required

Read

Support basic read (without any filter pushdown), WIP
Support timestamp column filter, some extra columns may be needed
Support variable-length type for ListArray
Enable a total ordering, to support query with pagination
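For the basic read and total ordering items, one possible shape is sketched below: expand each hybrid row back into flat rows, then sort by a key that is unique per row so that pagination is stable. This is only an assumption about how it could work; `expand`, `page`, and the choice of `(timestamp, device_id)` as the ordering key are hypothetical.

```python
# Hybrid rows as in the proposal table.
hybrid_rows = [
    {"device_id": "A", "timestamp": ["12:01", "12:02", "12:03"],
     "status_code": [0, 0, 0], "tag1": "v1", "tag2": "v1"},
    {"device_id": "B", "timestamp": ["12:01", "12:02", "12:03"],
     "status_code": [0, 1, 0], "tag1": "v2", "tag2": "v2"},
]


def expand(hybrid_rows):
    """Flatten list columns back into one row per point, totally ordered."""
    flat = []
    for row in hybrid_rows:
        for ts, status in zip(row["timestamp"], row["status_code"]):
            # Collapsed tag values are repeated for every expanded row.
            flat.append((ts, row["device_id"], status, row["tag1"], row["tag2"]))
    # (timestamp, device_id) is unique per row here, so the order is total.
    flat.sort(key=lambda r: (r[0], r[1]))
    return flat


def page(rows, offset, limit):
    """Return one page from an already totally ordered row list."""
    return rows[offset:offset + limit]


flat_rows = expand(hybrid_rows)
second_page = page(flat_rows, offset=2, limit=2)
print(second_page)
```

Without a total ordering, two requests for consecutive pages could return overlapping or missing rows, which is why the sort key must break ties between devices that share a timestamp.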

Misc

Ensure the row group size is large enough, in case the list length within the same row_id is too small
Use dictionary array type to represent non-collapsible columns to reduce memory usage
Benchmark the two formats
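The dictionary array idea can be sketched in a few lines: each distinct value is stored once, and the column itself holds small integer indices. The code below is a plain-Python illustration of the encoding, not the Arrow dictionary array API; the function names and the sample values are hypothetical.

```python
def dictionary_encode(values):
    """Store each distinct value once and replace the column with indices."""
    dictionary = []  # distinct values, in first-seen order
    index_of = {}    # value -> position in `dictionary`
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and indices."""
    return [dictionary[i] for i in indices]


# Hypothetical column with many repeated string values.
column = ["ap-southeast-1", "ap-southeast-1", "us-east-1",
          "ap-southeast-1", "us-east-1"]
dictionary, indices = dictionary_encode(column)
print(dictionary, indices)
```

The saving grows with the number of repeats: long strings are stored once each, while every row costs only one integer.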

Aug 24 '22 07:08 jiacai2050

Checklist

[x] Write feat: write hybrid storage format #185

[ ] Read

[ ] Add table option for storage format

[ ] More testcases for write/read

This checklist is outdated.

Aug 24 '22 09:08 chunshao90

[PoC] Hybrid storage format