horaedb
[PoC] Hybrid storage format
Description
Currently, data within one SST file (stored in Parquet format) is ordered by the timestamp column by default, with each tag/field stored as a column.
Timestamp | Device ID | Status Code | Tag 1 | Tag 2 |
---|---|---|---|---|
12:01 | A | 0 | v1 | v1 |
12:01 | B | 0 | v2 | v2 |
12:02 | A | 0 | v1 | v1 |
12:02 | B | 1 | v2 | v2 |
12:03 | A | 0 | v1 | v1 |
12:03 | B | 0 | v2 | v2 |
… |
This design works well for OLAP queries: only the relevant columns are scanned, and CeresDB can take advantage of the ordering to skip unnecessary files, further reducing IO.
But for time-series use cases such as IoT or DevOps, this may not be the best format. Those queries typically group their results first by series ID (or device ID), then by timestamp. This ordering doesn't match the ordering of the SST, so many random IOs are incurred.
A general approach is to store the data twice: one copy ordered by timestamp first, and the other ordered by series ID first.
This isn't cost-effective, and it requires a replication algorithm to keep the two copies in sync, which is error-prone. It would be best to solve this ordering issue within one format.
Proposal
This issue proposes one potential hybrid format (OLAP and time-series):
Device ID | Timestamp | Status Code | Tag 1 | Tag 2 | minTime | maxTime |
---|---|---|---|---|---|---|
A | [12:01,12:02,12:03] | [0,0,0] | v1 | v1 | 12:01 | 12:03 |
B | [12:01,12:02,12:03] | [0,1,0] | v2 | v2 | 12:01 | 12:03 |
… |
In the above schema, instead of storing timestamps row by row, we put all timestamps of one device ID into a single array, and store the corresponding values as arrays too, so entries can be matched by position. The table is ordered by device ID.
This avoids random IO when querying one specific device, since its data is stored together. The format is also beneficial for OLAP queries, since the reader can use minTime/maxTime to skip unnecessary chunks.
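The transformation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual CeresDB implementation: the function names `to_hybrid` and `prune`, the dictionary keys, and the assumption that tags are constant per device are all hypothetical.

```python
from collections import defaultdict

# Row-oriented input, ordered by timestamp:
# (timestamp, device_id, status_code, tag1, tag2)
rows = [
    ("12:01", "A", 0, "v1", "v1"),
    ("12:01", "B", 0, "v2", "v2"),
    ("12:02", "A", 0, "v1", "v1"),
    ("12:02", "B", 1, "v2", "v2"),
    ("12:03", "A", 0, "v1", "v1"),
    ("12:03", "B", 0, "v2", "v2"),
]


def to_hybrid(rows):
    """Collapse timestamp-ordered rows into one row per device (hypothetical helper)."""
    grouped = defaultdict(lambda: {"timestamps": [], "status_codes": [], "tags": None})
    for ts, device, status, tag1, tag2 in rows:
        entry = grouped[device]
        entry["timestamps"].append(ts)
        entry["status_codes"].append(status)
        entry["tags"] = (tag1, tag2)  # assumption: tags are constant per device

    hybrid = []
    for device in sorted(grouped):  # the hybrid table is ordered by device ID
        entry = grouped[device]
        hybrid.append({
            "device_id": device,
            "timestamps": entry["timestamps"],
            "status_codes": entry["status_codes"],
            "tag1": entry["tags"][0],
            "tag2": entry["tags"][1],
            # "HH:MM" strings compare correctly in lexicographic order
            "min_time": min(entry["timestamps"]),
            "max_time": max(entry["timestamps"]),
        })
    return hybrid


def prune(hybrid, start, end):
    """Keep only rows whose [min_time, max_time] range overlaps [start, end]."""
    return [r for r in hybrid if r["max_time"] >= start and r["min_time"] <= end]


hybrid_rows = to_hybrid(rows)
```

Reading one device now touches a single row, and `prune(hybrid_rows, "12:04", "12:05")` returns an empty list without looking inside any array.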
Additional context
Some references
I ran a benchmark in my local environment. The hybrid format reads about 2.2x faster than the old one (roughly 570ms vs. 1280ms on average).
The tables below summarize the read cost for each format (each is read ten times).
Hybrid
cost | row nums |
---|---|
615ms | 10367743 |
576ms | 10367743 |
585ms | 10367743 |
511ms | 10367743 |
558ms | 10367743 |
569ms | 10367743 |
568ms | 10367743 |
555ms | 10367743 |
557ms | 10367743 |
584ms | 10367743 |
Old
cost | row nums |
---|---|
1304ms | 10367743 |
1283ms | 10367743 |
1276ms | 10367743 |
1286ms | 10367743 |
1275ms | 10367743 |
1272ms | 10367743 |
1273ms | 10367743 |
1275ms | 10367743 |
1275ms | 10367743 |
1270ms | 10367743 |
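
The two tables can be summarized with a short calculation. The snippet below only restates the measurements listed above; the variable names are made up for illustration.

```python
from statistics import mean

# Read cost in milliseconds, copied from the tables above.
hybrid_ms = [615, 576, 585, 511, 558, 569, 568, 555, 557, 584]
old_ms = [1304, 1283, 1276, 1286, 1275, 1272, 1273, 1275, 1275, 1270]

hybrid_mean = mean(hybrid_ms)
old_mean = mean(old_ms)
speedup = old_mean / hybrid_mean  # how many times faster the hybrid format reads

print(f"hybrid: {hybrid_mean:.1f} ms, old: {old_mean:.1f} ms, speedup: {speedup:.2f}x")
# prints "hybrid: 567.8 ms, old: 1278.9 ms, speedup: 2.25x"
```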
How it was tested
My test environment:
- Linux 5.17.7-arch1-1 SMP PREEMPT Thu, 12 May 2022 18:55:54 +0000 x86_64 GNU/Linux
- 6 cores, 16 GB memory
- commit: https://github.com/jiacai2050/ceresdb/tree/d9577d5d417a811d37ff54239b81b44eff1f499c
Data is generated using tsbs, with config below
data-source:
  simulator:
    debug: 0
    initial-scale: "0"
    log-interval: 10s
    max-data-points: "0"
    max-metric-count: "1"
    scale: "50000"
    seed: 100
    timestamp-start: "2022-07-02T00:00:00Z"
    timestamp-end: "2022-07-02T01:00:00Z"
    use-case: devops-generic
  type: SIMULATOR
This means the generated data source contains:
- one metric over one hour, with a 10s point interval and 50k series in total.
Data sample
{
  "arch": "x86",
  "region": "ap-southeast-1",
  "service_environment": "test",
  "team": "SF",
  "value": 473.0,
  "service_version": "0",
  "datacenter": "ap-southeast-1b",
  "timestamp": 1656720000000,
  "os": "Ubuntu16.04LTS",
  "hostname": "host_3349",
  "rack": "80",
  "service": "6",
  "tsid": 1123006250071095
}
Next step
Rebase onto upstream master, then apply this hybrid format to string columns (currently only fixed-length columns have been tested).
Checklist
- [x] Write #185
- [x] Read https://github.com/CeresDB/ceresdb/pull/208
- [x] Add table option for storage format https://github.com/CeresDB/ceresdb/pull/218
- [ ] Docs https://github.com/CeresDB/ceresdb/pull/222
Some more things need to be done for good performance. I'm leaving them here as a note for myself, and I hope others who are interested can get involved.
Write
- Support variable-length types for `ListArray`
- Support tables without tsid; only a `row id` is required
Read
- Support basic read(without any filter pushdown), WIP
- Support timestamp column filter, some extra columns may be needed
- Support variable-length types for `ListArray`
- Enable a total ordering, to support query with pagination
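
As a rough illustration of the timestamp column filter item, the sketch below first skips a whole row using its min/max time and then filters inside the arrays. It is a plain-Python model under assumed names (`filter_by_time`, `min_time`, `max_time`, integer timestamps), not the reader that the linked PRs implement.

```python
# Hybrid rows with integer timestamps; field names are hypothetical.
hybrid_rows = [
    {"device_id": "A", "timestamps": [1, 2, 3], "status_codes": [0, 0, 0],
     "min_time": 1, "max_time": 3},
    {"device_id": "B", "timestamps": [5, 6, 7], "status_codes": [0, 1, 0],
     "min_time": 5, "max_time": 7},
]


def filter_by_time(hybrid_rows, start, end):
    """Return rows restricted to timestamps within [start, end] (inclusive)."""
    result = []
    for row in hybrid_rows:
        # Step 1: skip the whole row when its time range cannot match.
        if row["max_time"] < start or row["min_time"] > end:
            continue
        # Step 2: keep only the array positions whose timestamp matches.
        keep = [i for i, ts in enumerate(row["timestamps"]) if start <= ts <= end]
        if not keep:
            continue
        result.append({
            "device_id": row["device_id"],
            "timestamps": [row["timestamps"][i] for i in keep],
            "status_codes": [row["status_codes"][i] for i in keep],
        })
    return result


filtered = filter_by_time(hybrid_rows, 2, 5)
```

Here device A keeps timestamps 2 and 3, and device B keeps only timestamp 5.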
Misc
- Ensure the row group size is large enough, in case the list length within the same row_id is too small
- Use a dictionary array type to represent non-collapsible columns to reduce memory usage.
- Benchmark the two formats against each other
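
To show why a dictionary array type can reduce memory usage, here is a minimal sketch of dictionary encoding: each distinct value is stored once, and the column keeps small integer indices into that list. The helper names and the sample `region` values are assumptions for illustration; a real implementation would rely on the dictionary array type of the underlying columnar library.

```python
def dictionary_encode(values):
    """Return (dictionary, indices): distinct values once, plus one index per row."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


# Hypothetical tag column with many repeated string values.
region = ["ap-southeast-1", "ap-southeast-1", "us-east-1", "ap-southeast-1"]
dictionary, indices = dictionary_encode(region)
```

The repeated strings are replaced by the indices `[0, 0, 1, 0]`, and `dictionary_decode(dictionary, indices)` gives back the original column.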
Checklist
- [x] Write feat: write hybrid storage format #185
- [ ] Read
- [ ] Add table option for storage format
- [ ] More testcases for write/read
This checklist is outdated.