
[PoC] Hybrid storage format

Open jiacai2050 opened this issue 2 years ago • 1 comments

Description

For now, data within one SST file (currently in Parquet format) is ordered by the timestamp column by default, with each tag/field stored as a column.

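| Timestamp | Device ID | Status Code | Tag 1 | Tag 2 |
|-----------|-----------|-------------|-------|-------|
| 12:01     | A         | 0           | v1    | v1    |
| 12:01     | B         | 0           | v2    | v2    |
| 12:02     | A         | 0           | v1    | v1    |
| 12:02     | B         | 1           | v2    | v2    |
| 12:03     | A         | 0           | v1    | v1    |
| 12:03     | B         | 0           | v2    | v2    |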

This design is good for OLAP queries, as they only scan the relevant columns, and CeresDB can take advantage of the timestamp ordering to skip unnecessary files, reducing IO further.

But for time-series use cases like IoT or DevOps, this may not be the best format. Those queries typically group their results first by series id (or device id), then by timestamp. This ordering doesn't match the ordering of the SST, so many random IOs are incurred.
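To make the mismatch concrete, here is a minimal sketch (not CeresDB code) that counts how many separate contiguous ranges hold the rows of device A under each ordering. The helper `contiguous_runs` is a hypothetical name used only for illustration, and a "run" is only a rough stand-in for one sequential read.

```python
from itertools import groupby

# Rows in timestamp order, as in the table above: (timestamp, device_id).
rows_by_time = [
    ("12:01", "A"), ("12:01", "B"),
    ("12:02", "A"), ("12:02", "B"),
    ("12:03", "A"), ("12:03", "B"),
]


def contiguous_runs(rows, device_id):
    """Count the separate contiguous ranges that contain rows of one device."""
    flags = [dev == device_id for _, dev in rows]
    # groupby merges adjacent equal flags, so each True group is one run.
    return sum(1 for is_match, _ in groupby(flags) if is_match)


# The same rows, ordered by device first and by timestamp second.
rows_by_device = sorted(rows_by_time, key=lambda r: (r[1], r[0]))

runs_time_order = contiguous_runs(rows_by_time, "A")      # 3
runs_device_order = contiguous_runs(rows_by_device, "A")  # 1
```

With timestamp ordering, the rows of one device are interleaved with every other device, so the number of runs grows with the number of timestamps.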

A common approach is to store the data twice: one copy ordered by timestamp first, and the other ordered by series id first.

This isn't very cost-effective, and it requires some replication algorithm to keep the two copies synchronized, which is very error-prone. It would be best if we could solve this ordering issue in one format.

Proposal

This issue proposes one potential hybrid format (OLAP and time-series):

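| Device ID | Timestamp           | Status Code | Tag 1 | Tag 2 | minTime | maxTime |
|-----------|---------------------|-------------|-------|-------|---------|---------|
| A         | [12:01,12:02,12:03] | [0,0,0]     | v1    | v1    | 12:01   | 12:03   |
| B         | [12:01,12:02,12:03] | [0,1,0]     | v2    | v2    | 12:01   | 12:03   |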

In the above schema, instead of storing timestamps row by row, we put all timestamps of one device id into one array. The corresponding values are also stored as arrays, so the i-th value maps to the i-th timestamp. The table is ordered by device ID.
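The collapse from the row-per-point layout to this hybrid layout can be sketched as below. This is only an illustration in plain Python, not the actual Parquet writer: `to_hybrid` and the dictionary keys are hypothetical names, and the sketch assumes tag values are constant within one device, as they are in the example tables.

```python
from collections import defaultdict

# Flat rows: (timestamp, device_id, status_code, tag1, tag2).
flat_rows = [
    ("12:01", "A", 0, "v1", "v1"),
    ("12:01", "B", 0, "v2", "v2"),
    ("12:02", "A", 0, "v1", "v1"),
    ("12:02", "B", 1, "v2", "v2"),
    ("12:03", "A", 0, "v1", "v1"),
    ("12:03", "B", 0, "v2", "v2"),
]


def to_hybrid(rows):
    """Collapse flat rows into one row per device, ordered by device id."""
    groups = defaultdict(list)
    for ts, device, status, tag1, tag2 in rows:
        groups[device].append((ts, status, tag1, tag2))

    hybrid = []
    for device in sorted(groups):
        # Keep points in timestamp order so the arrays stay aligned.
        points = sorted(groups[device], key=lambda p: p[0])
        timestamps = [p[0] for p in points]
        hybrid.append({
            "device_id": device,
            "timestamp": timestamps,
            "status_code": [p[1] for p in points],
            # Assumption: tags do not change within a device.
            "tag1": points[0][2],
            "tag2": points[0][3],
            "min_time": timestamps[0],
            "max_time": timestamps[-1],
        })
    return hybrid


hybrid_rows = to_hybrid(flat_rows)
```

Each field column becomes a list aligned with the timestamp list, while the tags collapse to a single value per device.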

In this way, we can avoid random IO when querying one specific device, since its data is stored together. This format is also beneficial for OLAP queries, since the reader can use minTime/maxTime to skip unnecessary chunks.
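A rough sketch of how a reader might use minTime/maxTime follows. The function `scan` is a hypothetical name, and device C with its 13:00 to 13:02 range is made up here only so that there is something to prune; the real reader works on Parquet row groups rather than Python dictionaries.

```python
# Hybrid rows as in the proposal, plus a made-up device C for illustration.
hybrid_rows = [
    {"device_id": "A", "timestamp": ["12:01", "12:02", "12:03"],
     "status_code": [0, 0, 0], "min_time": "12:01", "max_time": "12:03"},
    {"device_id": "B", "timestamp": ["12:01", "12:02", "12:03"],
     "status_code": [0, 1, 0], "min_time": "12:01", "max_time": "12:03"},
    {"device_id": "C", "timestamp": ["13:00", "13:01", "13:02"],
     "status_code": [0, 0, 1], "min_time": "13:00", "max_time": "13:02"},
]


def scan(rows, start, end):
    """Return (device_id, timestamp, status_code) with start <= ts <= end."""
    result = []
    for row in rows:
        # Prune on min/max first: the list columns are never touched
        # when the time ranges do not overlap.
        if row["max_time"] < start or row["min_time"] > end:
            continue
        for ts, status in zip(row["timestamp"], row["status_code"]):
            if start <= ts <= end:
                result.append((row["device_id"], ts, status))
    return result


late_points = scan(hybrid_rows, "13:00", "13:01")
# [("C", "13:00", 0), ("C", "13:01", 0)]
```

Rows that survive the min/max check still need a per-element filter, because the range only bounds the timestamps inside the list.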

Additional context

Some references

jiacai2050, Jun 30 '22 16:06

I ran a benchmark in my local environment. The hybrid format reads faster than the old one: on average about 570ms versus about 1280ms per read, roughly 2.2x faster.

The tables below summarize the read cost for each format (each is read ten times).

Hybrid

| Cost  | Rows     |
|-------|----------|
| 615ms | 10367743 |
| 576ms | 10367743 |
| 585ms | 10367743 |
| 511ms | 10367743 |
| 558ms | 10367743 |
| 569ms | 10367743 |
| 568ms | 10367743 |
| 555ms | 10367743 |
| 557ms | 10367743 |
| 584ms | 10367743 |

Old

| Cost   | Rows     |
|--------|----------|
| 1304ms | 10367743 |
| 1283ms | 10367743 |
| 1276ms | 10367743 |
| 1286ms | 10367743 |
| 1275ms | 10367743 |
| 1272ms | 10367743 |
| 1273ms | 10367743 |
| 1275ms | 10367743 |
| 1275ms | 10367743 |
| 1270ms | 10367743 |
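For a quick summary of the timings above, the following snippet computes the mean read cost per format and the resulting speedup. The numbers are copied from the two tables; the variable names are only for illustration.

```python
from statistics import mean

# Read cost in milliseconds, copied from the tables above.
hybrid_ms = [615, 576, 585, 511, 558, 569, 568, 555, 557, 584]
old_ms = [1304, 1283, 1276, 1286, 1275, 1272, 1273, 1275, 1275, 1270]

hybrid_mean = mean(hybrid_ms)  # 567.8
old_mean = mean(old_ms)        # 1278.9

# How many times faster the hybrid format reads the same rows.
speedup = old_mean / hybrid_mean

print(f"hybrid: {hybrid_mean:.1f}ms, old: {old_mean:.1f}ms, "
      f"speedup: {speedup:.2f}x")
```

Both formats return the same 10367743 rows, so the ratio of the means compares like with like.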

How it was tested

My test environment:

  • Linux 5.17.7-arch1-1 SMP PREEMPT Thu, 12 May 2022 18:55:54 +0000 x86_64 GNU/Linux
  • 6 cores, 16 GB memory
  • commit: https://github.com/jiacai2050/ceresdb/tree/d9577d5d417a811d37ff54239b81b44eff1f499c

Data is generated using tsbs, with the config below:

data-source:
  simulator:
    debug: 0
    initial-scale: "0"
    log-interval: 10s
    max-data-points: "0"
    max-metric-count: "1"
    scale: "50000"
    seed: 100
    timestamp-start: "2022-07-02T00:00:00Z"
    timestamp-end: "2022-07-02T01:00:00Z"
    use-case: devops-generic
  type: SIMULATOR

This means the generated data source contains:

  • one metric over one hour, with a point interval of 10s and 50k series in total.

Data sample

{
  "arch": "x86",
  "region": "ap-southeast-1",
  "service_environment": "test",
  "team": "SF",
  "value": 473.0,
  "service_version": "0",
  "datacenter": "ap-southeast-1b",
  "timestamp": 1656720000000,
  "os": "Ubuntu16.04LTS",
  "hostname": "host_3349",
  "rack": "80",
  "service": "6",
  "tsid": 1123006250071095
}

Next step

Rebase onto upstream master, and apply this hybrid format to string columns (currently only fixed-length columns are tested).

jiacai2050, Jul 18 '22 08:07

Checklist

  • [x] Write #185
  • [x] Read https://github.com/CeresDB/ceresdb/pull/208
  • [x] Add table option for storage format https://github.com/CeresDB/ceresdb/pull/218
  • [ ] Docs https://github.com/CeresDB/ceresdb/pull/222

jiacai2050, Aug 17 '22 02:08

There are some more things that need to be done for good performance. I'm leaving them here as a note to myself, and I hope others who are interested can get involved.

Write

  • Support variable-length type for ListArray
  • Support table without tsid, only a row id is required

Read

  • Support basic read (without any filter pushdown), WIP
  • Support timestamp column filter, some extra columns may be needed
  • Support variable-length type for ListArray
  • Enable a total ordering, to support query with pagination

Misc

  • Ensure the row group size is large enough, in case the list length within the same row_id is too small
  • Use dictionary array type to represent non-collapsible columns to reduce memory usage.
  • Benchmark the two formats
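As an illustration of the dictionary idea in the Misc list, here is a minimal sketch of dictionary encoding in plain Python. It only shows the concept; the real implementation would presumably rely on Arrow's dictionary array type. `dictionary_encode` is a hypothetical helper, and the `us-east-1` value is made up for the example.

```python
def dictionary_encode(values):
    """Replace repeated values with small integer indices into a dictionary."""
    dictionary = []  # each distinct value, stored once
    index_of = {}    # value -> position in `dictionary`
    indices = []     # one small integer per input value
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


# A low-cardinality string column; "us-east-1" is a made-up second value.
regions = ["ap-southeast-1", "ap-southeast-1", "us-east-1", "ap-southeast-1"]

dictionary, indices = dictionary_encode(regions)
# dictionary == ["ap-southeast-1", "us-east-1"], indices == [0, 0, 1, 0]

# Decoding restores the original column.
decoded = [dictionary[i] for i in indices]
```

The saving comes from storing each distinct string once and keeping only integer indices per row, which helps most when a column has few distinct values.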

jiacai2050, Aug 24 '22 07:08

Checklist

This checklist is outdated.

chunshao90, Aug 24 '22 09:08