graphite-clickhouse icon indicating copy to clipboard operation
graphite-clickhouse copied to clipboard

Metrics with NULL values are not processed correctly

Open PhantomPhreak opened this issue 2 years ago • 4 comments

Recently we migrated from graphite-web + whisper to graphite-web + graphite-clickhouse + clickhouse schema, and there are some issues with processing NULL values.

Whisper's *.wsp files has pre-defined resolution:

aggregationMethod: average
maxRetention: 157680000
xFilesFactor: 0.5
fileSize: 12614440

Archive 0
offset: 40
secondsPerPoint: 60
points: 525600
retention: 31536000
size: 6307200

which means that value for each timestamp may be NULL or some actual value, like following:

# whisper-fetch metric.wsp
1642054380      0.000977
1642054440      0.000977
1642054500      0.000977
1642054560      None
1642054620      0.000977
1642054680      0.000977
1642054740      None

In this case, graphite + whisper always return some value for each datapoint, even if all of them are NULLs. This may be used by transformNull() graphite functions to convert them into specific values and fill the gaps if necessary.

graphite-clickhouse + clickhouse has different behavior: in case if at least 1 datapoint has non-NULL value, this value + NULL for other datapoints will be returned. This is query result from ClickHouse and how data is presented in Grafana (exported CSV)

SELECT
    Path,
    groupArray(Time),
    groupArray(Value),
    groupArray(Timestamp)
FROM graphite.data
PREWHERE (Date >= toDate(1642020420)) AND (Date <= toDate(1642023719))
WHERE (Path IN ('metric')) AND ((Time >= 1642020420) AND (Time <= 1642023719))
GROUP BY Path
FORMAT Vertical

Query id: 1eccd72b-9a8d-4825-b7f1-46bc21e5fde7

Row 1:
──────
Path:                  metric
groupArray(Time):      [1642022160]
groupArray(Value):     [0.0009765625]
groupArray(Timestamp): [1642022217]

1 rows in set. Elapsed: 0.020 sec. Processed 40.96 thousand rows, 3.68 MB (2.02 million rows/s., 181.21 MB/s.)
Grafana (exported CSV)

"Time","metric"
2022-01-13 03:47:00,
2022-01-13 03:48:00,
2022-01-13 03:49:00,
2022-01-13 03:50:00,
2022-01-13 03:51:00,
2022-01-13 03:52:00,
2022-01-13 03:53:00,
2022-01-13 03:54:00,
2022-01-13 03:55:00,
2022-01-13 03:56:00,
2022-01-13 03:57:00,
2022-01-13 03:58:00,
2022-01-13 03:59:00,
2022-01-13 04:00:00,
2022-01-13 04:01:00,
2022-01-13 04:02:00,
2022-01-13 04:03:00,
2022-01-13 04:04:00,
2022-01-13 04:05:00,
2022-01-13 04:06:00,
2022-01-13 04:07:00,
2022-01-13 04:08:00,
2022-01-13 04:09:00,
2022-01-13 04:10:00,
2022-01-13 04:11:00,
2022-01-13 04:12:00,
2022-01-13 04:13:00,
2022-01-13 04:14:00,
2022-01-13 04:15:00,
2022-01-13 04:16:00,0.00098
2022-01-13 04:17:00,
2022-01-13 04:18:00,
2022-01-13 04:19:00,
2022-01-13 04:20:00,
2022-01-13 04:21:00,
2022-01-13 04:22:00,
2022-01-13 04:23:00,
2022-01-13 04:24:00,
2022-01-13 04:25:00,
2022-01-13 04:26:00,
2022-01-13 04:27:00,
2022-01-13 04:28:00,
2022-01-13 04:29:00,
2022-01-13 04:30:00,
2022-01-13 04:31:00,
2022-01-13 04:32:00,
2022-01-13 04:33:00,
2022-01-13 04:34:00,
2022-01-13 04:35:00,
2022-01-13 04:36:00,
2022-01-13 04:37:00,
2022-01-13 04:38:00,
2022-01-13 04:39:00,
2022-01-13 04:40:00,
2022-01-13 04:41:00,

In case if selected interval has only NULL values, nothing is returned from ClickHouse to graphite-clickhouse, and from graphite-web to Grafana:

SELECT
    'metric' AS Path,
    groupArray(Time),
    groupArray(Value),
    groupArray(Timestamp)
FROM graphite.data
PREWHERE (Date >= toDate(1642020420)) AND (Date <= toDate(1642023719))
WHERE (Path IN ('metric)) AND ((Time >= 1642020420) AND (Time <= 1642023719))
GROUP BY Path
FORMAT Vertical

Query id: c1d5a623-0d5e-4826-9303-f5af306a7e2c

Ok.

0 rows in set. Elapsed: 0.012 sec. 

Grafana displaying "No data" for this time period, but "No data" and "NULL" values are completely different, and expected behavior in this case - return NULL values for all datapoints.

Real-world problem: we have a metric representing request processing quantiles, and if there are no requests (night time), there are no values for quantiles is calculated. In the monitoring system, tracking this quantiles, we can replace NULL values with 0 using transformNull() , and differentiate the situation, when metric is broken and when there are just no requests served. This logic is broken now, because we may have a sequence of NULL values for the metric, exceeding selected time-window (5 minutes in our case).

I suppose it can be possible to send NULLs for all datapoints from graphite-clickhouse even if there are empty result returned from the ClickHouse, if we're using rollup config and already know the metric's precision for selected time interval.

We're using "dummy" rollup configuration for now:

SELECT *
FROM system.graphite_retentions

Query id: ce6cd594-5498-47fd-a3a4-d54adade8d1f

┌─config_name─────┬─regexp─┬─function─┬─age─┬─precision─┬─priority─┬─is_default─┬─Tables.database─┬─Tables.table───┐
│ graphite_rollup │        │ any      │   0 │        60 │    65535 │          1 │ ['graphite']    │ ['data_local'] │
└─────────────────┴────────┴──────────┴─────┴───────────┴──────────┴────────────┴─────────────────┴────────────────┘

1 rows in set. Elapsed: 1.548 sec.

Or, if there are any other way for workaround - i'd be happy to know.

And thank you for the amazing go-graphite stack :)

PhantomPhreak avatar Jan 14 '22 07:01 PhantomPhreak

Hi. To have proper null-points you need to have a valid retention config for the GraphiteMergeTree table. Then graphite-clickhouse returns valid None-points for any format.

Felixoid avatar Jan 14 '22 09:01 Felixoid

Humm. Possibly i don't understand correctly what is "valid" retention config in this case. Here is my configuration

SELECT *
FROM system.graphite_retentions

Query id: b3b2f340-71fe-4b5d-851c-f4ebd664beb7

┌─config_name─────┬─regexp─┬─function─┬─age─┬─precision─┬─priority─┬─is_default─┬─Tables.database─┬─Tables.table───┐                                                                                       
│ graphite_rollup │        │ any      │   0 │        60 │    65535 │          1 │ ['graphite']    │ ['data_local'] │                                                                                       
└─────────────────┴────────┴──────────┴─────┴───────────┴──────────┴────────────┴─────────────────┴────────────────┘                                                                                       

1 rows in set. Elapsed: 0.002 sec.

rollup config in /etc/clickhouse-server/config.d

<yandex>
        <graphite_rollup>
        <!-- default -->
                <default>
                        <function>any</function>
                        <retention>
                                <age>0</age>
                                <precision>60</precision>
                        </retention>
                </default>
        </graphite_rollup>
</yandex>

graphite-clickhouse configured to use table with Distributed engine (since it allow to avoid seding query request to the stale replica)

[clickhouse]
...
 index-table = "graphite.index"
...
 internal-aggregation = false

[[data-table]]
 table = "graphite.data"
...
 # custom rollup.xml file for table, 'auto' and 'none' are allowed as well
 rollup-conf = "auto"
 # custom table for 'rollup-conf=auto', useful for Distributed or MatView
 rollup-auto-table = "graphite.data_local"
 # is used when none of rules match
 rollup-default-precision = 0
 # is used when none of rules match
 rollup-default-function = ""
 # should be set to true if you don't have reverted regexps in rollup-conf for reversed tables
 rollup-use-reverted = false

With this configuration, graphite-clickhouse issuing following queries:

SELECT
    regexp,
    function,
    age,
    precision,
    is_default
FROM system.graphite_retentions
ARRAY JOIN Tables AS table
WHERE (table.database = 'graphite') AND (table.table = 'data_local')
ORDER BY
    is_default ASC,
    priority ASC,
    regexp ASC,
    age ASC
FORMAT JSON

Query id: 82e5d1c5-2178-473b-bfab-ffe6c5eae3e6

{
        "meta":
        [
                {
                        "name": "regexp",
                        "type": "String"
                },
                {
                        "name": "function",
                        "type": "String"
                },
                {
                        "name": "age",
                        "type": "UInt64"
                },
                {
                        "name": "precision",
                        "type": "UInt64"
                },
                {
                        "name": "is_default",
                        "type": "UInt8"
                }
        ],

        "data":
        [
                {
                        "regexp": "",
                        "function": "any",
                        "age": "0",
                        "precision": "60",
                        "is_default": 1
                }

        ],

        "rows": 1,

        "statistics":
        {
                "elapsed": 0.000710489,
                "rows_read": 1,
                "bytes_read": 116
        }
}

1 rows in set. Elapsed: 0.023 sec.

Now, related to https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/graphitemergetree/#rollup-configuration

age – The minimum age of the data in seconds. precision– How precisely to define the age of the data in seconds. Should be a divisor for 86400 (seconds in a day). function – The name of the aggregating function to apply to data whose age falls within the range [age, age + precision]. Accepted functions: min / max / any / avg. The average is calculated imprecisely, like the average of the averages.

age defining minumum age of data, in my case it's 0 which means "all data" precision - i'd like to keep one value for each 60s function - i use any

purpose of this config - keep only 1 value for each minute, if more that 1 value received within a 60s interval, and keep this data with 1-minute resolution without re-sampling (rollup).

Just in case, CH version: 21.8.12.29 graphite-clickhouse: 0.13.2

table's schema

CREATE TABLE graphite.data_local
(
    `Path` String CODEC(ZSTD(16)),
    `Value` Float64 CODEC(Gorilla, ZSTD(16)),
    `Time` UInt32 CODEC(DoubleDelta),
    `Date` Date CODEC(DoubleDelta, LZ4HC(9)),
    `Timestamp` UInt32 CODEC(DoubleDelta)
)
ENGINE = ReplicatedGraphiteMergeTree('/clickhouse/tables/401246e0-d1af-4835-8012-46e0d1afa835/{shard}', '{replica}', 'graphite_rollup')
PARTITION BY toYYYYMM(Date)
PRIMARY KEY (Path, round(Time, -5))
ORDER BY (Path, round(Time, -5), Time)
SETTINGS index_granularity = 8192

Maybe i missing something or misunderstanding the documentation? I'd be grateful for any hint :)

PhantomPhreak avatar Jan 14 '22 11:01 PhantomPhreak

Ok, sorry for my previous comment, I haven't slept enough today and got the issue differently, because you have a distributed table graphite.data, but the table in graphite_rollup is graphite.data_local.

So, the case is legal, although it would very significantly change the default behavior, and would be unexpected for the most of users. Currently even is old metrics are in index, they aren't listed in a metrics list https://github.com/go-graphite/graphite-clickhouse/issues/83

It's possible to add a setting and fill-up missing metrics, though.

Since the project doesn't have an active maintainer currently, it makes sense to consider a PR. We can discuss it in https://t.me/ru_go_graphite :wink:

Felixoid avatar Jan 14 '22 16:01 Felixoid

No problems, i haven't sleep enough as well and thought i was missing something :sweat_smile: Yeah, for backward compatibility this logic can be added by additional setting

PhantomPhreak avatar Jan 17 '22 15:01 PhantomPhreak