ORDER BY/ PRIMARY KEY for `samples` table
I see that the primary key is chosen incorrectly for the `samples` table: https://github.com/metrico/qryn/blob/161117c/lib/db/maintain/scripts.js#L13
This means that within one partition there will be completely different records with different fingerprints. In such a situation, effective compression is impossible because the content is heterogeneous.
Moreover, fingerprint search is not efficient at all. For some fingerprints, I have hundreds of millions of log lines in a day. As a result, when searching for other fingerprints in the same time range, millions of spurious lines are scanned.
I'm sure the tuples should be ordered like this:

```sql
ORDER BY (fingerprint, timestamp_ns)
```
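For illustration, a minimal DDL sketch of what such a table could look like. This is not the exact qryn schema from `scripts.js` (column set, codecs, and partitioning are assumptions), it only demonstrates the proposed sort key:

```sql
-- Sketch only: not the actual qryn schema, just the proposed ORDER BY.
CREATE TABLE samples_sketch
(
    fingerprint  UInt64,
    timestamp_ns Int64  CODEC(DoubleDelta),
    value        Float64 CODEC(Gorilla),
    string       String
)
ENGINE = MergeTree
PARTITION BY toDate(toDateTime(intDiv(timestamp_ns, 1000000000)))
-- fingerprint first: rows of one series become contiguous, which helps
-- both compression (homogeneous runs) and point lookups by fingerprint.
ORDER BY (fingerprint, timestamp_ns);
```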
OK. Let's check a complicated ClickHouse request and discuss which primary key we should use. The LogQL query:

```logql
sum(rate({forwarder="vector"} | regexp "appname\":\"(?<a>[^\"]+)" [1m])) by (a)
```

translates to:
```sql
WITH
    idx_sel AS
    (
        SELECT sel_1.fingerprint
        FROM
        (
            SELECT fingerprint
            FROM cloki.time_series_gin
            WHERE (key = 'forwarder') AND (val = 'vector')
        ) AS sel_1
    ),
    rate_a AS
    (
        SELECT
            samples.string AS string,
            samples.fingerprint AS fingerprint,
            intDiv(samples.timestamp_ns, 1000000) AS timestamp_ns,
            JSONExtractKeysAndValues(time_series.labels, 'String') AS labels,
            arrayFilter(x -> (((x.1) != '') AND ((x.2) != '')), arrayZip(['a'], arrayMap(x -> (x[length(x)]), extractAllGroupsHorizontal(string, 'appname":"([^"]+)')))) AS extra_labels
        FROM cloki.samples_v3 AS samples
        ANY LEFT JOIN cloki.time_series ON samples.fingerprint = time_series.fingerprint
        WHERE ((samples.timestamp_ns >= 1663167960000000000) AND (samples.timestamp_ns <= 1663168320000000000)) AND (samples.fingerprint IN (idx_sel))
        ORDER BY timestamp_ns DESC
    ),
    rate_b AS
    (
        SELECT
            arraySort(arrayConcat(arrayFilter(x -> (arrayExists(y -> ((y.1) = (x.1)), extra_labels) = 0), labels), extra_labels)) AS labels,
            intDiv(timestamp_ns, 60000) * 60000 AS timestamp_ns,
            (toFloat64(count(1)) * 1000) / 60000 AS value
        FROM rate_a
        GROUP BY labels, timestamp_ns
        ORDER BY labels ASC, timestamp_ns ASC
    ),
    agg_a AS
    (
        SELECT
            labels,
            intDiv(timestamp_ns, 100000) * 100000 AS timestamp_ns,
            argMin(rate_b.value, rate_b.timestamp_ns) AS value
        FROM rate_b
        GROUP BY labels, timestamp_ns
        ORDER BY labels ASC, timestamp_ns ASC
    )
SELECT
    arrayFilter(x -> ((x.1) IN ('a')), labels) AS labels,
    timestamp_ns,
    sum(value) AS value
FROM agg_a
GROUP BY labels, timestamp_ns
ORDER BY labels ASC, timestamp_ns ASC
```
The ordering @R-omk is suggesting used to exist. It was removed in commit https://github.com/metrico/qryn/commit/496195f6f0115fce7bbb305a6f4db280b8b34dff about a year ago because of OOM errors on such complicated requests.
Actually, I have lately been thinking about a smarter approach to this.
Let's say we keep a more deterministic part of the fingerprint in another field: an 8-bit hash of a subgroup of labels, chosen based on your recent requests. For example, you frequently search by the `{app="myapp"}`
tag. Why can't the software remember that and optimize the newly incoming data accordingly?
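A minimal sketch of that idea in ClickHouse SQL. The column name `label_shard` and the choice of hash are hypothetical, purely to illustrate "a small deterministic prefix derived from a hot label":

```sql
-- Hypothetical: an 8-bit hash of a frequently queried label value.
SELECT cityHash64('myapp') % 256 AS label_shard;

-- If such a column existed on the samples table, a sort key like
--   ORDER BY (label_shard, fingerprint, timestamp_ns)
-- would cluster all rows of the hot label together while keeping
-- fingerprint cardinality inside each shard low.
```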
@akvlad, please explain what exactly the problem is and under what conditions it occurs.
--
> Let's say we have a more deterministic part.
As an option, some labels could be excluded from the fingerprint while still being kept in `time_series` and the GIN table as a second cascade of the index. That approach may yield lower cardinality for fingerprints (if that is the problem that caused the schema change).
```sql
rate_a AS
(
    SELECT
        samples.string AS string,
        samples.fingerprint AS fingerprint,
        intDiv(samples.timestamp_ns, 1000000) AS timestamp_ns,
        JSONExtractKeysAndValues(time_series.labels, 'String') AS labels,
        arrayFilter(x -> (((x.1) != '') AND ((x.2) != '')), arrayZip(['a'], arrayMap(x -> (x[length(x)]), extractAllGroupsHorizontal(string, 'appname":"([^"]+)')))) AS extra_labels
    FROM cloki.samples_v3 AS samples
    ANY LEFT JOIN cloki.time_series ON samples.fingerprint = time_series.fingerprint
    WHERE ((samples.timestamp_ns >= 1663167960000000000) AND (samples.timestamp_ns <= 1663168320000000000)) AND (samples.fingerprint IN (idx_sel))
    ORDER BY timestamp_ns DESC
),
```
It seems that the left join is not needed here at all; the labels can be retrieved in a separate subquery via `IN (rate_a.fingerprint)`.
In addition, the subset of labels needs to be limited and duplicates removed.
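A hedged sketch of that rewrite, reusing the table names and the GIN filter from the query above (the exact column list is an assumption):

```sql
-- Sketch: scan samples without the join; labels are resolved separately.
SELECT fingerprint,
       intDiv(timestamp_ns, 1000000) AS timestamp_ms,
       string
FROM cloki.samples_v3
WHERE fingerprint IN
      (SELECT fingerprint FROM cloki.time_series_gin
       WHERE key = 'forwarder' AND val = 'vector')
  AND timestamp_ns BETWEEN 1663167960000000000 AND 1663168320000000000;

-- Labels resolved once per matching fingerprint, deduplicated,
-- instead of being joined onto every sample row.
SELECT DISTINCT fingerprint, labels
FROM cloki.time_series
WHERE fingerprint IN
      (SELECT fingerprint FROM cloki.time_series_gin
       WHERE key = 'forwarder' AND val = 'vector');
```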
It is difficult to predict how the optimizer will handle such a large query. I would probably use sessions and temporary tables with a specialized engine for working with sets and/or joins. That makes the execution plan more predictable, including when working with sharding.
For example, it is not obvious to me whether `arrayFilter` (in `rate_b`) will be executed before the join or after; if after, that is a lot of unnecessary work.
It should be assumed that `time_series` holds disproportionately less data than `samples`.
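The temporary-table idea above can be sketched with ClickHouse's `Set` engine (the table name `fp_set` is hypothetical; a session or `session_id` is assumed so the temporary table persists across statements):

```sql
-- Hypothetical session-scoped set of matching fingerprints.
CREATE TEMPORARY TABLE fp_set (fingerprint UInt64) ENGINE = Set;

INSERT INTO fp_set
SELECT fingerprint
FROM cloki.time_series_gin
WHERE key = 'forwarder' AND val = 'vector';

-- Later filters become a predictable membership check against the set,
-- independent of how the optimizer would have planned the subquery.
SELECT count()
FROM cloki.samples_v3
WHERE fingerprint IN fp_set
  AND timestamp_ns BETWEEN 1663167960000000000 AND 1663168320000000000;
```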
@R-omk After some discussion we created a custom env parameter for advanced users: https://github.com/metrico/qryn/pull/216 So you can create a new db with the correct primary key. Feel free to test.
The fix was added to the beta release (2.1.58-beta)
Feel free to reopen if this is still of interest for 3.x.