ORDER BY/ PRIMARY KEY for `samples` table
I see that the primary key is chosen incorrectly for the `samples` table: https://github.com/metrico/qryn/blob/161117c/lib/db/maintain/scripts.js#L13
This means that within one partition there will be completely different records with different fingerprints. In such a situation, effective compression is impossible because the content is heterogeneous.
Moreover, fingerprint search is not efficient at all. For some fingerprints, I have hundreds of millions of log lines in a day. As a result, when searching for other fingerprints in the same time range, millions of spurious lines are scanned.
I'm sure the tuples should be ordered like this:

```sql
ORDER BY (fingerprint, timestamp_ns)
```
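For illustration, a minimal DDL sketch of what such a table could look like. This is not the exact qryn schema from `scripts.js` (column set, codecs, and partitioning are assumptions), it only demonstrates the proposed sort key:

```sql
-- Sketch only: not the actual qryn schema, just the proposed ORDER BY.
CREATE TABLE samples_sketch
(
    fingerprint  UInt64,
    timestamp_ns Int64  CODEC(DoubleDelta),
    value        Float64 CODEC(Gorilla),
    string       String
)
ENGINE = MergeTree
PARTITION BY toDate(toDateTime(intDiv(timestamp_ns, 1000000000)))
-- fingerprint first: rows of one series become contiguous, which helps
-- both compression (homogeneous runs) and point lookups by fingerprint.
ORDER BY (fingerprint, timestamp_ns);
```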
OK. Let's check a complicated ClickHouse request and discuss which primary key we should use. The LogQL query:

```logql
sum(rate({forwarder="vector"} | regexp "appname\":\"(?<a>[^\"]+)" [1m])) by (a)
```

translates to:
```sql
WITH
    idx_sel AS
    (
        SELECT sel_1.fingerprint
        FROM
        (
            SELECT fingerprint
            FROM cloki.time_series_gin
            WHERE (key = 'forwarder') AND (val = 'vector')
        ) AS sel_1
    ),
    rate_a AS
    (
        SELECT
            samples.string AS string,
            samples.fingerprint AS fingerprint,
            intDiv(samples.timestamp_ns, 1000000) AS timestamp_ns,
            JSONExtractKeysAndValues(time_series.labels, 'String') AS labels,
            arrayFilter(x -> (((x.1) != '') AND ((x.2) != '')), arrayZip(['a'], arrayMap(x -> (x[length(x)]), extractAllGroupsHorizontal(string, 'appname":"([^"]+)')))) AS extra_labels
        FROM cloki.samples_v3 AS samples
        ANY LEFT JOIN cloki.time_series ON samples.fingerprint = time_series.fingerprint
        WHERE ((samples.timestamp_ns >= 1663167960000000000) AND (samples.timestamp_ns <= 1663168320000000000)) AND (samples.fingerprint IN (idx_sel))
        ORDER BY timestamp_ns DESC
    ),
    rate_b AS
    (
        SELECT
            arraySort(arrayConcat(arrayFilter(x -> (arrayExists(y -> ((y.1) = (x.1)), extra_labels) = 0), labels), extra_labels)) AS labels,
            intDiv(timestamp_ns, 60000) * 60000 AS timestamp_ns,
            (toFloat64(count(1)) * 1000) / 60000 AS value
        FROM rate_a
        GROUP BY labels, timestamp_ns
        ORDER BY labels ASC, timestamp_ns ASC
    ),
    agg_a AS
    (
        SELECT
            labels,
            intDiv(timestamp_ns, 100000) * 100000 AS timestamp_ns,
            argMin(rate_b.value, rate_b.timestamp_ns) AS value
        FROM rate_b
        GROUP BY labels, timestamp_ns
        ORDER BY labels ASC, timestamp_ns ASC
    )
SELECT
    arrayFilter(x -> ((x.1) IN ('a')), labels) AS labels,
    timestamp_ns,
    sum(value) AS value
FROM agg_a
GROUP BY labels, timestamp_ns
ORDER BY labels ASC, timestamp_ns ASC
```
The ordering @R-omk is suggesting used to exist. It was removed in commit https://github.com/metrico/qryn/commit/496195f6f0115fce7bbb305a6f4db280b8b34dff about a year ago because of OOM errors on such complicated requests.
Actually, I have lately been thinking about a smarter approach to this.
Let's say we keep a more deterministic part of the fingerprint in another field: an 8-bit hash of a subgroup of labels, chosen based on your recent requests. For example, you frequently search by the `{app="myapp"}`
tag. Why can't the software remember that and optimize the newly incoming data accordingly?
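A minimal sketch of that idea in ClickHouse SQL. The column name `label_shard` and the choice of hash are hypothetical, purely to illustrate "a small deterministic prefix derived from a hot label":

```sql
-- Hypothetical: an 8-bit hash of a frequently queried label value.
SELECT cityHash64('myapp') % 256 AS label_shard;

-- If such a column existed on the samples table, a sort key like
--   ORDER BY (label_shard, fingerprint, timestamp_ns)
-- would cluster all rows of the hot label together while keeping
-- fingerprint cardinality inside each shard low.
```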
@akvlad, please explain what exactly the problem is and under what conditions it occurs.
--
> Let's say we have a more deterministic part.
As an option, some labels could be excluded from the fingerprint while still being kept in `time_series` and the GIN table as a second cascade of the index. That approach may yield lower cardinality for fingerprints (if that is the problem that caused the schema change).
```sql
rate_a AS
(
    SELECT
        samples.string AS string,
        samples.fingerprint AS fingerprint,
        intDiv(samples.timestamp_ns, 1000000) AS timestamp_ns,
        JSONExtractKeysAndValues(time_series.labels, 'String') AS labels,
        arrayFilter(x -> (((x.1) != '') AND ((x.2) != '')), arrayZip(['a'], arrayMap(x -> (x[length(x)]), extractAllGroupsHorizontal(string, 'appname":"([^"]+)')))) AS extra_labels
    FROM cloki.samples_v3 AS samples
    ANY LEFT JOIN cloki.time_series ON samples.fingerprint = time_series.fingerprint
    WHERE ((samples.timestamp_ns >= 1663167960000000000) AND (samples.timestamp_ns <= 1663168320000000000)) AND (samples.fingerprint IN (idx_sel))
    ORDER BY timestamp_ns DESC
),
```
It seems that the left join is not needed here at all; the labels can be retrieved in a separate subquery via `IN (rate_a.fingerprint)`.
In addition, the subset of labels needs to be limited and duplicates removed.
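A hedged sketch of that rewrite, reusing the table names and the GIN filter from the query above (the exact column list is an assumption):

```sql
-- Sketch: scan samples without the join; labels are resolved separately.
SELECT fingerprint,
       intDiv(timestamp_ns, 1000000) AS timestamp_ms,
       string
FROM cloki.samples_v3
WHERE fingerprint IN
      (SELECT fingerprint FROM cloki.time_series_gin
       WHERE key = 'forwarder' AND val = 'vector')
  AND timestamp_ns BETWEEN 1663167960000000000 AND 1663168320000000000;

-- Labels resolved once per matching fingerprint, deduplicated,
-- instead of being joined onto every sample row.
SELECT DISTINCT fingerprint, labels
FROM cloki.time_series
WHERE fingerprint IN
      (SELECT fingerprint FROM cloki.time_series_gin
       WHERE key = 'forwarder' AND val = 'vector');
```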
It is difficult to predict how the optimizer will handle such a large query. I would probably use sessions and temporary tables with a specialized engine for working with sets and/or joins. That makes the execution plan more predictable, including when working with sharding.
For example, it is not obvious to me whether `arrayFilter` (in `rate_b`) will be executed before the join or after; if after, that is a lot of unnecessary work.
It should be assumed that `time_series` holds disproportionately less data than `samples`.
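The temporary-table idea above can be sketched with ClickHouse's `Set` engine (the table name `fp_set` is hypothetical; a session or `session_id` is assumed so the temporary table persists across statements):

```sql
-- Hypothetical session-scoped set of matching fingerprints.
CREATE TEMPORARY TABLE fp_set (fingerprint UInt64) ENGINE = Set;

INSERT INTO fp_set
SELECT fingerprint
FROM cloki.time_series_gin
WHERE key = 'forwarder' AND val = 'vector';

-- Later filters become a predictable membership check against the set,
-- independent of how the optimizer would have planned the subquery.
SELECT count()
FROM cloki.samples_v3
WHERE fingerprint IN fp_set
  AND timestamp_ns BETWEEN 1663167960000000000 AND 1663168320000000000;
```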
@R-omk After some discussion we created a custom env parameter for advanced users: https://github.com/metrico/qryn/pull/216 So you can create a new db with the correct primary key. Feel free to test.
The fix was added to the beta release (2.1.58-beta)
Feel free to reopen if this is still of interest for 3.x.