Metrics labels and values in Loki logs
qryn's Prometheus metrics are duplicated in Loki requests, along with their labels and values.
It's not a problem when you can add a filter such as {job="loki", level=~""}, but if the log doesn't have a level, you have to come up with other ways to filter logs from metrics. And then the number of labels starts to grow for Loki as well.
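For illustration, the workaround is an extra matcher on a label that only real log lines carry. A minimal sketch, assuming the log streams set a non-empty level label (the label names are examples, not from this deployment):
{job="loki", level=~".+"}
That selector keeps only streams where level exists and is non-empty, which is exactly what breaks down when logs don't carry a level.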
A change is upcoming to avoid this side effect. @akvlad will provide updates once ready.
@deathalt the feature is inside the branch https://github.com/metrico/qryn/tree/feat_423. Merge and release will be done shortly.
Just reporting in: I updated our qryn instance to 3.1.5-bun and we're still seeing this issue.
Should be fixed in 3.2.0
@akvlad I updated to 3.2.0 and we're still seeing metrics appear as logs.
Thanks for the report @bcarlock-mycarrier. Could you provide some examples for a better understanding?
Sure:
deployment -
I'd be happy to provide more information if you can tell me what might be helpful. Right now we have two different qryn instances, one for metrics, one for logs in order to avoid this problem.
@bcarlock-mycarrier @deathalt let me recap what was changed in 3.2.0.
An extra env config option was added to qryn: DISTINGUISH_LOGS_METRICS=1.
Once the environment variable is set, newly ingested data starts to be distinguished between logs and metrics. All the existing data is still present in both datasources. Please recheck the config env vars for qryn.
I believe this appeared somewhere in the new documentation. I will recheck and add it if needed. Sorry for the inconvenience.
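A minimal sketch of enabling it, assuming a plain Docker deployment (the image name and the ClickHouse address are placeholders, not taken from this thread):
docker run -d -p 3100:3100 \
  -e CLICKHOUSE_SERVER=clickhouse.example.internal \
  -e DISTINGUISH_LOGS_METRICS=1 \
  <qryn-image>:latest
In a Kubernetes-style env block it is simply DISTINGUISH_LOGS_METRICS: 1, as in the config pasted in the next comment.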
@akvlad
I have the same behavior.
CLICKHOUSE_SERVER: chproxy.chproxy.svc
CLICKHOUSE_PORT: 9090
CLICKHOUSE_DB: qryn
CLICKHOUSE_AUTH: default:
CLICKHOUSE_PROTO: http
CLICKHOUSE_TIMEFIELD: record_datetime
CLUSTER_NAME: shard
BULK_MAXAGE: 4000
BULK_MAXSIZE: 10000000
BULK_MAXCACHE: 100000
LABELS_DAYS: 7
SAMPLES_DAYS: 7
HOST: 0.0.0.0
PORT: 3100
FASTIFY_BODYLIMIT: 5242880
FASTIFY_REQUESTTIMEOUT: 30000
FASTIFY_MAXREQUESTS: 0
FASTIFY_METRICS: true
NODE_OPTIONS: "--max-old-space-size=61440"
DISTINGUISH_LOGS_METRICS: 1
Fresh metrics are still appearing in the logs.
DISTINGUISH_LOGS_METRICS=1 seems to have solved the problem for us, which brings up the question: under what circumstances would you want this turned off? It doesn't seem like the sort of thing you'd usually put behind a feature flag.
@bcarlock-mycarrier I think it's safe to say that once we're past the testing confirmations it will become a default.
Makes sense. I'm not sure if this is related or not, but I'm seeing somewhat reduced query performance after enabling that.
Interesting. @akvlad, is any performance impact expected? Or might this be due to the "dual-stack" support of both old and new types?
The type support is an extra IN in terms of the SQL. IN is incredibly fast in ClickHouse, on the order of 100M comparisons a second, but it still has its cost.
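A sketch of how that looks in a generated label query (the shape follows the query logged later in this thread; the numeric type values are internal to qryn and not spelled out here):
SELECT DISTINCT key
FROM time_series_gin
WHERE date >= toDate(FROM_UNIXTIME(intDiv(1706612340000000000, 1000000000)))
  AND date <= toDate(FROM_UNIXTIME(intDiv(1706634000000000000, 1000000000)))
  AND type IN (0,0) -- the extra IN that separates logs from metrics
FORMAT JSON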
The most noticeable performance decrease is actually in populating searchable labels and cross-referencing them in Grafana. Is there any caching possible? Here's a gif example. The gif below doesn't illustrate it perfectly due to frame rate, but it should give you some idea. The initial load of the Metrics Browser link is much longer than it feels in the gif.
@bcarlock-mycarrier I see. How many time_series do you have?
SELECT count(distinct fingerprint) from time_series
And it would be great to get the /series latency from you. Please find the /api/v1/series request in the logs:
{"level":30,"time":1706633214417,"pid":19,"hostname":"hromozeka-PC","name":"qryn","reqId":"req-e","req":{"method":"GET","url":"/api/v1/series?start=1706629615&end=1706633215&match%5B%5D=system_cpu_time_seconds_total%7B%7D","hostname":"localhost:3100","remoteAddress":"127.0.0.1","remotePort":38574},"msg":"incoming request"}
{"level":30,"time":1706633214469,"pid":19,"hostname":"hromozeka-PC","name":"qryn","reqId":"req-e","res":{"statusCode":200},"responseTime":51.44048300012946,"msg":"request completed"}
It would be nice to see "responseTime":51.44048300012946 for the /api/v1/series requests.
Or if you have nginx or something in front, please share the timings.
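If it helps, one quick way to capture that latency from the client side (host, port, and the match[] value are placeholders for your setup) is curl's built-in timing:
curl -s -o /dev/null -w "series total: %{time_total}s\n" \
  "http://localhost:3100/api/v1/series?start=1706629615&end=1706633215&match[]=up"
The responseTime field in the qryn log lines is the server-side counterpart of that number.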
We just turned this back on yesterday. The current count is 232,913, so not a ton honestly.
{"level":30,"time":1706633940075,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1i","req":{"method":"GET","url":"/api/v1/label/__name__/values?start=1706612340&end=1706634000","hostname":"qryn-metrics-reader.monitoring:3100","remoteAddress":"10.2.2.247","remotePort":34502},"msg":"incoming request"}
{"level":20,"time":1706633940075,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1i","msg":"GET /api/prom/label/__name__/values"}
{"level":30,"time":1706633940779,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1i","res":{"statusCode":200},"responseTime":703.5259080082178,"msg":"request completed"}
{"level":30,"time":1706633942002,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1j","req":{"method":"GET","url":"/api/v1/labels?start=1706612340&end=1706634000","hostname":"qryn-metrics-reader.monitoring:3100","remoteAddress":"10.2.2.247","remotePort":34502},"msg":"incoming request"}
{"level":20,"time":1706633942003,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1j","msg":"GET /loki/api/v1/label"}
SELECT DISTINCT key FROM time_series_gin WHERE date >= toDate(FROM_UNIXTIME(intDiv(1706612340000000000, 1000000000))) AND date <= toDate(FROM_UNIXTIME(intDiv(1706634000000000000, 1000000000))) AND type IN (0,0) FORMAT JSON
{"level":30,"time":1706633942393,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1j","res":{"statusCode":200},"responseTime":390.80939300358295,"msg":"request completed"}
I have a quick addition to this issue. Since updating and enabling DISTINGUISH_LOGS_METRICS, we've also noticed some strangeness related to label correlation in the Loki interface. The first label selected is always a label from logging. The second label is always what appears to be a metric tag. The third label tends to be a log label again.
Thanks for the report @bcarlock-mycarrier 👍 @akvlad and the team will investigate and provide feedback
@bcarlock-mycarrier we cannot reproduce this condition. Please open a separate issue with all the case details. If you're running multiple instances or multiple readers/writers, make sure they all have the same settings.
This should be resolved in the latest releases. Please reopen if the issue is still relevant.
The problem with metric labels in the logs has remained.