
Metric labels and values in Loki logs

deathalt opened this issue 1 year ago • 21 comments

qryn's Prometheus metrics are duplicated in Loki requests, together with their labels and values.


It's not a problem when you can add a filter such as `{job="loki", level=~""}`, but if the logs don't have a `level` label, you have to come up with other ways to filter logs from metrics. And then the number of labels starts to grow for Loki as well.

deathalt avatar Jan 09 '24 14:01 deathalt

A change is upcoming to avoid this side effect. @akvlad will provide updates once it's ready.

lmangani avatar Jan 11 '24 11:01 lmangani

@deathalt the feature is in the branch https://github.com/metrico/qryn/tree/feat_423. The merge and release will be done shortly.

akvlad avatar Jan 26 '24 12:01 akvlad

Just reporting in: I updated our qryn instance to 3.1.5-bun and we're still seeing this issue.

bcarlock-mycarrier avatar Jan 29 '24 17:01 bcarlock-mycarrier

Should be fixed in 3.2.0

akvlad avatar Jan 29 '24 19:01 akvlad

@akvlad I updated to 3.2.0 and we're still seeing metrics appear as logs.

bcarlock-mycarrier avatar Jan 29 '24 22:01 bcarlock-mycarrier

Thanks for the report, @bcarlock-mycarrier. Could you provide some examples for a better understanding?

lmangani avatar Jan 29 '24 23:01 lmangani

Sure: [deployment screenshots attached]

bcarlock-mycarrier avatar Jan 29 '24 23:01 bcarlock-mycarrier

I'd be happy to provide more information if you can tell me what might be helpful. Right now we run two separate qryn instances, one for metrics and one for logs, in order to avoid this problem.
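
For reference, a minimal sketch of that kind of split, assuming a Docker Compose deployment and the `qxip/qryn` image (both are assumptions, not details from this thread):

```yaml
# Hypothetical two-instance split keeping metrics and logs apart.
# Image name and database names are illustrative assumptions.
services:
  qryn-logs:
    image: qxip/qryn
    environment:
      CLICKHOUSE_SERVER: clickhouse
      CLICKHOUSE_DB: qryn_logs      # dedicated database for log data
    ports:
      - "3100:3100"                 # queried by the Loki datasource
  qryn-metrics:
    image: qxip/qryn
    environment:
      CLICKHOUSE_SERVER: clickhouse
      CLICKHOUSE_DB: qryn_metrics   # dedicated database for metric data
    ports:
      - "3101:3100"                 # queried by the Prometheus datasource
```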

bcarlock-mycarrier avatar Jan 29 '24 23:01 bcarlock-mycarrier

@bcarlock-mycarrier @deathalt let me recap what was changed in 3.2.0.

An extra env config was added to qryn: DISTINGUISH_LOGS_METRICS=1. Once this environment variable is set, newly ingested data is distinguished between logs and metrics. All pre-existing data is still present in both datasources. Please recheck the config env vars for qryn.

I believe it appears somewhere in the new documentation; I will recheck and add it if needed. Sorry for the inconvenience.

akvlad avatar Jan 30 '24 08:01 akvlad

@akvlad

I'm seeing the same behavior.

```yaml
CLICKHOUSE_SERVER: chproxy.chproxy.svc
CLICKHOUSE_PORT: 9090
CLICKHOUSE_DB: qryn
CLICKHOUSE_AUTH: default:
CLICKHOUSE_PROTO: http
CLICKHOUSE_TIMEFIELD: record_datetime
CLUSTER_NAME: shard
BULK_MAXAGE: 4000
BULK_MAXSIZE: 10000000
BULK_MAXCACHE: 100000
LABELS_DAYS: 7
SAMPLES_DAYS: 7
HOST: 0.0.0.0
PORT: 3100
FASTIFY_BODYLIMIT: 5242880
FASTIFY_REQUESTTIMEOUT: 30000
FASTIFY_MAXREQUESTS: 0
FASTIFY_METRICS: true
NODE_OPTIONS: "--max-old-space-size=61440"
DISTINGUISH_LOGS_METRICS: 1
```

Fresh metrics are still showing up in the logs. [screenshot attached]

deathalt avatar Jan 30 '24 11:01 deathalt

DISTINGUISH_LOGS_METRICS=1 seems to have solved the problem for us. Which brings up the question: under what circumstances would you want this turned off? It doesn't seem like the sort of thing you'd usually put behind a feature flag.

bcarlock-mycarrier avatar Jan 30 '24 14:01 bcarlock-mycarrier

@bcarlock-mycarrier I think it's safe to say that once we're past the testing confirmations it will become the default.

lmangani avatar Jan 30 '24 15:01 lmangani

Makes sense. I'm not sure if this is related, but I'm seeing somewhat reduced query performance after enabling it.

bcarlock-mycarrier avatar Jan 30 '24 15:01 bcarlock-mycarrier

Interesting. @akvlad, is any performance impact expected? Or might this be due to the "dual-stack" support of both old and new types?

lmangani avatar Jan 30 '24 15:01 lmangani

The type support adds an extra IN clause to the SQL. IN is incredibly fast in ClickHouse, on the order of 100M comparisons per second, but it still has some cost.
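
The generated SQL pasted later in this thread shows this predicate; reproduced here for clarity, its shape is roughly:

```sql
-- Label lookup as logged further down in this thread; the
-- `type IN (...)` predicate is the extra logs/metrics filter.
SELECT DISTINCT key
FROM time_series_gin
WHERE date >= toDate(FROM_UNIXTIME(intDiv(1706612340000000000, 1000000000)))
  AND date <= toDate(FROM_UNIXTIME(intDiv(1706634000000000000, 1000000000)))
  AND type IN (0, 0)
FORMAT JSON
```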

akvlad avatar Jan 30 '24 16:01 akvlad

The most noticeable performance decrease is actually in populating searchable labels and cross-referencing them in Grafana. Is any caching possible? The GIF below doesn't illustrate it perfectly due to the frame rate, but it should give you some idea; the initial load of the Metrics Browser link takes much longer than it feels in the GIF. [GIF attached]

bcarlock-mycarrier avatar Jan 30 '24 16:01 bcarlock-mycarrier

@bcarlock-mycarrier I see. How many time_series do you have?

```sql
SELECT count(DISTINCT fingerprint) FROM time_series
```

It would also be great to get the /series latency from you. Please find the /api/v1/series request in the logs:

{"level":30,"time":1706633214417,"pid":19,"hostname":"hromozeka-PC","name":"qryn","reqId":"req-e","req":{"method":"GET","url":"/api/v1/series?start=1706629615&end=1706633215&match%5B%5D=system_cpu_time_seconds_total%7B%7D","hostname":"localhost:3100","remoteAddress":"127.0.0.1","remotePort":38574},"msg":"incoming request"}
{"level":30,"time":1706633214469,"pid":19,"hostname":"hromozeka-PC","name":"qryn","reqId":"req-e","res":{"statusCode":200},"responseTime":51.44048300012946,"msg":"request completed"}

It would be nice to see the `"responseTime":51.44048300012946` value for the /api/v1/series requests.

Or if you have nginx or something in front, please share the timings.

akvlad avatar Jan 30 '24 16:01 akvlad

We just turned this back on yesterday. The current count is 232,913, so not a ton honestly.

{"level":30,"time":1706633940075,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1i","req":{"method":"GET","url":"/api/v1/label/__name__/values?start=1706612340&end=1706634000","hostname":"qryn-metrics-reader.monitoring:3100","remoteAddress":"10.2.2.247","remotePort":34502},"msg":"incoming request"}
{"level":20,"time":1706633940075,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1i","msg":"GET /api/prom/label/__name__/values"}
{"level":30,"time":1706633940779,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1i","res":{"statusCode":200},"responseTime":703.5259080082178,"msg":"request completed"}
{"level":30,"time":1706633942002,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1j","req":{"method":"GET","url":"/api/v1/labels?start=1706612340&end=1706634000","hostname":"qryn-metrics-reader.monitoring:3100","remoteAddress":"10.2.2.247","remotePort":34502},"msg":"incoming request"}
{"level":20,"time":1706633942003,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1j","msg":"GET /loki/api/v1/label"}
SELECT DISTINCT key FROM time_series_gin WHERE date >= toDate(FROM_UNIXTIME(intDiv(1706612340000000000, 1000000000))) AND date <= toDate(FROM_UNIXTIME(intDiv(1706634000000000000, 1000000000))) AND type IN (0,0) FORMAT JSON
{"level":30,"time":1706633942393,"pid":24,"hostname":"qryn-metrics-reader-9977db669-72jv9","name":"qryn","reqId":"req-1j","res":{"statusCode":200},"responseTime":390.80939300358295,"msg":"request completed"}

bcarlock-mycarrier avatar Jan 30 '24 17:01 bcarlock-mycarrier

I have a quick addition to this issue. Since updating and enabling DISTINGUISH_LOGS_METRICS we've also noticed some strangeness in the Loki interface's label correlation: the first label selected is always a label from logging, the second is always what appears to be a metric tag, and the third tends to be a log label again. [GIF attached]

bcarlock-mycarrier avatar Feb 07 '24 16:02 bcarlock-mycarrier

Thanks for the report @bcarlock-mycarrier 👍 @akvlad and the team will investigate and provide feedback

lmangani avatar Feb 07 '24 17:02 lmangani

@bcarlock-mycarrier we cannot reproduce this condition. Please open a separate issue with all the case details. If you're running multiple instances or multiple readers/writers, make sure they all use the same settings.

lmangani avatar Feb 08 '24 15:02 lmangani

This should be resolved in the latest releases. Please reopen if the issue is still relevant.

lmangani avatar Mar 19 '24 20:03 lmangani

The problem with metric labels in logs has remained.

mk-raven avatar Mar 22 '24 10:03 mk-raven