
Memcached does not seem to be working in Loki 2.9.7


Describe the bug: I tried to add memcached for chunks and query results to my Loki cluster, but according to memcached_exporter it does not seem to be used at all. I think something is misconfigured, but I cannot tell where; there are no obvious errors in the logs.

To Reproduce: configure Loki as below:

    chunk_store_config:
      chunk_cache_config:
        memcached:
          batch_size: 256
          parallelism: 10
        memcached_client:
          host: 'loki-memcached-chunks.<namespace>.svc.cluster.local'
          service: 'loki-memcached-chunks'
          addresses: 'loki-memcached-chunks.<namespace>.svc.cluster.local:11211'

    query_range:
      align_queries_with_step: true
      cache_results: true
      results_cache:
        cache:
          default_validity: 12h
          memcached_client:
            host: 'loki-memcached-results.<namespace>.svc.cluster.local'
            service: 'loki-memcached-results'
            addresses: 'loki-memcached-results.<namespace>.svc.cluster.local:11211'
            timeout: 500ms

Expected behavior: Memcached should be caching chunks and query results.

Environment:

  • Infrastructure: K8S
  • Deployment tool: helm
  • Version: 2.9.7

Screenshots, Promtail config, or terminal output:

When I ran a query, this is what loki-read.log showed:

2024-04-21T18:33:58.183333503+08:00 stderr F level=warn ts=2024-04-21T10:33:58.091176695Z caller=experimental.go:20 msg="experimental feature in use" feature="DNS-based memcached service discovery"
2024-04-21T18:33:58.183340434+08:00 stderr F level=debug ts=2024-04-21T10:33:58.103712548Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.219.81:11211 count=1
2024-04-21T18:33:58.183344901+08:00 stderr F level=warn ts=2024-04-21T10:33:58.104945074Z caller=experimental.go:20 msg="experimental feature in use" feature="DNS-based memcached service discovery"
2024-04-21T18:33:58.18335424+08:00 stderr F level=debug ts=2024-04-21T10:33:58.1114272Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.22.143:11211 count=1
2024-04-21T18:34:58.182136789+08:00 stderr F level=debug ts=2024-04-21T10:34:58.107151558Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.219.81:11211 count=1
2024-04-21T18:34:58.182174134+08:00 stderr F level=debug ts=2024-04-21T10:34:58.120451469Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.22.143:11211 count=1
2024-04-21T18:35:58.181804079+08:00 stderr F level=debug ts=2024-04-21T10:35:58.106780861Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.219.81:11211 count=1
2024-04-21T18:35:58.181826795+08:00 stderr F level=debug ts=2024-04-21T10:35:58.119872665Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.22.143:11211 count=1
2024-04-21T18:36:58.182384859+08:00 stderr F level=debug ts=2024-04-21T10:36:58.111790906Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.219.81:11211 count=1
2024-04-21T18:36:58.182417735+08:00 stderr F level=debug ts=2024-04-21T10:36:58.119808487Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.22.143:11211 count=1

duj4 commented Apr 21 '24

Did you enable memcached? See this section of the values.yaml file for an example.

JStickler commented Apr 29 '24

@JStickler Thanks for the reply. Yes, I run memcached from the same chart as Loki and use the "addresses" config, but I am not sure whether the configuration is correct. Here is the config after the templates are rendered:

chunk_store_config:
  chunk_cache_config:
    memcached:
      batch_size: 256
      parallelism: 10
    memcached_client:
      host: ""
      service: "memcache"
      addresses: 'loki-memcached-chunks.<namespace>.svc.cluster.local:11211'

query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      default_validity: 12h
      memcached_client:
        host: ""
        service: "memcache"
        addresses: 'loki-memcached-results.<namespace>.svc.cluster.local:11211'
        timeout: 500ms

duj4 commented Apr 30 '24

There were a few bugs around cache writeback; check these PRs: https://github.com/grafana/loki/pull/12456 and https://github.com/grafana/loki/pull/11776

The last one is the most relevant. TL;DR: there was a bug in how the queue size was calculated, so after some time Loki simply stops using memcached, affecting both performance and chunk deduplication. You can check the metric loki_cache_background_queue_bytes; it will probably be stuck at a constant 500MB (the default).
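
For context, that 500MB is the default size limit of the background writeback queue, which sits under the cache's background block in cache_config. A rough sketch of the relevant settings, with approximate defaults that should be verified against the cache_config reference:

    chunk_store_config:
      chunk_cache_config:
        background:
          writeback_goroutines: 10      # workers flushing the queue to memcached (approximate default)
          writeback_buffer: 10000       # maximum number of queued entries (approximate default)
          writeback_size_limit: 500MB   # the cap that loki_cache_background_queue_bytes is measured against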

AndreZiviani commented May 09 '24

@AndreZiviani, thanks for the reply. It seems the issue was already fixed in 2.9.5, per https://github.com/grafana/loki/issues/11821?

duj4 commented May 10 '24

It was only mentioned in the 3.0 changelog, but it looks like you are right. Can you check your loki_cache_background_queue_bytes?

AndreZiviani commented May 10 '24

@AndreZiviani I did not find that parameter in either the config or the docs. If you meant the parameters under cache_config -> background, they are all left at their default values.

I am not sure whether my memcached_client config is correct; could you please help confirm that?

duj4 commented May 10 '24

It is a metric, not a parameter. I'm still on 2.9.4; I will try upgrading to the latest 2.9 today and see how it behaves.

AndreZiviani commented May 10 '24

@duj4 I believe your issue may be in the addresses field; it should use the DNS service discovery format.

this is my config:

    chunk_store_config:
      cache_lookups_older_than: 3h
      chunk_cache_config:
        default_validity: 0s
        embedded_cache:
          enabled: false
        memcached:
          batch_size: 6
          parallelism: 5
        memcached_client:
          addresses: dnssrv+_memcached-client._tcp.loki-memcached-chunks.loki.svc.cluster.local
          consistent_hash: true
          timeout: 10s

    query_range:
      align_queries_with_step: true
      cache_index_stats_results: false
      cache_results: true
      max_retries: 5
      results_cache:
        cache:
          default_validity: 0s
          memcached:
            batch_size: 6
            parallelism: 5
          memcached_client:
            addresses: dnssrv+_memcached-client._tcp.loki-memcached-frontend.loki.svc.cluster.local
            consistent_hash: true
            timeout: 10s
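
The _memcached-client._tcp prefix in those dnssrv+ addresses follows Kubernetes' SRV naming, _<port-name>._<protocol>.<service>.<namespace>.svc.<cluster-domain>, so it has to match a named port on the memcached Service. A minimal sketch of such a Service (name and selector are illustrative):

    apiVersion: v1
    kind: Service
    metadata:
      name: loki-memcached-chunks
      namespace: loki
    spec:
      selector:
        app.kubernetes.io/name: memcached   # illustrative selector
      ports:
        - name: memcached-client   # this port name is the _memcached-client part of the SRV record
          protocol: TCP
          port: 11211
          targetPort: 11211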

AndreZiviani commented May 10 '24

@AndreZiviani Thanks. The metric loki_cache_background_queue_bytes shows 0, which might indicate that memcached is not being used at all; this is probably owing to an incorrect config. Here is what I changed:

  1. I set the memcached Service to headless; it was ClusterIP before:
(08:18 debug-tools:/) nslookup loki-memcached-results
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   loki-memcached-results.namespace.svc.cluster.local
Address: 172.16.85.200
Name:   loki-memcached-results.namespace.svc.cluster.local
Address: 172.16.135.152
  2. I set the addresses with dnssrv+ like below:
    chunk_store_config:
      chunk_cache_config:
        memcached:
          batch_size: 256
          parallelism: 10
        memcached_client:
          addresses: 'dnssrv+_loki-memcached-chunks._tcp.loki-memcached-chunks.namespace.svc.cluster.local'

    query_range:
      align_queries_with_step: true
      cache_results: true
      results_cache:
        cache:
          default_validity: 12h
          memcached_client:
            addresses: 'dnssrv+_loki-memcached-results._tcp.loki-memcached-results.namespace.svc.cluster.local'
            timeout: 500ms

But loki-read complained:

level=error ts=2024-05-11T05:41:47.063033939Z caller=resolver.go:100 msg="failed to lookup SRV records" host=_loki-memcached-results._tcp.loki-memcached-results.namespace.svc.cluster.local err="lookup _loki-memcached-results._tcp.loki-memcached-results.namespace.svc.cluster.local on 10.96.0.10:53: no such host"
level=debug ts=2024-05-11T05:41:47.063084311Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers= count=0
level=error ts=2024-05-11T05:41:47.069787604Z caller=resolver.go:100 msg="failed to lookup SRV records" host=_loki-memcached-chunks._tcp.loki-memcached-chunks.namespace.svc.cluster.local err="lookup _loki-memcached-chunks._tcp.loki-memcached-chunks.namespace.svc.cluster.local on 10.96.0.10:53: no such host"
level=debug ts=2024-05-11T05:41:47.069864355Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers= count=0

  3. I set the addresses back to dns+:
    chunk_store_config:
      chunk_cache_config:
        memcached:
          batch_size: 256
          parallelism: 10
        memcached_client:
          addresses: 'dns+loki-memcached-chunks.namespace.svc.cluster.local:11211'

    query_range:
      align_queries_with_step: true
      cache_results: true
      results_cache:
        cache:
          default_validity: 12h
          memcached_client:
            addresses: 'dns+loki-memcached-results.namespace.svc.cluster.local:11211'
            timeout: 500ms

This time it did resolve the memcached addresses:

level=debug ts=2024-05-11T08:26:00.136867785Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=172.16.85.200:11211,172.16.135.152:11211 count=2
level=debug ts=2024-05-11T08:26:00.144243747Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=172.16.32.134:11211,172.16.58.230:11211 count=2

But it complained while querying:

level=warn ts=2024-05-11T08:30:35.403495417Z caller=background.go:199 msg="backgroundCache writeBackLoop Cache.Store fail" err="server=172.16.85.200:11211: read tcp 172.16.85.211:35906->172.16.85.200:11211: read: connection reset by peer"
level=warn ts=2024-05-11T08:30:35.403583334Z caller=background.go:199 msg="backgroundCache writeBackLoop Cache.Store fail" err="server=172.16.85.200:11211: read tcp 172.16.85.211:35900->172.16.85.200:11211: read: connection reset by peer"
level=warn ts=2024-05-11T08:30:35.403639352Z caller=background.go:199 msg="backgroundCache writeBackLoop Cache.Store fail" err="server=172.16.135.152:11211: read tcp 172.16.85.211:41514->172.16.135.152:11211: read: connection reset by peer"
level=warn ts=2024-05-11T08:30:35.403683419Z caller=background.go:199 msg="backgroundCache writeBackLoop Cache.Store fail" err="server=172.16.85.200:11211: read tcp 172.16.85.211:35894->172.16.85.200:11211: read: connection reset by peer"

loki-memcached-results-0:

29: going from conn_closing to conn_closed
SSL connection failed with error code : 1 : No error information
accept4(): Resource temporarily unavailable
SSL connection failed with error code : 1 : No error information
accept4(): Resource temporarily unavailable
<27 new auto-negotiating client connection
27: going from conn_new_cmd to conn_waiting
27: going from conn_waiting to conn_read
27: going from conn_read to conn_closing
<27 connection closed.
27: going from conn_closing to conn_closed
SSL connection failed with error code : 1 : No error information
accept4(): Resource temporarily unavailable

I have enabled TLS on memcached and all Loki components, but the connections from loki-read were dropped by the memcached server.

Well, I found that Loki 2.9.x does not support TLS connections to memcached (https://grafana.com/docs/loki/v2.9.x/configure/#cache_config); support only arrives in 3.x (https://grafana.com/docs/loki/latest/configure/#cache_config). Someone asked about the same thing in https://github.com/grafana/loki/issues/10548.
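
For anyone already on 3.x, the linked cache_config page documents TLS options on memcached_client roughly along the lines of the sketch below; treat the exact field names and the paths as assumptions to be checked against that page:

    memcached_client:
      addresses: 'dns+loki-memcached-results.<namespace>.svc.cluster.local:11211'
      tls_enabled: true                         # 3.x only
      tls_cert_path: /etc/loki/certs/tls.crt    # placeholder paths
      tls_key_path: /etc/loki/certs/tls.key
      tls_ca_path: /etc/loki/certs/ca.crt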

Once I commented out all TLS-related configuration in the memcached objects, the cache seems to be working.

But the value of loki_cache_background_queue_bytes is still 0 in all 3 sections.

read:

# HELP loki_cache_background_queue_bytes Amount of data in the background writeback queue.
# TYPE loki_cache_background_queue_bytes gauge
loki_cache_background_queue_bytes{name="chunksmemcache"} 0
loki_cache_background_queue_bytes{name="frontend.memcache"} 0

write:

# HELP loki_cache_background_queue_bytes Amount of data in the background writeback queue.
# TYPE loki_cache_background_queue_bytes gauge
loki_cache_background_queue_bytes{name="chunksmemcache"} 0

backend:

# HELP loki_cache_background_queue_bytes Amount of data in the background writeback queue.
# TYPE loki_cache_background_queue_bytes gauge
loki_cache_background_queue_bytes{name="chunksmemcache"} 0

Did I miss something?

duj4 commented May 11 '24

Cool!

I have enabled TLS on memcached and all Loki components, but the connections from loki-read were dropped by the memcached server. Well, I found that Loki 2.9.x does not support TLS connections to memcached.

good to know!

But loki-read complained:

level=error ts=2024-05-11T05:41:47.063033939Z caller=resolver.go:100 msg="failed to lookup SRV records" host=_loki-memcached-results._tcp.loki-memcached-results.namespace.svc.cluster.local err="lookup _loki-memcached-results._tcp.loki-memcached-results.namespace.svc.cluster.local on 10.96.0.10:53: no such host"

This probably means that you declared a different name for the port; mine is memcached-client. By the way, mine is still ClusterIP.

But the value of loki_cache_background_queue_bytes is still 0 in all 3 sections

That is OK; it just means that either you have enough resources to flush the queue to memcached immediately, or you are still not actually using it. Check the metrics loki_cache_background_enqueued_bytes_total and loki_cache_background_dequeued_bytes_total to make sure it is actually being used.

This is the query for dedup ratio (you may need to adjust labels):

sum(increase(loki_chunk_store_deduped_chunks_total{cluster="$cluster", job=~"($namespace)/ingester.*"}[$__rate_interval]))/sum(increase(loki_ingester_chunks_flushed_total{cluster="$cluster", job=~"($namespace)/ingester.*"}[$__rate_interval])) < 1

AndreZiviani commented May 13 '24

This probably means that you declared a different name for the port; mine is memcached-client. By the way, mine is still ClusterIP.

Thanks @AndreZiviani, I finally found the root cause: the service in memcached_client has to be the port name, not anything else. Once I changed it to mine, the logs are quiet, regardless of whether the Service is headless or not. However, per https://github.com/bitnami/charts/issues/17154#issuecomment-1598305250, headless is the suggested setup, I think?
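
In other words, the label after dnssrv+_ in the addresses (or the service field, when host/service discovery is used instead) has to be the port name declared on the memcached Service. A sketch of the corrected client block, assuming that port is named memcached-client:

    chunk_store_config:
      chunk_cache_config:
        memcached_client:
          # "memcached-client" is the *port name* on the memcached Service,
          # not the Service name; adjust to whatever your chart declares
          addresses: 'dnssrv+_memcached-client._tcp.loki-memcached-chunks.<namespace>.svc.cluster.local'
          consistent_hash: true
          timeout: 500ms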

That is OK; it just means that either you have enough resources to flush the queue to memcached immediately, or you are still not actually using it. Check the metrics loki_cache_background_enqueued_bytes_total and loki_cache_background_dequeued_bytes_total to make sure it is actually being used.

I searched some older log files and found these metrics from loki-read; that means memcached is working, isn't it? There isn't much explanation in the Loki documentation or elsewhere:

loki_cache_background_dequeued_bytes_total{name="chunksmemcache"} 0
loki_cache_background_dequeued_bytes_total{name="frontend.memcache"} 7832
loki_cache_background_enqueued_bytes_total{name="chunksmemcache"} 0
loki_cache_background_enqueued_bytes_total{name="frontend.memcache"} 7832

duj4 commented May 14 '24

However, per https://github.com/bitnami/charts/issues/17154#issuecomment-1598305250, headless is the suggested setup, I think?

It depends on how you access it: if you just use the Service endpoint it will only "see" one instance of memcached, but I think Loki resolves the underlying endpoints (the targets of the Service) and connects to all of them.
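
If you do want the headless variant, the only difference is clusterIP: None, which makes the A-record lookup used by the dns+ prefix return every memcached pod IP instead of the single virtual IP (this matches the nslookup output above that returned two pod addresses). A sketch, with illustrative names:

    apiVersion: v1
    kind: Service
    metadata:
      name: loki-memcached-results
      namespace: loki
    spec:
      clusterIP: None   # headless: DNS A lookup returns each memcached pod IP
      selector:
        app.kubernetes.io/name: memcached
      ports:
        - name: memcached-client
          port: 11211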

I searched some older log files and found these metrics from loki-read; that means memcached is working, isn't it?

Looks like it is. Don't forget to also check the loki-write metrics; chunksmemcache should not be 0 there.

AndreZiviani commented May 14 '24

Thanks @AndreZiviani, that really helps.

duj4 commented May 15 '24