Memcached seems not working in Loki 2.9.7
Describe the bug
I tried to add memcached for chunks and results in my Loki cluster, but according to memcached_exporter it does not seem to be working at all. I suspect something is misconfigured, but I cannot tell where; there are no obvious errors in the logs.
To Reproduce
Config as below:

```yaml
chunk_store_config:
  chunk_cache_config:
    memcached:
      batch_size: 256
      parallelism: 10
    memcached_client:
      host: 'loki-memcached-chunks.<namespace>.svc.cluster.local'
      service: 'loki-memcached-chunks'
      addresses: 'loki-memcached-chunks.<namespace>.svc.cluster.local:11211'
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      default_validity: 12h
      memcached_client:
        host: 'loki-memcached-results.<namespace>.svc.cluster.local'
        service: 'loki-memcached-results'
        addresses: 'loki-memcached-results.<namespace>.svc.cluster.local:11211'
        timeout: 500ms
```
Expected behavior
Memcached should be working.
Environment:
- Infrastructure: K8S
- Deployment tool: helm
- Version: 2.9.7
Screenshots, Promtail config, or terminal output
When I ran a query, here is the output from loki-read.log:

```
2024-04-21T18:33:58.183333503+08:00 stderr F level=warn ts=2024-04-21T10:33:58.091176695Z caller=experimental.go:20 msg="experimental feature in use" feature="DNS-based memcached service discovery"
2024-04-21T18:33:58.183340434+08:00 stderr F level=debug ts=2024-04-21T10:33:58.103712548Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.219.81:11211 count=1
2024-04-21T18:33:58.183344901+08:00 stderr F level=warn ts=2024-04-21T10:33:58.104945074Z caller=experimental.go:20 msg="experimental feature in use" feature="DNS-based memcached service discovery"
2024-04-21T18:33:58.18335424+08:00 stderr F level=debug ts=2024-04-21T10:33:58.1114272Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.22.143:11211 count=1
2024-04-21T18:34:58.182136789+08:00 stderr F level=debug ts=2024-04-21T10:34:58.107151558Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.219.81:11211 count=1
2024-04-21T18:34:58.182174134+08:00 stderr F level=debug ts=2024-04-21T10:34:58.120451469Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.22.143:11211 count=1
2024-04-21T18:35:58.181804079+08:00 stderr F level=debug ts=2024-04-21T10:35:58.106780861Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.219.81:11211 count=1
2024-04-21T18:35:58.181826795+08:00 stderr F level=debug ts=2024-04-21T10:35:58.119872665Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.22.143:11211 count=1
2024-04-21T18:36:58.182384859+08:00 stderr F level=debug ts=2024-04-21T10:36:58.111790906Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.219.81:11211 count=1
2024-04-21T18:36:58.182417735+08:00 stderr F level=debug ts=2024-04-21T10:36:58.119808487Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=10.96.22.143:11211 count=1
```
Did you enable memcached? See this section of the values.yaml file for an example.
@JStickler Thanks for the reply. Yes, I run memcached in the same chart as Loki and use the addresses config, but I am not sure whether the config is correct. Here is the config after rendering:
```yaml
chunk_store_config:
  chunk_cache_config:
    memcached:
      batch_size: 256
      parallelism: 10
    memcached_client:
      host: ""
      service: "memcache"
      addresses: 'loki-memcached-chunks.<namespace>.svc.cluster.local:11211'
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      default_validity: 12h
      memcached_client:
        host: ""
        service: "memcache"
        addresses: 'loki-memcached-results.<namespace>.svc.cluster.local:11211'
        timeout: 500ms
```
There were a few bugs around cache writeback; check these PRs:
https://github.com/grafana/loki/pull/12456
https://github.com/grafana/loki/pull/11776
The last one is the most relevant. TL;DR: there was a bug in how the queue size was calculated, so after some time Loki simply stops using memcached, affecting both performance and deduplication of chunks.
You can check the metric loki_cache_background_queue_bytes; it will probably be constant at 500MB (the default value).
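One way to inspect that gauge is to scrape a Loki component's /metrics endpoint and filter it. A minimal sketch, assuming you have saved a scrape to a local metrics.txt (the deploy/loki-read pod reference and port 3100 in the comment are illustrative, not from this thread):

```shell
# Save a scrape first, e.g. (pod name and port are assumptions):
#   kubectl -n <namespace> exec deploy/loki-read -- \
#     wget -qO- http://localhost:3100/metrics > metrics.txt
# Then filter the writeback queue gauge; a value pinned near 500MB
# (524288000 bytes) would match the bug described above.
grep '^loki_cache_background_queue_bytes' metrics.txt
```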
@AndreZiviani, thanks for the reply. It seems this issue was already fixed in 2.9.5: https://github.com/grafana/loki/issues/11821?
It was only mentioned on the 3.0 changelog but looks like you are right
Can you check your loki_cache_background_queue_bytes?
@AndreZiviani I did not find that parameter in either the config or the docs. If you mean the parameters under cache_config -> background, they are all set to their default values.
I am not sure whether my config of memcached_client is correct or not, could you please help confirm that?
It is a metric, not a parameter. I'm still on 2.9.4; I will try upgrading today to the latest 2.9 and see how it behaves.
@duj4 I believe your issue may be in the addresses field; it should use the DNS service discovery format.
this is my config:
```yaml
chunk_store_config:
  cache_lookups_older_than: 3h
  chunk_cache_config:
    default_validity: 0s
    embedded_cache:
      enabled: false
    memcached:
      batch_size: 6
      parallelism: 5
    memcached_client:
      addresses: dnssrv+_memcached-client._tcp.loki-memcached-chunks.loki.svc.cluster.local
      consistent_hash: true
      timeout: 10s
query_range:
  align_queries_with_step: true
  cache_index_stats_results: false
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      default_validity: 0s
      memcached:
        batch_size: 6
        parallelism: 5
      memcached_client:
        addresses: dnssrv+_memcached-client._tcp.loki-memcached-frontend.loki.svc.cluster.local
        consistent_hash: true
        timeout: 10s
```
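The dnssrv+ prefix makes the client resolve an SRV record, whose first label is the Service's *port name*. A small sketch of how that lookup name is assembled (the memcached-client port name, service, and namespace values below are assumptions; they must match your own Service spec):

```shell
# Build the SRV name that a dnssrv+ address resolves.
port_name="memcached-client"        # the `name:` of the port in the Service spec
service="loki-memcached-chunks"     # the Service name
namespace="loki"
srv="_${port_name}._tcp.${service}.${namespace}.svc.cluster.local"
echo "$srv"
# From inside the cluster, verify it resolves before putting it in the Loki config:
#   nslookup -type=SRV "$srv"
```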
@AndreZiviani Thanks, the metric loki_cache_background_queue_bytes shows 0, which might indicate that memcached is not being used; this should be owing to the incorrect config.
- I set the memcached Service to headless (it was ClusterIP before):
```
(08:18 debug-tools:/) nslookup loki-memcached-results
Server:   10.96.0.10
Address:  10.96.0.10#53

Name: loki-memcached-results.namespace.svc.cluster.local
Address: 172.16.85.200
Name: loki-memcached-results.namespace.svc.cluster.local
Address: 172.16.135.152
```
- I set the addresses with `dnssrv+` like below:
```yaml
chunk_store_config:
  chunk_cache_config:
    memcached:
      batch_size: 256
      parallelism: 10
    memcached_client:
      addresses: 'dnssrv+_loki-memcached-chunks._tcp.loki-memcached-chunks.namespace.svc.cluster.local'
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      default_validity: 12h
      memcached_client:
        addresses: 'dnssrv+_loki-memcached-results._tcp.loki-memcached-results.namespace.svc.cluster.local'
        timeout: 500ms
```
But loki-read complained:

```
level=error ts=2024-05-11T05:41:47.063033939Z caller=resolver.go:100 msg="failed to lookup SRV records" host=_loki-memcached-results._tcp.loki-memcached-results.namespace.svc.cluster.local err="lookup _loki-memcached-results._tcp.loki-memcached-results.namespace.svc.cluster.local on 10.96.0.10:53: no such host"
level=debug ts=2024-05-11T05:41:47.063084311Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers= count=0
level=error ts=2024-05-11T05:41:47.069787604Z caller=resolver.go:100 msg="failed to lookup SRV records" host=_loki-memcached-chunks._tcp.loki-memcached-chunks.namespace.svc.cluster.local err="lookup _loki-memcached-chunks._tcp.loki-memcached-chunks.namespace.svc.cluster.local on 10.96.0.10:53: no such host"
level=debug ts=2024-05-11T05:41:47.069864355Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers= count=0
```
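The "no such host" error usually means the first label of the SRV name does not match the Service's declared port name. A sketch for checking what the port is actually called (the kubectl line needs a live cluster; svc.yaml in the offline variant is a hypothetical saved manifest):

```shell
# On a live cluster (standard kubectl jsonpath; service/namespace as in this thread):
#   kubectl -n namespace get svc loki-memcached-results \
#     -o jsonpath='{.spec.ports[*].name}'
# Offline, against a saved manifest, list the `name:` entries under `ports:`:
sed -n '/^  ports:/,/^  [a-z]/p' svc.yaml | grep -o 'name: .*'
```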
- I set the addresses back to `dns+`:
```yaml
chunk_store_config:
  chunk_cache_config:
    memcached:
      batch_size: 256
      parallelism: 10
    memcached_client:
      addresses: 'dns+loki-memcached-chunks.namespace.svc.cluster.local:11211'
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      default_validity: 12h
      memcached_client:
        addresses: 'dns+loki-memcached-results.namespace.svc.cluster.local:11211'
        timeout: 500ms
```
This time it does resolve the addresses of memcached:

```
level=debug ts=2024-05-11T08:26:00.136867785Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=172.16.85.200:11211,172.16.135.152:11211 count=2
level=debug ts=2024-05-11T08:26:00.144243747Z caller=memcached_client_selector.go:105 msg="updating memcached servers" servers=172.16.32.134:11211,172.16.58.230:11211 count=2
```
But it complains while querying:

```
level=warn ts=2024-05-11T08:30:35.403495417Z caller=background.go:199 msg="backgroundCache writeBackLoop Cache.Store fail" err="server=172.16.85.200:11211: read tcp 172.16.85.211:35906->172.16.85.200:11211: read: connection reset by peer"
level=warn ts=2024-05-11T08:30:35.403583334Z caller=background.go:199 msg="backgroundCache writeBackLoop Cache.Store fail" err="server=172.16.85.200:11211: read tcp 172.16.85.211:35900->172.16.85.200:11211: read: connection reset by peer"
level=warn ts=2024-05-11T08:30:35.403639352Z caller=background.go:199 msg="backgroundCache writeBackLoop Cache.Store fail" err="server=172.16.135.152:11211: read tcp 172.16.85.211:41514->172.16.135.152:11211: read: connection reset by peer"
level=warn ts=2024-05-11T08:30:35.403683419Z caller=background.go:199 msg="backgroundCache writeBackLoop Cache.Store fail" err="server=172.16.85.200:11211: read tcp 172.16.85.211:35894->172.16.85.200:11211: read: connection reset by peer"
```
loki-memcached-results-0:

```
29: going from conn_closing to conn_closed
SSL connection failed with error code : 1 : No error information
accept4(): Resource temporarily unavailable
SSL connection failed with error code : 1 : No error information
accept4(): Resource temporarily unavailable
<27 new auto-negotiating client connection
27: going from conn_new_cmd to conn_waiting
27: going from conn_waiting to conn_read
27: going from conn_read to conn_closing
<27 connection closed.
27: going from conn_closing to conn_closed
SSL connection failed with error code : 1 : No error information
accept4(): Resource temporarily unavailable
```
I had enabled TLS on memcached and all Loki components, but the connection from loki-read was dropped by the memcached server.
Well, I found that Loki 2.9.x does not support TLS connections to memcached: https://grafana.com/docs/loki/v2.9.x/configure/#cache_config, but support starts in 3.x: https://grafana.com/docs/loki/latest/configure/#cache_config. Someone asked the same in https://github.com/grafana/loki/issues/10548.
Once I commented out all TLS-related configuration in the memcached objects, the cache seems to be working.
But the value of loki_cache_background_queue_bytes is still 0 in all 3 components:
read:

```
# HELP loki_cache_background_queue_bytes Amount of data in the background writeback queue.
# TYPE loki_cache_background_queue_bytes gauge
loki_cache_background_queue_bytes{name="chunksmemcache"} 0
loki_cache_background_queue_bytes{name="frontend.memcache"} 0
```

write:

```
# HELP loki_cache_background_queue_bytes Amount of data in the background writeback queue.
# TYPE loki_cache_background_queue_bytes gauge
loki_cache_background_queue_bytes{name="chunksmemcache"} 0
```

backend:

```
# HELP loki_cache_background_queue_bytes Amount of data in the background writeback queue.
# TYPE loki_cache_background_queue_bytes gauge
loki_cache_background_queue_bytes{name="chunksmemcache"} 0
```
Did I miss something?
Cool!
> I have enabled TLS on memcached and all loki components, but the connection from loki-read was dropped by memcached server. Well, I found Loki 2.9.x do not support TLS connection to memcached:
good to know!
> But loki-read complained:
> level=error ts=2024-05-11T05:41:47.063033939Z caller=resolver.go:100 msg="failed to lookup SRV records" host=_loki-memcached-results._tcp.loki-memcached-results.namespace.svc.cluster.local err="lookup _loki-memcached-results._tcp.loki-memcached-results.namespace.svc.cluster.local on 10.96.0.10:53: no such host"
this probably means that you declared a different name for the port; mine is memcached-client
btw mine is still ClusterIP
> But the value of loki_cache_background_queue_bytes is still 0 in all 3 sections
that is ok, it just means that either you have enough resources to flush the queue to memcached or you are still not actually using it. Check the metrics loki_cache_background_enqueued_bytes_total and loki_cache_background_dequeued_bytes_total to make sure it is actually being used
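A quick way to read those counters is to filter a saved metrics scrape; a sketch assuming metrics.txt holds a dump of a Loki component's /metrics endpoint:

```shell
# Writeback counters per cache: enqueued growing while dequeued lags behind
# means the queue is backing up; both staying 0 means that cache path is
# simply not being exercised.
grep -E '^loki_cache_background_(en|de)queued_bytes_total' metrics.txt
```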
This is the query for the dedup ratio (you may need to adjust labels):

```
sum(increase(loki_chunk_store_deduped_chunks_total{cluster="$cluster", job=~"($namespace)/ingester.*"}[$__rate_interval]))
/
sum(increase(loki_ingester_chunks_flushed_total{cluster="$cluster", job=~"($namespace)/ingester.*"}[$__rate_interval])) < 1
```
> this probably means that you declared a different name for the port, mine is memcached-client
> btw mine is still ClusterIP
Thanks @AndreZiviani, I finally found the root cause: the service in memcached_client must be the port name, not anything else. Once I changed it to mine, the log is quiet, regardless of whether the service is headless or not. However, per https://github.com/bitnami/charts/issues/17154#issuecomment-1598305250, headless is the suggested one, I think?
> that is ok, it just mean that either you have enough resources to flush the queue to memcached or you are still not actually using it, check these metrics: loki_cache_background_enqueued_bytes_total and loki_cache_background_dequeued_bytes_total to make sure it is actually using it
I searched some older log files and found these metrics from loki-read, which means memcached is working, isn't it? There isn't much explanation in the Loki documentation or elsewhere:

```
loki_cache_background_dequeued_bytes_total{name="chunksmemcache"} 0
loki_cache_background_dequeued_bytes_total{name="frontend.memcache"} 7832
loki_cache_background_enqueued_bytes_total{name="chunksmemcache"} 0
loki_cache_background_enqueued_bytes_total{name="frontend.memcache"} 7832
```
> However, per https://github.com/bitnami/charts/issues/17154#issuecomment-1598305250, headless is the suggested one I think?
Depends on how you access it; if you just use the service endpoint it will only "see" one instance of memcached, but I think Loki will get the underlying endpoints (the targets of the service) and connect to all of them
> I tried to search some older log files and I found the metrics from loki-read, that means the memcached is working, isn't it?
looks like it is; don't forget to also check the loki-write metrics, the chunksmemcache there should not be 0
Thanks @AndreZiviani , that really helps.