
Large queries cause large memcached delays, which in turn cause 2-3x resource consumption and 2-3x worse query performance

Open wiardvanrij opened this issue 2 years ago • 20 comments

Intro

I'm trying to debug (and reproduce) some performance issues we are having. Eventually I managed to reproduce the behaviour, which comes down to how memcached behaves and how its timeouts are handled.

Summary is:

  1. Run an expensive query, e.g. sum by(your-label) (rate({your-label="foo"} |= `` [$__interval]))
  2. The rate above should return roughly 4-5k lines/s, so there is a "larger" load to test against
  3. Run the query over the last hour
  4. Have a memcached setup
  5. Notice the queriers logging warnings like:
level=warn ts=2022-12-12T10:01:57.938104237Z caller=memcached.go:220 msg="failed to put to memcached" name=chunks err="server=100.98.56.68:11211: read tcp 100.97.243.247:60382->100.98.56.68:11211: i/o timeout"
level=warn ts=2022-12-12T10:01:52.340184048Z caller=memcached.go:220 msg="failed to put to memcached" name=chunks err="server=100.98.56.68:11211: read tcp 100.98.113.236:38376->100.98.56.68:11211: i/o timeout"
level=warn ts=2022-12-12T10:01:52.345175685Z caller=memcached.go:220 msg="failed to put to memcached" name=chunks err="server=100.98.56.68:11211: read tcp 100.98.113.236:38382->100.98.56.68:11211: i/o timeout"
level=warn ts=2022-12-12T10:01:57.946012867Z caller=background.go:127 msg="backgroundCache writeBackLoop Cache.Store fail" err="server=100.98.56.68:11211: read tcp 100.97.243.247:60444->100.98.56.68:11211: i/o timeout"
  6. These timeouts cause performance issues and increased CPU/memory load

Memcached-related config:

        memcached:
          batch_size: 256
          parallelism: 10
        memcached_client:
          timeout: 500ms
          max_idle_conns: 60
          consistent_hash: true

Why not increase timeout?

As you can see, I've already increased the timeout from the default 100ms to 500ms, which IMHO is already quite substantial. I even tried setting it to 1500ms (1.5 seconds, for memcached?!). While that does "help" a little, the larger the query, the further I would have to keep increasing it.

Scale memcached horizontally?

With the above query and settings, I've already scaled memcached to 12 instances. The problem is that while this helps spread the load, it does not scale properly: any larger query needs even more replicas, to the point that I might need 40 or even 100 memcached instances for it not to error.

More importantly, the effects:

With 50000ms timeout

Stats
Total request time  8.23 s
Number of queries 1
Total number rows 1801
Data source stats
Summary: bytes processed per second 5.45 GB/s
Summary: lines processed per second 5340548
Summary: total bytes processed  38.9 GB
Summary: total lines processed  38110633
Summary: exec time  7.14 s
Ingester: total reached 144
Ingester: total chunks matched  1279
Ingester: total batches 7553
Ingester: total lines sent  3859683
Ingester: head chunk bytes  0 B
Ingester: head chunk lines  0
Ingester: decompressed bytes  0 B
Ingester: decompressed lines  0
Ingester: compressed bytes  0 B
Ingester: total duplicates  0

With 500ms timeout

Stats
Total request time  19.5 s
Number of queries 1
Total number rows 1801
Data source stats
Summary: bytes processed per second 2.20 GB/s
Summary: lines processed per second 2157152
Summary: total bytes processed  42.3 GB
Summary: total lines processed  41504927
Summary: exec time  19.2 s
Ingester: total reached 144
Ingester: total chunks matched  1736
Ingester: total batches 10065
Ingester: total lines sent  5139368
Ingester: head chunk bytes  0 B
Ingester: head chunk lines  0
Ingester: decompressed bytes  0 B
Ingester: decompressed lines  0
Ingester: compressed bytes  0 B
Ingester: total duplicates  0

Extra info

The example above actually shows only one querier "going out of bounds". With a larger query, though, you can see the effect on every querier, which increases CPU and memory usage by 2-3x across all queriers and degrades query performance by 2-3x as well.

The bug?

IMO, if the timeout is 'low', it should not impact performance and resources this much. Furthermore, why would memcached hit such high latency on a query like this? Memcached should be capable of handling a large number of requests, and it's not as if we are pushing it with thousands of concurrent requests.

Furthermore, the higher I set the timeout (even stupidly high), the more performance improves. It almost looks as if the timeout needs to cover the full query time rather than just the individual memcached call.

Versions

memcached:1.6.17-alpine, loki:2.6.1. Runs on K8s, but also reproduced locally.

wiardvanrij avatar Dec 12 '22 10:12 wiardvanrij

We are seeing this symptom also.

jphx avatar Jan 23 '23 15:01 jphx

Same issue..

dansl1982 avatar Jan 23 '23 15:01 dansl1982

same problem

fumanne avatar Mar 22 '23 05:03 fumanne

same problem

tkblack avatar May 24 '23 08:05 tkblack

Same problem with v2.7.3

alexandre1984rj avatar May 25 '23 21:05 alexandre1984rj

Stumbled upon this issue too. Same with Loki 2.8.2 and memcached 1.6.20.

vquie avatar May 31 '23 12:05 vquie

Same with Loki 2.8.2 and memcached 1.6.17 on AWS EC2 with ElastiCache.

a-patos avatar Jul 11 '23 19:07 a-patos

same here.

lucianocoutinho avatar Aug 03 '23 15:08 lucianocoutinho

same problem with Loki 2.8.3 and memcached 1.6.21

SenlerBot avatar Aug 05 '23 13:08 SenlerBot

Same problem. Chart version: 5.11.0, Loki version: 2.8.4.

mbonanata avatar Aug 16 '23 20:08 mbonanata

We were discussing this in the memcached discord: https://discord.com/channels/634311773750689794/1144428486107336724/1144501450282639430

It's difficult to understand what's going on here without your memcached config, and an understanding of your memcached setup. Is it one server or hundreds? Was there network contention? etc

dannykopping avatar Aug 27 '23 14:08 dannykopping

bump @wiardvanrij

gespi1 avatar Sep 13 '23 18:09 gespi1

Hi, is there a takeaway from this discussion on Discord? We are experiencing poor cache hit ratios when doing large queries on data that is (supposedly) in the extended cache, and I was wondering whether the timeout errors we see in the Loki logs explain the cache misses. Also, should we set the timeout to an incredibly high value?

saule1508 avatar Nov 04 '23 04:11 saule1508

Hi folks. I'm leaving this here for the thread and googlers who come in here: https://github.com/memcached/memcached/wiki/ConfiguringLokiExtstore

TL;DR: Set your memcached timeout very high and lower the batch_size and parallelism down to very low numbers.

This is specifically about loki and extstore, but it discusses the timeout issue a lot of you seem to be running into with vanilla memcached.

To reiterate here: the timeout value is actually the amount of time to read and parse the entire response batch from each individual server. If you have a small number of memcached servers (i.e. 1) and a high batch_size (like the default 1024), you would be trying to fetch up to 1024 1.5-megabyte chunks over the network from a single host and process the results within the timeout. To put numbers on that: 1024 × 1.5 MB is roughly 1.5 GB per batch, which can't even cross a 10 Gbit link, let alone be parsed, inside a 500ms window.

Larger clusters probably don't notice this, since each individual memcached gets a slice of batch_size keys at once.

Hints as to tuning this are in the linked wiki page. There are some things Loki can do to improve the situation so we'll see how that shakes out.
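
In Loki's chunk cache config, that advice translates to something roughly like the sketch below; the exact numbers are illustrative only, see the wiki page above for concrete guidance:

chunk_store_config:
  chunk_cache_config:
    memcached:
      batch_size: 4      # much lower than the 1024 default
      parallelism: 5     # keep concurrent fetches per server low
    memcached_client:
      timeout: 10s       # generous: it bounds reading a whole batch from each server
      max_idle_conns: 60
      consistent_hash: true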

have fun.

dormando avatar Nov 10 '23 01:11 dormando

I've created #11345 to address this issue now that we have a better handle on the problem, courtesy of @dormando's detective work.

dannykopping avatar Dec 01 '23 04:12 dannykopping

The PR has been merged; I'll be vendoring this code into Loki soon™️ and we'll cut a release after validating this in production

dannykopping avatar Dec 04 '23 15:12 dannykopping

@dannykopping Any update on the Loki Release...

Hitesh-Agrawal avatar Jan 19 '24 11:01 Hitesh-Agrawal

@Hitesh-Agrawal probably the next minor release, which would be sometime in the next few weeks.

dannykopping avatar Jan 19 '24 12:01 dannykopping

@dannykopping Any update on the fix..

hiteshagrawal avatar Feb 19 '24 10:02 hiteshagrawal

Hi! Any update on this?

Paulius0112 avatar Feb 19 '24 10:02 Paulius0112