
store: i/o timeouts with new memcached config

Open kamilhristov opened this issue 4 years ago • 46 comments

👋 Hello again and thanks for the quick response in https://github.com/thanos-io/thanos/issues/1974

Thanos version used: latest master with https://github.com/thanos-io/thanos/pull/1975

Object Storage Provider: AWS S3

What happened: I am testing the new memcached config for Thanos Store and I am seeing a lot of timeouts. The timeouts only go away when I raise the timeout setting to 10s 🤕

The memcached servers look OK and the number of connections doesn't exceed max_idle_connections.

Thanos Store process and Memcached servers are in the same AZ.

I tried the default config and also tweaked most of the parameters, but to no avail.

Any recommendations?

Arguments:

thanos store --index-cache.config-file=/app/cache-config.yaml --chunk-pool-size=6GB --objstore.config-file=/app/objstore-config.yaml

Config:

type: MEMCACHED
config:
  addresses:
    - "****.0001.euw1.cache.amazonaws.com:11211"
    - "****.0002.euw1.cache.amazonaws.com:11211"
  timeout: 1s
  max_idle_connections: 500
  max_async_concurrency: 200
  max_async_buffer_size: 10000
  max_get_multi_concurrency: 200
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s

Logs:

level=warn ts=2020-01-10T11:57:31.215656814Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46YH1MSD2N9PZ7M839E0SS:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35492->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:31.215238251Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46YH1MSD2N9PZ7M839E0SS:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35490->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:31.076957889Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46FXKKFSNF3D9TVXB13G1H:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35496->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.831121025Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJS43Y6CEXAMVT0HZVC480E:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:34270->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.830995908Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJS43Y6CEXAMVT0HZVC480E:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:50162->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.754847072Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJSHHT4N25BRSMM27M3QH8K:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:50160->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.754715844Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJSHHT4N25BRSMM27M3QH8K:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:46124->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.148112488Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:34404->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.148818833Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:47196->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.108056739Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:48012->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.105953129Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:34276->10.42.8.43:11211: i/o timeout"

kamilhristov avatar Jan 10 '20 12:01 kamilhristov

cc @pracucci @GiedriusS (:

bwplotka avatar Jan 10 '20 17:01 bwplotka

Thanks @kamilhristov for the report. I see from the config that you're using AWS ElastiCache: which instance type are you using for ElastiCache?

pracucci avatar Jan 10 '20 17:01 pracucci

Could you also show the following metrics graphed?

  • rate(thanos_memcached_operations_total[1m])
  • histogram_quantile(1, rate(thanos_memcached_operation_duration_seconds_bucket[1m]))
  • histogram_quantile(0.9, rate(thanos_memcached_operation_duration_seconds_bucket[1m]))
  • sum by (item_type) (rate(thanos_store_index_cache_hits_total[1m]) / rate(thanos_store_index_cache_requests_total[1m]))

Random thoughts:

  • We run memcached inside the same Kubernetes cluster where Thanos is running (on GKE) and we see a 100th percentile latency <= 500ms for memcached operations
  • A timeout when fetching items may be caused by a very large number of keys fetched within a single getmulti request, which you could limit by setting max_get_multi_batch_size: 1000 (1k items per single request looks like a reasonable limit to me; see the sketch below), but this wouldn't explain the timeout when storing items (items are stored individually)
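
For illustration, a hedged sketch of what that batch-size cap would look like applied to the index cache config posted above (all other values kept as originally reported; the cap value is the suggestion above, not a verified recommendation):

type: MEMCACHED
config:
  addresses:
    - "****.0001.euw1.cache.amazonaws.com:11211"
    - "****.0002.euw1.cache.amazonaws.com:11211"
  timeout: 1s
  max_idle_connections: 500
  max_async_concurrency: 200
  max_async_buffer_size: 10000
  max_get_multi_concurrency: 200
  max_get_multi_batch_size: 1000  # was 0 (unlimited keys per getmulti); caps keys per single request
  dns_provider_update_interval: 10s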

pracucci avatar Jan 10 '20 19:01 pracucci

I am using the t3.medium instance type. It is a small one, but I thought it should be sufficient since the in-memory index cache size was 1GB.

Here's a screenshot of the graphs: Screen Shot 2020-01-10 at 21 40 55

I can also test with m5.large or similar. Also happy to provide more info if needed.

kamilhristov avatar Jan 10 '20 19:01 kamilhristov

So, from your graphs I can see that the 90th percentile latency is good, while the 100th is not. In my experience with AWS, performance on t2/t3 instances is unpredictable, so as a next step I would suggest trying the m5 before iterating further on the root cause analysis. Could you give it a try, please?

pracucci avatar Jan 10 '20 19:01 pracucci

Agree that t2/t3 could cause strange issues :)

Just tried with m5.large and there are still timeouts for both get and set commands.

Here is a screenshot with the new instance type.

Screen Shot 2020-01-10 at 22 17 58

kamilhristov avatar Jan 10 '20 20:01 kamilhristov

  1. Thanks for trying the m5.
  2. Could you show the same charts but with irate() instead of rate()?
  3. Could you try increasing the max idle connections setting and setting max_get_multi_batch_size: 1000 (see the sketch below)?
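
A minimal sketch of those two deltas against the originally posted config (the idle-connection value here is illustrative; the batch size is the one suggested above):

  max_idle_connections: 1000      # raised from 500; illustrative value
  max_get_multi_batch_size: 1000  # was 0, i.e. unlimited batch size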

pracucci avatar Jan 10 '20 20:01 pracucci

I've increased max idle connections to 1k and max_get_multi_batch_size to 1k.

Now I am getting:

level=error ts=2020-01-10T20:46:07.001594337Z caller=memcached.go:126 msg="failed to cache series in memcached" err="the async buffer is full"

Along with:

level=warn ts=2020-01-10T20:46:09.910966063Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.4:52078->10.42.8.159:11211: i/o timeout"

Current connections graph. The drop is after restarting Thanos Store with new parameters.

Screen Shot 2020-01-10 at 22 52 48

And new screenshot with irates 📈

Screen Shot 2020-01-10 at 22 54 10

kamilhristov avatar Jan 10 '20 20:01 kamilhristov

level=error ts=2020-01-10T20:46:07.001594337Z caller=memcached.go:126 msg="failed to cache series in memcached" err="the async buffer is full"

This can be easily fixed by increasing max_async_buffer_size.

pracucci avatar Jan 10 '20 21:01 pracucci

level=warn ts=2020-01-10T20:46:09.910966063Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.4:52078->10.42.8.159:11211: i/o timeout"

I still have no clue about this one. Do the memcached metrics show anything that catches your attention? Among the metrics directly exported by memcached you should see the number of items fetched (we batch reads on the Thanos side): what's the rate there?

pracucci avatar Jan 10 '20 21:01 pracucci

Another hypothesis is that the set operations are saturating the fetches. Just for testing, could you try with the following settings?

max_async_concurrency: 10
max_async_buffer_size: 1000000

pracucci avatar Jan 11 '20 05:01 pracucci

I've also asked about the memcached instance type. Could you also confirm that Thanos Store is not running on a t2/t3 instance, please? I don't think it is, but just to confirm.

pracucci avatar Jan 11 '20 06:01 pracucci

I've run some benchmarks in our cluster, using two tools:

  1. memtier_benchmark by RedisLabs
  2. bench: a quick and dirty benchmark tool based on the Thanos MemcachedClient (code here)

Both benchmarks run against the same memcached server (1 instance), running in the same Kubernetes cluster where the benchmarking tool is running.

The results

$ memtier_benchmark --server=10.32.21.9 --port=11211 --protocol=memcache_text --requests=10000 --clients=200 --ratio=1:1 --data-size-range=100-200 --key-pattern=S:S --multi-key-get=100

4         Threads
200       Connections per thread
10000     Requests per client


ALL STATS
=========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
-------------------------------------------------------------------------
Sets        77044.82          ---          ---      5.23000     13995.64
Gets        77044.82     77044.82         0.00      5.22700     15107.53
Waits           0.00          ---          ---      0.00000          ---
Totals     154089.65     77044.82         0.00      5.22900     29103.17

Request Latency Distribution
Type     <= msec         Percent
------------------------------------------------------------------------
[REDACTED]
SET      56.000       100.00
---
[REDACTED]
GET      56.000       100.00
---
$ ./bench
Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 5 ms max: 76 ms
GETMULTI ops: 2000000 ops/s: 36058.11037331169
Total time: 55.466023574s

Comparison on GetMulti()

  • Raw performances: 77044 ops/s, 5.2ms avg latency, 56ms max latency
  • Benchmark based on Thanos client: 36058 ops/s, 5ms avg latency, 76ms max latency

The performance of the Thanos client doesn't look bad to me. It's about 50% of the raw performance, which looks "reasonable" to me (it shows there's room for improvement, but it's not bad compared to a C++ tool designed to squeeze out performance with no business logic around it).

What's next

Other than the latest comments above:

  • Could you run the same benchmark?
  • Could you make sure that your Thanos Store doesn't hit its CPU limit? Another reason you may experience i/o timeouts is the Thanos Store not having enough CPU resources.

pracucci avatar Jan 12 '20 08:01 pracucci

Sorry for the late response and thanks for spending time on this.

  1. After setting max_async_concurrency, max_async_buffer_size, and all the other recommended settings, I noticed that there are no failing write commands anymore; only reads are failing.

  2. Thanos Store is running on an r5.large - not a huge one, but it is perfectly utilized. Just in case, I bumped the instance to xlarge; now the CPU usage is around 50%, but there are still timeouts.

  3. Benchmark results are not as good as yours, but they still seem OK for our scale.

r5.large

4         Threads
200       Connections per thread
10000     Requests per client


ALL STATS
=========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
-------------------------------------------------------------------------
Sets        28019.71          ---          ---     14.19600      5089.95
Gets        28019.71     28019.71         0.00     14.19200      5494.32
Waits           0.00          ---          ---      0.00000          ---
Totals      56039.43     28019.71         0.00     14.19400     10584.26


Request Latency Distribution
Type     <= msec         Percent
------------------------------------------------------------------------
SET     130.000       100.00
---
GET     140.000       100.00


Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 18 ms max: 190 ms
GETMULTI ops: 2000000 ops/s: 10675.525173500942
Total time: 3m7.344413272s

Then with r5.xlarge max latency is similar but we get more throughput.

Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 9 ms max: 219 ms
GETMULTI ops: 2000000 ops/s: 20164.66655510673
Total time: 1m39.183390637s

CloudWatch is showing ~100,000,000 GET commands per minute when executing the benchmark, while Thanos Store generates only ~25,000/min when hitting a dashboard.

It might be something in our infrastructure that is causing this. I will investigate more on Monday, but I will most likely stick with the in-memory cache for now. It seems like a better option for our scale. Perhaps when our Thanos data grows, it will make more sense to offload the index cache to an external service.

No need to invest more time on this, as it seems it is not caused by Thanos Store itself. Thanks again for helping 🙇

kamilhristov avatar Jan 12 '20 10:01 kamilhristov

Thanks for all your help. I will spend a bit more time on this myself, and I will get back to you in case I find anything.

but will most likely stick with the in-memory cache for now

Agree. The in-memory cache is the best option for the majority of the use cases.
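
For reference, the in-memory index cache the reporter is falling back to takes a much simpler config; a minimal sketch (sizes are purely illustrative, the 1GB matches the size mentioned earlier in the thread):

type: IN-MEMORY
config:
  max_size: 1GB         # total cache size; illustrative
  max_item_size: 128MiB # per-item cap; illustrative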

pracucci avatar Jan 12 '20 14:01 pracucci

I have played around with the new memcached support, and my anecdotal evidence says that this happens when memcached cannot keep up because not enough memory has been allocated to it and/or the timeouts are too low. Increasing the memory size with -m helps, or enabling large pages support with -L (see the sketch below). I will go ahead and release 0.10.0, but we could definitely improve the documentation here with recommendations on how many resources are needed and what the values in the index cache configuration ought to be. Please let us know if you find a stable way to reproduce this.
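
Purely as an illustration of that suggestion, a memcached container fragment for a Kubernetes setup like the ones discussed above might look roughly like this (image tag, memory size and resource values are assumptions, not tested recommendations):

containers:
  - name: memcached
    image: memcached:1.5          # assumed image/tag
    args:
      - "-m"
      - "2048"                    # memory for item storage, in MB (the -m flag mentioned above)
      - "-L"                      # large pages support; requires large/huge pages available on the host
    resources:
      limits:
        memory: 2560Mi            # leave headroom above -m; illustrative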

GiedriusS avatar Jan 12 '20 18:01 GiedriusS

Could this be caused by the size of the cached items? When max_item_size is 10MB, I am seeing "object too large" errors in Thanos Store:

level=warn ts=2020-01-13T08:23:13.964559815Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46FXKKFSNF3D9TVXB13G1H:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="memcache: unexpected response line from \"set\": \"SERVER_ERROR object too large for cache\\r\\n\""

I was able to reproduce the timeouts when executing the benchmark script with a larger data size range. Keep in mind that this doesn't even hit the limit, so the actual Thanos Store items are even bigger.

Clients: 200
Reqs / client: 1000
Keys / req: 100
Data size range: 150000 - 300000
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
level=warn msg="failed to fetch items from memcached" err="write tcp 10.42.8.248:61243->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61161->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61409->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:60915->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61263->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61367->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61253->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61347->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61185->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:60967->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61551->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61449->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61099->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61419->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="write tcp 10.42.8.248:61631->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61239->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61079->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61501->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61531->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61007->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61293->10.42.8.159:11211: i/o timeout"

kamilhristov avatar Jan 13 '20 08:01 kamilhristov

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale[bot] avatar Feb 12 '20 09:02 stale[bot]

Facing the same issue; is there any documentation or work done on this, @pracucci? I partitioned my bucket by time into 3 stores, each using an r5.large memcached cluster (3 nodes) [I have put the memcached endpoint and all node endpoints in the store config], and I'm facing a lot of timeout and async issues.

kuberkaul avatar Apr 05 '20 19:04 kuberkaul

@kuberkaul Could you check the tips in this conversation: https://cloud-native.slack.com/archives/CL25937SP/p1585919142164500? You may find them useful.

Although I haven't tested it yet, I'm working on related configuration in https://github.com/thanos-io/kube-thanos/pull/106.

If you still think this is a valid bug, we can reopen the issue.

kakkoyun avatar Apr 08 '20 11:04 kakkoyun

I talked offline to @kuberkaul and shared some tips. In parallel, I'm investigating a potential issue causing unfair prioritisation within the memcached client due to the usage of the gate. I will open a PR to fix it if confirmed (I will work on it shortly).

pracucci avatar Apr 08 '20 12:04 pracucci

Hi, also facing the same issue 😞 Any further update on this? Been looking through some of the suggestions / tips from here and from the slack channel, and have been tweaking memcache configurations, yet I'm still running into timeout or "failed to wait for turn: context canceled" issues while trying to fetch items from memcache. Any advice / update would be much appreciated, thanks!

joshulyne avatar May 06 '20 07:05 joshulyne

I'm still running into timeout or "failed to wait for turn: context canceled" issues

This log means that a query request has been cancelled. For example, if you open a Grafana dashboard and refresh it while panels are still loading, the previous queries are cancelled; if they're cancelled while waiting for a memcached connection slot, then you get that error logged.

pracucci avatar May 06 '20 08:05 pracucci

@pracucci Ah, that makes sense; thanks for the explanation. I'm still running into the timeout errors though :/ Any further development or thoughts on that?

joshulyne avatar May 06 '20 19:05 joshulyne

I'm also currently experiencing this issue while tuning Thanos Store components. Perhaps this issue can be re-opened?

wiardvanrij avatar Dec 22 '20 14:12 wiardvanrij

Definitely, let's investigate. Help wanted.

bwplotka avatar Dec 22 '20 14:12 bwplotka

The baseline that I started with:

  max_idle_connections: 750
  max_async_concurrency: 100
  max_async_buffer_size: 10000000
  max_item_size: 16MiB
  max_get_multi_concurrency: 900
  max_get_multi_batch_size: 100

Then I increased max_get_multi_batch_size to 1000 and even to 10000.

What I learned was that, as long as you don't have a massive dashboard with something like 12 panels and you go back 12 months in time, the settings do not really matter regarding the i/o timeouts; I simply do not get them with "sane" queries. Yet increasing max_get_multi_batch_size to 1000+ improved my query times a lot. Like, really much smoother results.

So the challenge here is: how do I "limit" the stores from basically DDoSing my memcached instances? So I figured I would set the get-multi concurrency really low and the batch size quite high:

  max_get_multi_concurrency: 10
  max_get_multi_batch_size: 10000

Basically there was no difference in query speed, and I still got i/o errors.

So I figured I'd go down this path:

  timeout: 10s
  max_idle_connections: 1
  max_async_concurrency: 1
  max_async_buffer_size: 10000000
  max_item_size: 16MiB
  max_get_multi_concurrency: 1
  max_get_multi_batch_size: 1000
  dns_provider_update_interval: 10s

I see no difference, and I still get i/o errors when requesting a large dashboard with 30+ days of history (i.e. basically non-sane queries).


From a user perspective, I get the feeling that the settings passed via --index-cache.config-file have no actual effect on the behaviour.

wiardvanrij avatar Dec 22 '20 15:12 wiardvanrij

So, I have fixed this issue for myself, as it was an implementation error on my side xD

After investigating (because the config above did not seem to make any difference), I noticed that the problem was NOT the memcached index cache, but the bucket cache.

When I investigated this further, I noticed that I had made a mistake in its config. This was my config:

config:
  addresses: [*redacted*]
  chunk_subrange_size: 16000
  max_chunks_get_range_requests: 3
  chunk_object_attrs_ttl: 24h
  chunk_subrange_ttl: 24h
  blocks_iter_ttl: 5m
  metafile_exists_ttl: 2h
  metafile_doesnt_exist_ttl: 15m
  metafile_content_ttl: 24h
  metafile_max_size: 16MiB

  • I made the assumption that the bucket cache was not "that special" in comparison to the index cache
  • I missed the line in the docs that stated: "config field for memcached supports all the same configuration as memcached for index cache"

So after fixing my config file for the bucket cache to this, it was fixed:

type: MEMCACHED
config:
  addresses: [*redacted*]
  max_idle_connections: 750
  max_async_concurrency: 100
  max_async_buffer_size: 10000000
  max_item_size: 16MiB
  max_get_multi_concurrency: 750
  max_get_multi_batch_size: 1000
chunk_subrange_size: 16000
max_chunks_get_range_requests: 3
chunk_object_attrs_ttl: 24h
chunk_subrange_ttl: 24h
blocks_iter_ttl: 5m
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h
metafile_max_size: 16MiB

To be honest, a few improvements could be made IMO:

  • Update docs here: https://thanos.io/tip/components/store.md/#caching-bucket with an example of actually implementing the config for memcached
  • Give better log info regarding which memcached type is hit (is the log for index or bucket??)
  • When doing batch requests, only the last error is output. Add a debug log for each error, rather than keeping only the last one.

I'm a bit tired, but I will create issues/PRs tomorrow or something ;)

wiardvanrij avatar Dec 22 '20 23:12 wiardvanrij

Created #3666, #3667 and #3668 :)

wiardvanrij avatar Dec 23 '20 07:12 wiardvanrij

Hello 👋 Looks like there was no activity on this issue for the last two months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] avatar Feb 21 '21 15:02 stale[bot]