thanos
store: i/o timeouts with new memcached config
👋 Hello again and thanks for the quick response in https://github.com/thanos-io/thanos/issues/1974
Thanos version used: latest master with https://github.com/thanos-io/thanos/pull/1975
Object Storage Provider: AWS S3
What happened: I am testing the new memcached config for Thanos Store and I am seeing a lot of timeouts. The timeouts are gone when raising the timeout to 10s 🤕
The memcached servers look OK and the number of connections doesn't exceed `max_idle_connections`.
The Thanos Store process and the memcached servers are in the same AZ.
I tried the default config and also tweaked most of the parameters, but to no avail.
Any recommendations?
Arguments:

```
thanos store --index-cache.config-file=/app/cache-config.yaml --chunk-pool-size=6GB --objstore.config-file=/app/objstore-config.yaml
```
Config:

```yaml
type: MEMCACHED
config:
  addresses:
    - "****.0001.euw1.cache.amazonaws.com:11211"
    - "****.0002.euw1.cache.amazonaws.com:11211"
  timeout: 1s
  max_idle_connections: 500
  max_async_concurrency: 200
  max_async_buffer_size: 10000
  max_get_multi_concurrency: 200
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s
```
Logs:

```
level=warn ts=2020-01-10T11:57:31.215656814Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46YH1MSD2N9PZ7M839E0SS:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35492->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:31.215238251Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46YH1MSD2N9PZ7M839E0SS:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35490->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:31.076957889Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46FXKKFSNF3D9TVXB13G1H:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35496->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.831121025Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJS43Y6CEXAMVT0HZVC480E:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:34270->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.830995908Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJS43Y6CEXAMVT0HZVC480E:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:50162->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.754847072Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJSHHT4N25BRSMM27M3QH8K:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:50160->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.754715844Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJSHHT4N25BRSMM27M3QH8K:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:46124->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.148112488Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:34404->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.148818833Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:47196->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.108056739Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:48012->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.105953129Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:34276->10.42.8.43:11211: i/o timeout"
```
cc @pracucci @GiedriusS (:
Thanks @kamilhristov for the report. I see from your config that you're using AWS ElastiCache: which instance type are you using for ElastiCache?
Could you also show the following metrics graphed?

- `rate(thanos_memcached_operations_total[1m])`
- `histogram_quantile(1, rate(thanos_memcached_operation_duration_seconds_bucket[1m]))`
- `histogram_quantile(0.9, rate(thanos_memcached_operation_duration_seconds_bucket[1m]))`
- `sum by (item_type) (rate(thanos_store_index_cache_hits_total[1m]) / rate(thanos_store_index_cache_requests_total[1m]))`
Random thoughts:
- We run memcached inside the same Kubernetes cluster where Thanos is running (on GKE) and we see a 100th percentile latency <= 500ms for memcached operations
- A timeout when fetching items may be caused by a very large number of keys fetched within a single `getmulti` request, which you could limit by setting `max_get_multi_batch_size: 1000` (1k items per single request looks like a reasonable limit to me), but this wouldn't explain the timeouts when storing items (items are stored individually)
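To make the batching behaviour concrete, here is a minimal sketch (a hypothetical helper, not the actual Thanos client code) of how a `max_get_multi_batch_size` setting splits one large fetch into bounded `getmulti` requests, with `0` meaning unbatched:

```python
def split_get_multi_batches(keys, max_batch_size):
    """Split keys into getmulti batches of at most max_batch_size keys.

    A max_batch_size of 0 disables batching, mirroring the semantics of
    max_get_multi_batch_size: 0 in the Thanos memcached config.
    """
    if max_batch_size <= 0:
        return [keys] if keys else []
    return [keys[i:i + max_batch_size]
            for i in range(0, len(keys), max_batch_size)]

# With max_get_multi_batch_size: 1000, a 2500-key fetch becomes 3 requests.
batches = split_get_multi_batches([f"key-{i}" for i in range(2500)], 1000)
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Each batch then becomes one network round trip, so no single request has to serialize an unbounded response within the configured timeout.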
I am using the t3.medium instance type. It is a small one, but I thought it should be sufficient since the in-memory index cache size was 1GB.
Here's a screenshot of the graphs:
I can also test with m5.large or similar. Also happy to provide more info if needed.
So, from your graphs I can see the 90th percentile latency is good, while the 100th is not. In my experience with AWS, performance on t2/t3 instances is unpredictable, so as a next step I would suggest trying the m5 before reiterating on the root cause analysis. Could you give it a try, please?
Agree that t2/t3 could cause strange issues :)
Just tried with m5.large and there are still timeouts for both get and set commands.
Here is a screenshot with the new instance type.
![Screen Shot 2020-01-10 at 22 17 58](https://user-images.githubusercontent.com/7661748/72183747-6bfe7b80-33f7-11ea-8ac9-cc4589e7918e.png)
- Thanks for trying the m5
- Could you show the same charts but with `irate()` instead of `rate()`?
- Could you try increasing the max idle connections setting and setting `max_get_multi_batch_size: 1000`?
I've increased max idle connections to 1k and `max_get_multi_batch_size` to 1k.
Now I am getting:

```
level=error ts=2020-01-10T20:46:07.001594337Z caller=memcached.go:126 msg="failed to cache series in memcached" err="the async buffer is full"
```

Along with:

```
level=warn ts=2020-01-10T20:46:09.910966063Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.4:52078->10.42.8.159:11211: i/o timeout"
```
Current connections graph. The drop is after restarting Thanos Store with new parameters.
![Screen Shot 2020-01-10 at 22 52 48](https://user-images.githubusercontent.com/7661748/72185672-0d87cc00-33fc-11ea-8bda-4570b5275453.png)
And new screenshot with irates 📈
![Screen Shot 2020-01-10 at 22 54 10](https://user-images.githubusercontent.com/7661748/72185748-33ad6c00-33fc-11ea-9da7-98eaa2f1fc7a.png)
> level=error ts=2020-01-10T20:46:07.001594337Z caller=memcached.go:126 msg="failed to cache series in memcached" err="the async buffer is full"

This can easily be fixed by increasing `max_async_buffer_size`.
> level=warn ts=2020-01-10T20:46:09.910966063Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.4:52078->10.42.8.159:11211: i/o timeout"

I still have no clue about this. Do the memcached metrics show anything that catches your attention? Among the metrics directly exported by memcached you should see the items fetched (we batch reads on the Thanos side): what's the rate there?
Another hypothesis is that the set operations are saturating the fetches. Just for testing, could you try the following settings?

```yaml
max_async_concurrency: 10
max_async_buffer_size: 1000000
```
I've also asked about the memcached instance type. Could you also confirm the Thanos store is not running on a t2/t3 instance, please? I don't think it is, but just to confirm.
I've run some benchmarks in our cluster, using two tools:

- `memtier_benchmark` by RedisLabs
- `bench`: a quick and dirty benchmark tool based on the Thanos `MemcachedClient` (code here)
Both benchmarks run against the same memcached server (1 instance), running in the same Kubernetes cluster where the benchmarking tool is running.
The results:

```
$ memtier_benchmark --server=10.32.21.9 --port=11211 --protocol=memcache_text --requests=10000 --clients=200 --ratio=1:1 --data-size-range=100-200 --key-pattern=S:S --multi-key-get=100

4      Threads
200    Connections per thread
10000  Requests per client

ALL STATS
=========================================================================
Type      Ops/sec    Hits/sec   Misses/sec  Latency   KB/sec
-------------------------------------------------------------------------
Sets      77044.82   ---        ---         5.23000   13995.64
Gets      77044.82   77044.82   0.00        5.22700   15107.53
Waits     0.00       ---        ---         0.00000   ---
Totals    154089.65  77044.82   0.00        5.22900   29103.17

Request Latency Distribution
Type    <= msec    Percent
------------------------------------------------------------------------
[REDACTED]
SET     56.000     100.00
---
[REDACTED]
GET     56.000     100.00
---
```
```
$ ./bench
Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400
All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 5 ms max: 76 ms
GETMULTI ops: 2000000 ops/s: 36058.11037331169
Total time: 55.466023574s
```
Comparison on `GetMulti()`:

- Raw performance: 77044 ops/s, 5.2ms avg latency, 56ms max latency
- Benchmark based on the Thanos client: 36058 ops/s, 5ms avg latency, 76ms max latency

The performance of the Thanos client doesn't look bad to me. It's about 50% of the raw performance, which looks "reasonable" to me (it shows there's room for improvement, but not bad compared to a C++ tool designed to squeeze out performance with no business logic around it).
What's next
Other than the latest comments above:

- Could you run the same benchmark?
- Could you make sure your Thanos store doesn't hit its CPU limit? Insufficient CPU resources on the Thanos store are another possible cause of the i/o timeouts.
Sorry for the late response and thanks for spending time on this.

- After setting `max_async_concurrency`, `max_async_buffer_size` and all the other recommended settings, I noticed that there are no failing write commands; only reads are failing.
- Thanos Store is running on r5.large. Not a huge one, but it is perfectly utilized. Just in case, I bumped the instance to xlarge; now the CPU usage is around 50% but there are still timeouts.
- The benchmark results are not as good as yours but still seem OK for our scale.
r5.large:

```
4      Threads
200    Connections per thread
10000  Requests per client

ALL STATS
=========================================================================
Type      Ops/sec    Hits/sec   Misses/sec  Latency   KB/sec
-------------------------------------------------------------------------
Sets      28019.71   ---        ---         14.19600  5089.95
Gets      28019.71   28019.71   0.00        14.19200  5494.32
Waits     0.00       ---        ---         0.00000   ---
Totals    56039.43   28019.71   0.00        14.19400  10584.26

Request Latency Distribution
Type    <= msec    Percent
------------------------------------------------------------------------
SET     130.000    100.00
---
GET     140.000    100.00
```
```
Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400
All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 18 ms max: 190 ms
GETMULTI ops: 2000000 ops/s: 10675.525173500942
Total time: 3m7.344413272s
```
Then with r5.xlarge the max latency is similar, but we get more throughput.
```
Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400
All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 9 ms max: 219 ms
GETMULTI ops: 2000000 ops/s: 20164.66655510673
Total time: 1m39.183390637s
```
CloudWatch shows ~100,000,000 GET commands per minute when executing the benchmark, while Thanos Store generates ~25,000/min when hitting a dashboard.
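For a sense of scale, simple arithmetic on the two CloudWatch rates above shows the benchmark generated roughly 4000x the real dashboard traffic:

```python
# Rough load comparison from the CloudWatch numbers above.
benchmark_gets_per_min = 100_000_000  # GET commands/min during the benchmark
store_gets_per_min = 25_000           # GET commands/min from Thanos Store
print(benchmark_gets_per_min // store_gets_per_min)  # 4000
```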
It might be something in our infrastructure that is causing this. I will investigate more on Monday, but will most likely stick with the in-memory cache for now. It seems like a better option for our scale. Perhaps when our Thanos data grows bigger, it will make more sense to offload the index cache to an external service.
No need to invest more time on this, as it seems it is not caused by Thanos Store itself. Thanks again for helping 🙇
Thanks for all your help. I will continue to spend a bit more time on this myself, and I will get back to you in case I find out anything.
> but will most likely stick with the in-memory cache for now
Agree. The in-memory cache is the best option for the majority of the use cases.
I have played around with the new memcached support and my anecdotal evidence says that this happens when memcached cannot keep up because not enough memory has been allocated to it and/or the timeouts are too low. Increasing the memory size with `-m` helps, as does enabling large pages support with `-L`. I will go ahead and release 0.10.0, but we could definitely improve the documentation here with recommendations on how many resources are needed and suggestions for what values ought to be in the index cache configuration. Please let us know if you find a stable way to reproduce this.
Could this be caused by the item size? When `max_item_size` is 10MB, I am seeing *object too large* errors in Thanos Store:

```
level=warn ts=2020-01-13T08:23:13.964559815Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46FXKKFSNF3D9TVXB13G1H:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="memcache: unexpected response line from \"set\": \"SERVER_ERROR object too large for cache\\r\\n\""
```
I was able to reproduce the timeouts when executing the benchmark script with a larger data size range. Keep in mind that this doesn't even hit the limit, so the actual Thanos Store items are even bigger.
```
Clients: 200
Reqs / client: 1000
Keys / req: 100
Data size range: 150000 - 300000
Config MaxIdleConnections: 400
All clients started. Now waiting to complete.
level=warn msg="failed to fetch items from memcached" err="write tcp 10.42.8.248:61243->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61161->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61409->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:60915->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61263->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61367->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61253->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61347->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61185->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:60967->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61551->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61449->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61099->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61419->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="write tcp 10.42.8.248:61631->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61239->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61079->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61501->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61531->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61007->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61293->10.42.8.159:11211: i/o timeout"
```
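A back-of-the-envelope calculation (using the benchmark settings above; the throughput figure is an assumption) suggests why large values trigger the timeouts: each 100-key `getmulti` moves tens of megabytes, so a 1s socket timeout becomes very tight.

```python
# Estimate the payload of a single getmulti under the benchmark settings above.
keys_per_req = 100
avg_value_bytes = (150_000 + 300_000) // 2  # midpoint of the data size range
payload_bytes = keys_per_req * avg_value_bytes

print(payload_bytes / 1_000_000)  # 22.5 (MB per getmulti response)

# Assuming ~1 Gbit/s (~125 MB/s) of effective throughput, one response alone
# spends a noticeable fraction of a 1s timeout on the wire, and concurrent
# requests share that bandwidth.
print(round(payload_bytes / 125_000_000, 3))  # 0.18 (seconds on the wire)
```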
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
Facing the same issue. Any documentation or work done on this? @pracucci I partitioned my bucket by time into 3 stores, each using an r5.large memcached cluster (3 nodes) [I have put the memcached endpoint and all node endpoints in the store config], and I'm facing a lot of timeout and async issues.
@kuberkaul Could you check the tips in the following conversation: https://cloud-native.slack.com/archives/CL25937SP/p1585919142164500? You may find them useful.
Although I haven't tested it, I'm working on related configuration in https://github.com/thanos-io/kube-thanos/pull/106.
If you still think this is a valid bug, we can reopen the issue.
I talked offline to @kuberkaul and shared some tips. In parallel, I'm investigating a potential issue causing unfair prioritisation within the memcached client due to the usage of the gate. I will open a PR to fix it if confirmed (will work on it shortly).
Hi, also facing the same issue 😞 Any further update on this? I've been looking through some of the suggestions/tips from here and from the Slack channel, and have been tweaking memcached configurations, yet I'm still running into timeout or "failed to wait for turn: context canceled" issues while trying to fetch items from memcached. Any advice/update would be much appreciated, thanks!
> I'm still running into timeout or "failed to wait for turn: context canceled" issues
This log means that a query request has been cancelled. For example, if you open a Grafana dashboard and refresh it while panels are still loading, the previous queries are cancelled; if they're cancelled while waiting for a memcached connection slot, you get that error logged.
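The mechanics can be mimicked with a toy sketch (plain Python asyncio standing in for the Go context/gate machinery; all names are illustrative): a request cancelled while still queued for a concurrency slot surfaces as a "failed to wait for turn" style error rather than a fetch error.

```python
import asyncio

async def fetch_with_gate(gate: asyncio.Semaphore, name: str) -> str:
    try:
        async with gate:  # wait for our "turn" (a concurrency slot)
            await asyncio.sleep(0.2)  # simulate a memcached getmulti
            return f"{name}: ok"
    except asyncio.CancelledError:
        # The query was cancelled while still queued for a slot,
        # analogous to "failed to wait for turn: context canceled".
        return f"{name}: failed to wait for turn: cancelled"

async def main() -> list:
    gate = asyncio.Semaphore(1)  # only one in-flight fetch, for the demo
    first = asyncio.create_task(fetch_with_gate(gate, "panel-1"))
    second = asyncio.create_task(fetch_with_gate(gate, "panel-2"))
    await asyncio.sleep(0.05)   # second task is now queued behind first
    second.cancel()             # e.g. the user refreshed the dashboard
    return list(await asyncio.gather(first, second))

print(asyncio.run(main()))
```

The first query completes normally; the second never gets a slot and reports the cancellation, which is why the error appears under load or on dashboard refreshes.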
@pracucci Ah, that makes sense. Thanks for the explanation. I'm still running into the timeout errors though :/ any further developments or thoughts on that?
I'm also currently experiencing this issue while tuning Thanos Store components. Perhaps this issue can be re-opened?
Definitely, let's investigate. Help wanted.
The baseline that I started with:

```yaml
max_idle_connections: 750
max_async_concurrency: 100
max_async_buffer_size: 10000000
max_item_size: 16MiB
max_get_multi_concurrency: 900
max_get_multi_batch_size: 100
```
Then I increased `max_get_multi_batch_size` to 1000 and even to 10000.
What I learned was that, as long as you don't have a massive dashboard with, say, 12 panels and you go back 12 months in time, it does not really matter regarding the i/o timeouts. I simply do not get them with "sane" queries. Yet increasing `max_get_multi_batch_size` to 1000+ improved my query times a lot. Like, really much smoother results.
So the challenge here is: how to "limit" the stores basically DDoSing my memcached instances. So I figured I'd set the multi concurrency really low and the batch size quite high:

```yaml
max_get_multi_concurrency: 10
max_get_multi_batch_size: 10000
```
Basically there was no difference in query speed, and I still got i/o errors.
So I figured I'd go down this path:

```yaml
timeout: 10s
max_idle_connections: 1
max_async_concurrency: 1
max_async_buffer_size: 10000000
max_item_size: 16MiB
max_get_multi_concurrency: 1
max_get_multi_batch_size: 1000
dns_provider_update_interval: 10s
```
I see no difference, and I still get i/o errors when requesting a large dashboard with 30+ days of history (i.e. non-sane queries, basically).
From a user perspective, I get the feeling that the settings passed via `--index-cache.config-file` have no actual effect on the behaviour.
So, I have fixed this issue for myself, as it turned out to be an implementation error xD
After investigating, because the config (per the above) did not seem to make any difference, I noticed that it was NOT about the memcached index cache, but the bucket cache.
When I investigated this further, I noticed that I had made a mistake in its config. This was my config:
```yaml
config:
  addresses: [*redacted*]
  chunk_subrange_size: 16000
  max_chunks_get_range_requests: 3
  chunk_object_attrs_ttl: 24h
  chunk_subrange_ttl: 24h
  blocks_iter_ttl: 5m
  metafile_exists_ttl: 2h
  metafile_doesnt_exist_ttl: 15m
  metafile_content_ttl: 24h
  metafile_max_size: 16MiB
```
- I made the assumption that the bucket cache was not "that special" in comparison to the index cache
- I missed the line in the docs that stated: "config field for memcached supports all the same configuration as memcached for index cache"
So after fixing my bucket cache config file to the following, it was fixed:

```yaml
type: MEMCACHED
config:
  addresses: [*redacted*]
  max_idle_connections: 750
  max_async_concurrency: 100
  max_async_buffer_size: 10000000
  max_item_size: 16MiB
  max_get_multi_concurrency: 750
  max_get_multi_batch_size: 1000
chunk_subrange_size: 16000
max_chunks_get_range_requests: 3
chunk_object_attrs_ttl: 24h
chunk_subrange_ttl: 24h
blocks_iter_ttl: 5m
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h
metafile_max_size: 16MiB
```
To be honest, a few improvements could be made IMO:

- Update the docs at https://thanos.io/tip/components/store.md/#caching-bucket with an example of actually implementing the config for memcached
- Give better log info regarding which memcached cache is hit (is the log for the index cache or the bucket cache?)
- When doing batch requests, only the last error is output. Add a debug log for each error, rather than saving only the last.
I'm a bit tired, but will make issues/PR's tomorrow or something ;)
Created #3666, #3667 and #3668 :)
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the `remind` command if you wish to be reminded at some point in the future.