go-carbon
[Q] What to fine-tune to decrease disk workload?
I'm using the latest go-graphite go-carbon v0.15.5, and I recently both enabled trie-index and switched the CPU to the GCP N2 family.
At first glance, it looks like the server workload (CPU, memory) has roughly halved and traffic has not changed, but the SSD disk has become the bottleneck in terms of write IOPS.
What should I fine-tune to decrease the disk workload? Could it be related to a misconfiguration of the stats retentions?
Some details:
Machine: n2-highcpu-16 (16 vCPUs, 16 GB memory), 1TB SSD disk
At peak time go-carbon handles ~16 qps
On average only 0.4 metrics per second are created.
Here is the diff between my config and go-carbon.conf.example:
[common]
max-cpu = 16
[cache]
max-size = 10000000
write-strategy = "noop"
[carbonserver]
enabled = true
query-cache-enabled = false
find-cache-enabled = false
trie-index = true
file-list-cache = ""
concurrent-index = true
realtime-index = 100
[[logging]]
level = "warn"
[[logging]]
level = "warn"
This is the full storage-schemas.conf content:
[carbon]
pattern = ^carbon\.
retentions = 5m:90d
compressed = false
[stats]
pattern = ^stats.*
retentions = 10s:1h,60s:1d,10m:30d,1h:90d,24h:1y
compressed = false
This is the full storage-aggregation.conf content:
[min]
pattern = \.lower$
xFilesFactor = 0.1
aggregationMethod = min
[max]
pattern = \.upper(_\d+)?$
xFilesFactor = 0.1
aggregationMethod = max
[sum]
pattern = \.sum$
xFilesFactor = 0
aggregationMethod = sum
[count]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
[count_legacy]
pattern = ^stats_counts.*
xFilesFactor = 0
aggregationMethod = sum
[default_average]
pattern = .*
xFilesFactor = 0.3
aggregationMethod = average
Some system-level graphs:
A few graphite-level graphs:
Hi @ritmas ,
Theoretically, trie-index should not affect write performance, and the change looks quite dramatic. Maybe something else was also changed?
Also, please note that read querying is not relevant to the amount of writes (the reverse is not true, but that's a different story). Only the number of new and existing metrics is relevant to the write load.
So, the default setup has no limit on disk load:
[whisper]
# Limits the number of whisper update_many() calls per second. 0 - no limit
max-updates-per-second = 0
# Softly limits the number of whisper files that get created each second. 0 - no limit
max-creates-per-second = 0
You can try to limit disk usage and increase the cache, e.g.:
[cache]
max-size = 100000000
[whisper]
max-updates-per-second = 200000
max-creates-per-second = 100
Check e.g. https://cloud.google.com/compute/docs/disks/performance for the IOPS limits, but please also note that the metric in the graphs above is not IOPS but update operations.
Yeah, as @deniszh said, there were no intended write-path changes for standard whisper files. And it looks like the increased IO usage is due to faster updates and creates of metrics.
But arguably it might be a good thing, as your data now hits disk faster. Are you checking the cache.queueWriteoutTime metric? Hopefully it should be lower as a result.
(also, it's nice to see memory and CPU usage lowered with the trie stuff; that keeps us motivated, thanks)
Hmm, am I misunderstanding something? The disk IO requests look lower, or largely unchanged, in your screenshot?
@deniszh as far as I remember I tried max-updates-per-second some time ago without any noticeable changes, but I will give it a shot again with your suggested config and leave it in place for at least 24 hours.
I'm aware that the GCP disk IOPS limit depends on the CPU count; that's why I'm interested in decreasing the disk workload with go-carbon fine-tuning (if any is available) in the first place. The last resort will be increasing VM resources.
@bom-d-van yes, indeed, cache.queueWriteoutTime has decreased, generally speaking. It has increased/spiked again lately, but it might be related to new metrics (tests, natural traffic), not sure to be honest.
Regarding IO quantity, on average it has decreased a bit. Before enabling trie-index it was roughly 17k and now about 15k, but the major part of it has shifted towards write operations. As you pointed out with the decreased cache writeout time, it does make sense that write operations have increased a bit.
Adding a bigger graph:
Just speculation: maybe the memory freed by go-carbon was repurposed by the Linux page cache, leading to fewer disk reads and faster/higher disk writes. In your graph, the cache memory did increase.
I get your IO graph now. I didn't notice that the write IO metric was plotted as negative values.
@deniszh, it seems the SSD IO request rate went back to its previous state (or even better), which is good. Now, can we take it to "awesome"? :)
What I still don't fully understand is how to calculate the cache max-size:
[cache]
# Limit of in-memory stored points (not metrics)
max-size = 100000000
The config changes you suggested did not increase memory usage, so it seems possible to increase max-size even more. But where is the limit? What formula should I apply?
By the way, I wanted to note that I'm using v0.15.5, which is a specific tag, not the master branch. I saw a few fixes recently, but I'm not sure whether they could help me in this case?
Hi @ritmas,
I do not see many fixes, just one, which I just merged into master. I'm going to release 0.15.6, but I do not think it will affect the issue above.
What I still don't fully understand is how to calculate the cache max-size:
That's the hard part, indeed. I'm afraid it can only be calculated empirically, i.e. with trial and error. The math behind it is not complicated - it's the number of datapoints held in cache. If you know your interval and the number of metrics, you can calculate how many datapoints will accumulate in the cache before your disk flushes them. E.g. 1M metrics coming in at a 10-second resolution means roughly 100 000 datapoints per second, so a 10M cache will hold approximately 10 000 000 / 100 000 = 100 seconds (about 1.5 minutes) of incoming flow, which is not much. (If your resolution is 1 minute, then a 10M cache gives you about 10 minutes.) You can safely increase it 10x - if you have enough RAM. Check how much memory the go-carbon process consumes and assume that all of that memory is the metric cache (which is not true of course, only for approximation purposes).
OTOH it's very hard to say how much IOPS gain you will get from that cache increase - that's why some empirical testing is needed.
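To make that back-of-the-envelope calculation repeatable, here is a minimal sketch (illustrative only, not part of go-carbon; the metric count and resolution are the example values from the comment above):

package main

import "fmt"

// cacheBufferSeconds estimates how many seconds of incoming flow a cache of
// maxSize datapoints can absorb if the disk stops keeping up.
// metrics is the number of active series, resolution is the seconds between points.
func cacheBufferSeconds(maxSize, metrics, resolution float64) float64 {
	pointsPerSecond := metrics / resolution
	return maxSize / pointsPerSecond
}

func main() {
	// 1M metrics at 10-second resolution with max-size = 10 000 000
	fmt.Printf("%.0f seconds\n", cacheBufferSeconds(10e6, 1e6, 10)) // ~100 seconds
	// the same cache at 1-minute resolution
	fmt.Printf("%.0f seconds\n", cacheBufferSeconds(10e6, 1e6, 60)) // ~600 seconds (10 minutes)
}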
The config changes you suggested did not increase memory usage
You need to look at how much memory the go-carbon process itself consumes. I think the majority of "cached" is the hot set of whisper files in the Linux page cache, as @bom-d-van suggested.
I do not see many fixes, just one, which I just merged into master.
Yeah, I meant changes in general, I guess.
You can safely increase it 10x - if you have enough RAM. Check how much memory go-carbon process consumes <...>
Based on the RSS column (if that's the correct one), go-carbon uses about 4 GB of memory, so I'll go with x5 first:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 23789 102 25.1 6962628 4087748 ? Dsl Jan21 4098:35 /data/go-carbon/go-carbon -config /etc/go-carbon/go-carbon.conf
In addition to this, @deniszh, could you give more info on max-updates-per-second as well?
# Limits the number of whisper update_many() calls per second. 0 - no limit
max-updates-per-second = 200000
How do the internals work? Are these changes queued in memory for a specific period of time? If this value is decreased by 10x, should one expect fewer write operations to the disk?
Based on the RSS column (if that's the correct one), go-carbon uses about 4 GB of memory, so I'll go with x5 first
Yes, 5 times looks better; it's better to increase it gradually.
How do the internals work? Are these changes queued in memory for a specific period of time? If this value is decreased by 10x, should one expect fewer write operations to the disk?
Long story short - it's complicated. I just took 200K from your graph above, which said that the limit is 500K. It's just the number of calls to the store() function which writes the data, and it's controlled by a throttler. Also, it looks like the limit is per worker, so 200K probably has no effect; for 8 workers it should be around 500K / 8 ~ 60000 (?). I mean, it looks like the effect above came only from increasing the cache size from 1M to 10M. If your writer is throttled, then your cache will grow, and with such a small cache vs. the incoming flow (10M points vs 60M points/min) it can overflow very fast. So that part also needs to be tuned carefully. Usually people just put their IO subsystem limit there and play with the cache size.
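Just for intuition, here is a simplified sketch of how such a per-worker throttle can work. This is an illustration under my own assumptions, not go-carbon's actual persister code, and store() is a hypothetical stand-in for the whisper update call:

package main

import (
	"fmt"
	"time"
)

// newThrottle returns a channel that hands out at most ratePerSecond tokens per
// second; a worker takes one token before each store() call, so each worker is
// capped at roughly ratePerSecond updates per second.
func newThrottle(ratePerSecond int) <-chan struct{} {
	tokens := make(chan struct{}, ratePerSecond)
	fill := func() {
		for i := 0; i < ratePerSecond; i++ {
			select {
			case tokens <- struct{}{}:
			default: // bucket already full
			}
		}
	}
	fill() // initial budget
	go func() {
		for range time.Tick(time.Second) {
			fill() // refill once per second
		}
	}()
	return tokens
}

// store is a hypothetical stand-in for flushing one metric's cached points to its whisper file.
func store(metric string) { fmt.Println("flushed", metric) }

func main() {
	limiter := newThrottle(5000) // per-worker limit, like max-updates-per-second = 5000
	for _, m := range []string{"stats.gauges.foo", "stats.counters.bar"} {
		<-limiter // throttling point: blocks once the per-second budget is exhausted
		store(m)
	}
}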
I made a few changes/tests lately, but I cannot see any change in terms of disk IO or CPU/memory. I also double-checked with the GCP graphs. @deniszh what am I doing wrong?
2021-01-25 09:00
[whisper]
max-updates-per-second = 60000
[cache]
max-size = 500000000
2021-01-26 10:30
[whisper]
max-updates-per-second = 30000
2021-01-27 9:30
[whisper]
max-updates-per-second = 10000
2021-01-28 12:50
[whisper]
max-updates-per-second = 5000
Were there any changes in your graph "update operations vs creates & cpu" for go-carbon?
@bom-d-van there is nothing significant that would catch the eye
Is the blue line "Update Operations"? It does drop significantly on the 29th (from 600k to 300k) when you lowered the limit to 5000. You can try pushing it further, but do keep an eye on the cache usage and cache.queueWriteoutTime, and avoid unnecessary drops.
Yes @bom-d-van it is, and it does look throttled, but not at the rate I expected. As @deniszh mentioned, the limit is supposed to be per worker, so I expected a ~40k limit: 8 workers * 5000 max-updates-per-second = 40k
So I pushed it a bit further:
2021-01-29 14:10
[whisper]
max-updates-per-second = 1000
2021-01-29 15:25
[whisper]
max-updates-per-second = 500
So, judging from the graphs, throttling works (even if the formula is inaccurate), but actual disk utilization is hitting its limits.
# iostat -xd 1 /dev/sdb
Linux 3.10.0-1160.11.1.el7.x86_64 (<redacted>) 02/23/2021 _x86_64_ (16 CPU)
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 60.00 0.00 1386.00 23284.00 40824.00 104164.00 11.75 144.81 5.87 5.67 5.88 0.04 99.90
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 53.00 0.00 1338.00 23342.00 38964.00 106224.00 11.77 146.78 5.95 5.80 5.95 0.04 100.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 50.00 0.00 1321.00 23358.00 39328.00 106396.00 11.81 143.08 5.79 5.59 5.81 0.04 99.50
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 143.00 1187.00 3487.00 18095.00 99444.00 217952.00 29.41 95.46 4.45 1.91 4.94 0.05 99.20
GCP graphs show a lower mean/average IOPS quantity, but the write peaks/spikes remain constant.
What else can I try?
I think you can also try lowering max-creates-per-second; it looks like it sometimes peaks at 500, and depending on the retention policy, if it's 1MB per whisper file, that could be 500MB.
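To get a rough feel for how much write volume each created file adds, here is a small sketch that estimates the size of an uncompressed whisper file from its retention schedule, assuming the standard whisper layout (16-byte header, 12 bytes per archive header, 12 bytes per datapoint); the archive point counts below are derived from the [stats] retention shown earlier:

package main

import "fmt"

// whisperFileSize returns the approximate on-disk size (in bytes) of an
// uncompressed whisper file, given the number of datapoints in each archive.
func whisperFileSize(archivePoints []int) int {
	size := 16 + 12*len(archivePoints) // metadata header + per-archive headers
	for _, points := range archivePoints {
		size += 12 * points // 4-byte timestamp + 8-byte float64 per point
	}
	return size
}

func main() {
	// retentions = 10s:1h,60s:1d,10m:30d,1h:90d,24h:1y
	stats := []int{360, 1440, 4320, 2160, 365}
	fmt.Printf("~%d KB per file\n", whisperFileSize(stats)/1024) // roughly 100 KB
}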
Just in case: with a lower flush threshold on your server, the in-memory cache will be bigger, so you want to pay attention to the cache.queueWriteoutTime, cache.overflow and cache.size metrics from go-carbon.
Other alternatives could be:
- scaling your go-carbon cluster horizontally, by adding a few more machines to it
- scaling your go-carbon instance vertically, by moving to a machine with a more powerful disk
- migrating to compressed whisper
Compressed whisper has much better disk performance due to compression and less IO, but you would lose the capability of out-of-order updates and history rewrites. If that isn't an issue for your cluster, I would recommend trying it out.
@ritmas
What else can I try?
TBH I still don't really get what you are trying to achieve and why high IOPS is bad for you if everything works fine. But if it's really an issue, you can try migrating to a carbon-clickhouse / graphite-clickhouse setup. It will give you much lower IOPS, but you'll need to set up and manage ClickHouse. See the TLDR repo if interested - https://github.com/lomik/graphite-clickhouse-tldr
@bom-d-van is compressed whisper applicable to existing metrics/data, or does it need to be enabled on a brand-new setup? Also, is this feature reversible?
@deniszh the initial assumption was that my go-carbon setup is under a huge workload and that not all of the data/traffic is being handled properly, as the disk seems over-utilized. That's why I reached out to you about configuration options. Increasing the go-carbon cache max-size and whisper max-updates-per-second helped to lower write IOPS, but this throttling thing is still a mystery to me, as actual iostat still shows a higher IO rate than expected.
Anyway, if cache.overflow shows a zero value, it seems I'm not losing any metrics (#402) due to the high IOPS rate or the cache size limit after all.
PS - clickhouse looks interesting
is compressed whisper applicable to existing metrics/data or does it need to be enabled on a brand-new setup?
Migration is needed. You have two approaches:
- Use the convert program in go-whisper to migrate existing whisper files to the compressed format
- Create a new go-carbon cluster with whisper.compressed = true and use buckytools to sync data from the uncompressed cluster
Option 2 is probably better: in case things go wrong, you won't lose any data.
Option 1 is good if you already have two or more clusters; also, the initial compression ratio is better when using the convert program.
You can enable compression for the whole go-carbon instance/cluster by default with:
[whisper]
compressed = true
You can also enable compression for certain types of metrics using pattern matching: https://github.com/go-graphite/go-carbon/blob/master/deploy/storage-schemas.conf#L9
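For example, a per-pattern override in storage-schemas.conf could look like this (the section name and regex below are just an illustration; the relevant part is the compressed flag):
[stats_compressed]
pattern = ^stats\.compression_test\.
retentions = 10s:1h,60s:1d,10m:30d,1h:90d,24h:1y
compressed = true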
Also, is this feature reversible?
Not really. So it's better to test it out first: duplicate your data in two clusters and enable compression on one of them before you decide whether the feature works better for you.
Quick update on iostat, and especially its %util column, if it's relevant for anyone - it cannot be trusted when it comes to SSD/NVMe devices.
%util Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
A few resources on this topic:
- https://coderwall.com/p/utc42q/understanding-iostat
- https://www.xaprb.com/blog/2010/01/09/how-linux-iostat-computes-its-results/
Actual disk limitations should be calculated in a different way. For the time being I personally rely on the tps | kB_read/s | kB_wrtn/s values, which are IOPS and throughput respectively.
Based on the GCP disk performance calculations, the limits of a 1000 GB zonal SSD (with an N2 CPU) are 25k IOPS and 1200 MB/s, which is higher than the actual usage:
# iostat -d sdb 1 2
Linux 3.10.0-1160.11.1.el7.x86_64 (<redacted>) 08/06/2021 _x86_64_ (16 CPU)
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdb 5784.11 81305.53 15128.40 1158379925681 215538064508
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdb 5075.00 73492.00 7712.00 73492 7712
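For illustration, a quick comparison of the measured values above against the stated limits (a rough sketch; the constants are simply copied from the iostat output and the 25k IOPS / 1200 MB/s figures above):

package main

import "fmt"

func main() {
	const (
		limitIOPS   = 25000.0  // GCP zonal SSD IOPS limit quoted above
		limitMBps   = 1200.0   // GCP throughput limit quoted above
		measuredTPS = 5784.11  // iostat tps
		readKBps    = 81305.53 // iostat kB_read/s
		writtenKBps = 15128.40 // iostat kB_wrtn/s
	)
	throughputMBps := (readKBps + writtenKBps) / 1024
	fmt.Printf("IOPS: %.0f of %.0f (%.0f%%)\n", measuredTPS, limitIOPS, 100*measuredTPS/limitIOPS)
	fmt.Printf("Throughput: %.0f MB/s of %.0f MB/s (%.0f%%)\n", throughputMBps, limitMBps, 100*throughputMBps/limitMBps)
}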