go-carbon
[BUG] newly created metrics not fetchable even with cache-scan turned on
Describe the bug
We have migrated our test environments from [grafana -> graphite-web] -> [graphite-web -> go-carbon] to [grafana -> carbonapi] -> [carbonserver -> go-carbon]. Existing and updated metrics are fetchable in a timely manner. However, newly created metrics are not visible until persisted to disk.
We have an internal tool that creates a new random metric once a minute and times how long it takes for the metric to become fetchable. The delay in our low-usage test environment is a few seconds, while the delay in our high-usage test environment is as long as 12 minutes.
Go-carbon Configuration:
# go-carbon's /etc/carbon/carbon.conf
[common]
user = "_graphite"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 6
[whisper]
data-dir = "/graphite/whisper"
schemas-file = "/etc/carbon/storage-schemas.conf"
aggregation-file = "/etc/carbon/storage-aggregation.conf"
workers = 4
max-updates-per-second = 1000
max-creates-per-second = 1000
sparse-create = true
flock = true
enabled = true
hash-filenames = false
[cache]
max-size = 5000000
write-strategy = "max"
[tcp]
listen = ":2003"
enabled = true
buffer-size = 5000000
[logging]
logger = ""
file = "/var/log/carbon/carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"
[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"
[receiver.http]
protocol = "http"
listen = ":2006"
[carbonserver]
# Please NOTE: carbonserver is not intended to fully replace graphite-web
# It acts as a "REMOTE_STORAGE" for graphite-web or carbonzipper/carbonapi
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "90s"
write-timeout = "90s"
query-cache-enabled = true
query-cache-size-mb = 1024
find-cache-enabled = true
trigram-index = false
scan-frequency = "5m0s"
cache-scan = true
max-globs = 100
fail-on-max-globs = false
graphite-web-10-strict-mode = false
internal-stats-dir = ""
# Calculate /render request time percentiles for the bucket, '95' means calculate 95th Percentile. To disable this feature, leave the list blank
stats-percentiles = [99, 98, 95, 75, 50]
# carbonapi configuration
listen: "0.0.0.0:8080"
cache:
  type: "mem"
  size_mb: 2000
  defaultTimeoutSec: 60
upstreams:
  graphite09compat: false
  buckets: 10
  keepAliveInterval: "30s"
  maxIdleConnsPerHost: 100
  timeouts:
    find: "10s"
    render: "30s"
    connect: "30s"
  backendsv2:
    backends:
      -
        groupName: "go_carbon"
        protocol: "carbonapi_v3_pb"
        lbMethod: "broadcast"
        maxTries: 3
        maxBatchSize: 0
        keepAliveInterval: "10s"
        maxIdleConnsPerHost: 1000
        timeouts:
          find: "10s"
          render: "30s"
          connect: "30s"
        servers:
          - "http://10.10.6.109:8080"
          - "http://10.10.6.197:8080"
          - "http://10.10.2.77:8080"
          - "http://10.10.2.24:8080"
Hi @jdblack
As far as I understand, cache-scan = true will work only with indices, so you need trigram-index = true and/or trie-index = true.
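In config terms that means pairing cache-scan with at least one index in the [carbonserver] section, for example (a minimal sketch, not a complete section):
[carbonserver]
# either index type (or both) should satisfy this
trie-index = true
cache-scan = true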
Thank you for the very quick response, @deniszh!
I have added trie-index = true and restarted go-carbon. Unfortunately I'm still seeing a substantial delay in the visibility of created metrics. My [carbonserver] section now reads thusly:
[carbonserver]
# Please NOTE: carbonserver is not intended to fully replace graphite-web
# It acts as a "REMOTE_STORAGE" for graphite-web or carbonzipper/carbonapi
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "90s"
write-timeout = "90s"
query-cache-enabled = true
query-cache-size-mb = 1024
find-cache-enabled = true
trie-index = true
scan-frequency = "5m0s"
cache-scan = true
max-globs = 100
fail-on-max-globs = false
graphite-web-10-strict-mode = false
internal-stats-dir = ""
# Calculate /render request time percentiles for the bucket, '95' means calculate 95th Percentile. To disable this feature, leave the list blank
stats-percentiles = [99, 98, 95, 75, 50]
carbon.create_delay, as mentioned in a previous comment, is a custom metric written by an in-house tool that creates a random metric and times how long it takes for the metric to become visible via a graphite query.
Hi, I think cache-scan is still influenced by scan-frequency. You might have to lower scan-frequency if you want new metrics to be available faster. Also, scanning the metrics file tree and building the index takes some time; there are logs/metrics from go-carbon telling how long they take.
It's just my understanding from reading the commit, I think @saipraneethyss should know more.
Another note: you can also curl go-carbon with a force-scan request to manually trigger scanning, which was also added by @saipraneethyss's MR.
curl 127.0.0.1:8080/forcescan
Off topic: I think I can add a new feature to my WIP concurrent trie index to make new metrics immediately available, without waiting for a scan to be triggered. Should be doable.
@bom-d-van :
I think cache-scan is still influenced by scan-frequency.
Looks like it, but I think @saipraneethyss should know, indeed. Also, in the case above scan-frequency is 5 min, but metrics are delayed by ~10 mins...
Off topic: I think I can add a new feature to my WIP concurrent trie index to make new metrics immediately available, without waiting for a scan to be triggered. Should be doable.
I would say that's not off-topic at all; that would be great!
@jdblack : just for clarity - did you try trigram-index = true also?
@deniszh if go-carbon receives significantly more new metrics than max-creates-per-second, they will be delayed. Also, judging by the value (1000), it looks like the underlying storage is rather slow, and carbon in general doesn't like slow storage.
And as far as I remember, with the trigram index carbonserver doesn't query the cache for metrics that do not have data on disk.
If that's the case, either carbonserver should query the cache for metrics that don't exist on disk yet, or it might be a better idea to switch to ClickHouse, for example.
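As a rough, hypothetical illustration of that throttle: if a scale-up event introduces 120,000 brand-new series at once and max-creates-per-second = 1000, creating the whisper files alone takes at least 120,000 / 1,000 = 120 s (about 2 minutes) before any index rescan can even see them on disk, and a storage layer that can't sustain the configured rate stretches this further.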
@Civil : you're describing the usual go-carbon behavior, and I think @jdblack is aware of it. But we're trying to understand why the possible fix, which was implemented in #338, is not working. I suspect that it works only with the trigram index enabled.
@deniszh oh, I missed that PR. Thanks.
I suspect that it works only with the trigram index enabled.
@deniszh it's a good point; from reading the code, it should work for both index types: https://github.com/go-graphite/go-carbon/blob/master/carbonserver/carbonserver.go#L624-L638 However, it's better to verify it by trying it out.
Just in case, @jdblack , are you using the latest release of go-carbon published by @deniszh some time ago? Do you have the version number or commit checksum?
Wow guys! I really appreciate all of you taking such a close look at this! I'd also be very happy to share a zoom session if any of you would like to get a closer view of how this cluster looks and works.
I switched over to trigram-index and I'm still seeing the same problem. I'm using the Debian version of go-carbon 0.15.0, md5sum 48406db2a1fbb7df73fbd8e5f44b437b. I have embedded my most recent carbon.conf at the bottom of this comment.
A 5-minute scan-frequency should only account for about 5 minutes of delay. In this environment, the fetchable-after-create delay is only 7 minutes, but in some of our production environments, the delay is as long as 40 minutes! As mentioned previously, we detect this delay with an in-house tool that creates a completely new metric and times how long it takes for the metric to become renderable.
Our filesystems are indeed struggling (we're using compression on ZFS on RAIDZ with criminally oversized EBS volumes to buy IOPS). I agree that there would be no observable problem if the filesystem had plenty of breathing room. I'd like to suggest that it's a fair expectation that go-carbon+carbonserver should be able to render as-yet-uncreated metrics that it has in its caches, even if they haven't been persisted yet.
In fact, existing metrics can be pulled from the cache in the expected way with no problem. It's just metrics that have never existed before, such as when an ASG scales up and creates new instances.
Thanks again for all the help!
[common]
user = "_graphite"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 2
[whisper]
data-dir = "/graphite/whisper"
schemas-file = "/etc/carbon/storage-schemas.conf"
aggregation-file = "/etc/carbon/storage-aggregation.conf"
workers = 2
max-updates-per-second = 500
max-creates-per-second = 500
sparse-create = true
flock = true
enabled = true
hash-filenames = false
[cache]
max-size = 5000000
write-strategy = "max"
[tcp]
listen = ":2003"
enabled = true
buffer-size = 5000000
[logging]
logger = ""
file = "/var/log/carbon/carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"
[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"
[receiver.http]
protocol = "http"
listen = ":2006"
[carbonserver]
# Please NOTE: carbonserver is not intended to fully replace graphite-web
# It acts as a "REMOTE_STORAGE" for graphite-web or carbonzipper/carbonapi
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "90s"
write-timeout = "90s"
query-cache-enabled = true
query-cache-size-mb = 1024
find-cache-enabled = true
trigram-index = true
scan-frequency = "5m0s"
cache-scan = true
max-globs = 100
fail-on-max-globs = false
graphite-web-10-strict-mode = false
internal-stats-dir = ""
# Calculate /render request time percentiles for the bucket, '95' means calculate 95th Percentile. To disable this feature, leave the list blank
stats-percentiles = [99, 98, 95, 75, 50]
I also seem to be having a problem since switching to go-carbon 0.15 and enabling carbonserver and the trie options: load can temporarily shoot up from around 5 to over 50 for no explicable reason. Could that be the cache-scan option cruising the entire filesystem?
I'm starting to wonder if a lot of our pain and misery is coming from using ZFS for our backing store.
A 5-minute scan-frequency should only account for about 5 minutes of delay. In this environment, the fetchable-after-create delay is only 7 minutes, but in some of our production environments, the delay is as long as 40 minutes!
@jdblack By your description, it seems to be an issue with those servers? Maybe you can check if there is anything wrong with them.
It isn't that bad to have 7 mins of delay when using a scan frequency of 5 mins. How many metrics do you have?
You can find some metrics exported by go-carbon on indexing and identify the issues on those servers:
carbon.agents.{host}.metrics_known
carbon.agents.{host}.index_build_time_ns
carbon.agents.{host}.file_scan_time_ns
carbon.agents.{host} is the default prefix. These metrics are also logged in /var/log/carbon/carbon.log; you should be able to grep for them.
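For example, something along these lines should surface them (the log path is taken from the [logging] section above; the exact field names in the log output may differ between go-carbon versions):
grep -E 'metrics_known|index_build_time_ns|file_scan_time_ns' /var/log/carbon/carbon.log | tail -n 20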
I also seem to be having a problem since switching to go-carbon 0.15 and enabling carbonserver and the trie options: load can temporarily shoot up from around 5 to over 50 for no explicable reason. Could that be the cache-scan option cruising the entire filesystem?
That's new to me; in our case, using the trie index has lower CPU usage, especially when dealing with a high number of metrics. Both the trigram and trie indexes scan the filesystem.
I'm starting to wonder if a lot of our pain and misery is coming from using ZFS for our backing store.
Our Graphite team is using XFS.
Yes, a delay of 5-7 minutes would be fine. The environment that we're testing in has a delay of ~ 7-11 minutes. The problem that we're really trying to address is a delay of 40 minutes in our big production environments!
Have you tried running a find command in your production environment? How many metrics do you have there? I think you might have an issue at the filesystem or hardware level.
time find /graphite/whisper -type f > /dev/null
Yeah, I suspect those load spikes are coming from ZFS whenever the filesystem is fully scanned.
One thing I'm unclear about: why are the tries required to scan the filesystem when it's already known that the wsp files don't yet exist on disk?
One thing I'm unclear about: why are the tries required to scan the filesystem when it's already known that the wsp files don't yet exist on disk?
That's the way indexing works currently in go-carbon, for both trigram and trie. The program scans the whole data-dir every scan-frequency to figure out what the new metrics are and caches them in memory.
I would also suggest running sudo perf top to see what the CPU is doing.
I'm very sorry, I must be misunderstanding. Our problem is that we're not able to find/render metric series that do not yet exist on disk; in other words, there is no whisper file at all yet. How does scanning the filesystem for a file that we know doesn't exist help make them renderable?
To summarize:
For go-carbon, usually you can only query metrics that are already created on disk.
But with cache-scan, you can query metrics that haven't been created on disk yet; however, it only works when the trigram or trie index is enabled.
So to solve your problem of having newly created metrics available faster, you have to either make sure your go-carbon instance flushes the metrics cache to disk faster or use cache-scan with the trie or trigram index, which requires scanning your filesystem.
How does scanning the filesystem for a file that we know doesn't exist help make them renderable?
cache-scan has a feature of returning metrics that are still in the cache, even though they haven't been created on disk yet.
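As a sketch of those two options against the config posted earlier (values are purely illustrative): either raise the write throughput so the cache drains sooner, assuming the storage can sustain it, e.g.
[whisper]
max-updates-per-second = 1500
max-creates-per-second = 1500
or keep the cache-scan route, which needs an index:
[carbonserver]
trie-index = true
cache-scan = true
scan-frequency = "5m0s"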
Ahh, but I have found that metric series that do not exist yet (as in no .wsp file yet) are -not- renderable even if cache-scan and trie-index / trigram-index are turned on.
@jdblack : v0.15.3 was released, including the concurrent trie index. You can try whether it works for new metrics with
trie-index = true
concurrent-index = true
realtime-index = 100
cache-scan = false
If it works then we probably need to remove cache-scan code, because it's not really working.
@bom-d-van : what's a reasonable value for realtime-index? 10, 100, 1000, more?
what's a reasonable value for realtime-index? 10, 100, 1000, more?
Hi, I think a few thousand should be good enough, unless the go-carbon instance receives new metrics at a very high frequency. The reason it's a buffer is to be able to deal with bursts of new metrics and have them available.
remove cache-scan code, because it's not really working.
Oh, for this: realtime-index reuses some of the cache-scan logic, and realtime-index only supports trie-index at the moment. So we might still have to keep it.
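Putting those two answers together, the suggested fragment from the 0.15.3 comment above might end up looking something like this (the buffer value is only an example of "a few thousand", not a tested recommendation):
[carbonserver]
trie-index = true
concurrent-index = true
realtime-index = 2000
cache-scan = false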
Hi all! Sorry for missing out on the earlier discussion. @bom-d-van, thanks for pitching in with your valuable responses.
remove cache-scan code, because it's not really working.
We have been running the go-carbon instance with the cache-scan code on our production machine for close to 6 months now, where we handle more than 1 million metrics. With this change we were able to bring down our query-response time for new metrics from more than 30 minutes to less than 5 minutes (5m0s is our scan-frequency). And so far, this has been working for us fairly well.
And so far, this has been working for us fairly well.
@saipraneethyss : I can only 🤷 here, because according to @jdblack it is not working. But it's quite hard to reproduce and test, indeed.
I'll give 0.15.3 a try!
Some quick updates:
- Switching from ZFS to XFS brought -drastic- improvements in timings
- We see a sumSeries(perSecond(carbon.relays.*.metricsReceived)) of about 2.3 million in our largest environment. Does this mean our cluster sees 2.3 million data points per second?
Our largest environment has two graphite-cache clusters: a shiny new cluster of 36 instances on XFS, and an old cluster of 36 instances on ZFS that is now read-only while the XFS cluster fills. The new cluster is about 25 days old, while the old cluster goes back about a year.
- Our new XFS-based cluster has 1.19 million wsp files on each cache, and a find /graphite/whisper -type f takes 1m54s on average.
- The old (now read-only) ZFS cluster has 3.99M wsp files on each cache, and the same find usually takes 7m2s.
We ended up having to disable cache-scan, as walking the entire filesystem every few minutes was causing lookups against our cluster to collapse; the recurring scan just sucked away a big chunk of our IOPS.
I think our problem is that the metrics are sitting in go-carbon's input cache, totally hidden until they are processed by go-carbon.
When our developers scale up new instances, those new instances push out new metrics under a new namespace of servicename.${IP}.rest.of.the.namespace. These brand-new metrics are not renderable until after they have been persisted to disk. This differs from new data points for existing metrics, which become renderable -immediately-, even though the data points are still sitting in the memory cache.
To say this again: we have no problem looking up fresh data for wsp files that have been around a long time, but we can't look up metrics in carbonserver for which the wsp file does not yet exist. It's as if go-carbon requires that the wsp file at least be created before a metric is renderable from the cache.
I suspect that our pain comes from the metrics sitting unprocessed in go-carbon's memory cache. Is it possible that there is a bug in which go-carbon is able to render new data points for long-lived metrics from the cache, but data points for entirely new metrics (as in no wsp file previously existed) are not available until after they have been persisted at least once?
Switching from ZFS to XFS gave us a -huge- boost in metric write speed, allowing us to raise max-updates-per-second from 500 to 1500 (with IOPS left over for lookups!). This cut the delay to find new metrics from 30-40 minutes down to 9 minutes. The lower we get carbon.agents.*.cache.size, the lower the delay before new metrics become renderable, which totally makes sense.
I upgraded our dev clusters to v0.15.5 and I'm still seeing the same effect: newly created metrics are not renderable. This problem is easy to reproduce: just set max-updates-per-second to a value low enough to push up the points-per-update. I performed some tests with one of our smaller test clusters, which only gets ~25K metrics/sec.
Adjusting max-updates-per-second had a direct effect on the amount of time it takes for newly created metrics to become renderable. New data points for existing metrics were always immediately available. In other words, we never saw lag in our tests for total relay metrics sent and received, or for metrics received per cache. We only saw lag on "Graphite Create Delays", which comes from an internal tool that creates a metric named carbon.create_delay.$RANDOM (with the timestamp as its value) every few seconds and measures how long it takes to become renderable.
Timings from our test:
- max-updates-per-second: 450, points per update: 2, delay for novel metrics: 4.5 min
- max-updates-per-second: 50, points per update: 17, delay for novel metrics: 30-50 min
- max-updates-per-second: 1000, points per update: 1, delay for novel metrics: 1.25 min
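For reference, a minimal sketch of that kind of probe (the real in-house tool isn't shown in this thread; the host, the plaintext receiver on :2003 and the render endpoint on :8080 are assumptions taken from the configs above, and the nc flags may vary per netcat flavor):
name="carbon.create_delay.$RANDOM"
start=$(date +%s)
echo "$name 1 $start" | nc -w 1 127.0.0.1 2003
until curl -sf "http://127.0.0.1:8080/render?target=$name&format=json" | grep -q "$name"; do sleep 5; done
echo "$name became renderable after $(( $(date +%s) - start ))s"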
Your scan-frequency is "0", try going with "5m".
Are you asking me to set a scan-frequency of "5m" with cache-scan set to disabled?
We had a scan-frequency of 5m0s when cache-scan was turned on (you can see that in my September 28 comment), but it caused our production clusters to fall over: it took 7 minutes to perform a filesystem scan, starving the filesystem of the IOPS it needed to perform reads and writes. There are more details in a previous comment (starting with "I'll give 0.15.3 a try!").