go-carbon
[BUG] newly created metrics not fetchable even with cache-scan turned on
Describe the bug
We have migrated our test environments from [grafana -> graphite-web] -> [graphite-web -> go-carbon] to [grafana -> carbonapi] -> [carbonserver -> go-carbon]. Existing and updated metrics are fetchable in a timely manner. However, newly created metrics are not visible until persisted to disk.
We have an internal tool that creates a new random metric once a minute and times how long it takes for the metric to become fetchable. The delay in our low-usage test environment is a few seconds, while the delay in our high-usage test environment is as long as 12 minutes.
Go-carbon Configuration:
# go-carbon's /etc/carbon/carbon.conf
[common]
user = "_graphite"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 6
[whisper]
data-dir = "/graphite/whisper"
schemas-file = "/etc/carbon/storage-schemas.conf"
aggregation-file = "/etc/carbon/storage-aggregation.conf"
workers = 4
max-updates-per-second = 1000
max-creates-per-second = 1000
sparse-create = true
flock = true
enabled = true
hash-filenames = false
[cache]
max-size = 5000000
write-strategy = "max"
[tcp]
listen = ":2003"
enabled = true
buffer-size = 5000000
[logging]
logger = ""
file = "/var/log/carbon/carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"
[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"
[receiver.http]
protocol = "http"
listen = ":2006"
[carbonserver]
# Please NOTE: carbonserver is not intended to fully replace graphite-web
# It acts as a "REMOTE_STORAGE" for graphite-web or carbonzipper/carbonapi
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "90s"
write-timeout = "90s"
query-cache-enabled = true
query-cache-size-mb = 1024
find-cache-enabled = true
trigram-index = false
scan-frequency = "5m0s"
cache-scan = true
max-globs = 100
fail-on-max-globs = false
graphite-web-10-strict-mode = false
internal-stats-dir = ""
# Calculate /render request time percentiles for the bucket, '95' means calculate 95th Percentile. To disable this feature, leave the list blank
stats-percentiles = [99, 98, 95, 75, 50]
# carbonapi configuration
listen: "0.0.0.0:8080"
cache:
  type: "mem"
  size_mb: 2000
  defaultTimeoutSec: 60
upstreams:
  graphite09compat: false
  buckets: 10
  keepAliveInterval: "30s"
  maxIdleConnsPerHost: 100
  timeouts:
    find: "10s"
    render: "30s"
    connect: "30s"
  backendsv2:
    backends:
      -
        groupName: "go_carbon"
        protocol: "carbonapi_v3_pb"
        lbMethod: "broadcast"
        maxTries: 3
        maxBatchSize: 0
        keepAliveInterval: "10s"
        maxIdleConnsPerHost: 1000
        timeouts:
          find: "10s"
          render: "30s"
          connect: "30s"
        servers:
          - "http://10.10.6.109:8080"
          - "http://10.10.6.197:8080"
          - "http://10.10.2.77:8080"
          - "http://10.10.2.24:8080"
Hi @jdblack
As far as I understand, cache-scan = true will work only with indices, so you need trigram-index = true and/or trie-index = true.
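In config terms that means pairing cache-scan with at least one index in the [carbonserver] section, for example (a minimal sketch, not a complete section):
[carbonserver]
# either index type (or both) should satisfy this
trie-index = true
cache-scan = true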
Thank you for the very quick response, @deniszh!
I have added trie-index = true and restarted go-carbon. Unfortunately I'm still seeing a substantial delay in the visibility of created metrics. My [carbonserver] section now reads thusly:
[carbonserver]
# Please NOTE: carbonserver is not intended to fully replace graphite-web
# It acts as a "REMOTE_STORAGE" for graphite-web or carbonzipper/carbonapi
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "90s"
write-timeout = "90s"
query-cache-enabled = true
query-cache-size-mb = 1024
find-cache-enabled = true
trie-index = true
scan-frequency = "5m0s"
cache-scan = true
max-globs = 100
fail-on-max-globs = false
graphite-web-10-strict-mode = false
internal-stats-dir = ""
# Calculate /render request time percentiles for the bucket, '95' means calculate 95th Percentile. To disable this feature, leave the list blank
stats-percentiles = [99, 98, 95, 75, 50]
carbon.create_delay, as mentioned in a previous comment, is a custom metric written by an in-house tool that creates a random metric and times how long it takes for the metric to become visible via a graphite query.
Hi, I think cache-scan is still influenced by scan-frequency. You might have to lower scan-frequency if you want new metrics to be available faster. Also, scanning the metrics file tree and building the index takes some time; there are logs/metrics from go-carbon telling how long they take.
It's just my understanding from reading the commit, I think @saipraneethyss should know more.
Another note: you can also curl go-carbon with a force-scan request to manually trigger scanning, which was also added by @saipraneethyss's MR.
curl 127.0.0.1:8080/forcescan
Off topic: I think I can add a new feature to my WIP concurrent trie index to make new metrics immediately available, without waiting for a scan to be triggered. Should be doable.
@bom-d-van :
I think cache-scan is still influenced by scan-frequency.
Looks like it, but I think @saipraneethyss should know, indeed. Also, in the case above scan-frequency is 5 min, but metrics are delayed by ~10 mins...
Off topic: I think I can add a new feature to my WIP concurrent trie index to make new metrics immediately available, without waiting for a scan to be triggered. Should be doable.
I would say that's not off-topic at all; that would be great!
@jdblack : just for clarity - did you try trigram-index = true also?
@deniszh if go-carbon receives significantly more new metrics than max-creates-per-second, they will be delayed. Also, judging by the value (1000), it looks like the underlying storage is rather slow, and carbon in general doesn't like slow storage.
And as far as I remember, with the trigram index carbonserver doesn't query the cache for metrics that do not have data on disk.
If that's the case, either carbonserver should query the cache for metrics that don't exist on disk yet, or it might be a better idea to switch to ClickHouse, for example.
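As a rough, hypothetical illustration of that throttle: if a scale-up event introduces 120,000 brand-new series at once and max-creates-per-second = 1000, creating the whisper files alone takes at least 120,000 / 1,000 = 120 s (about 2 minutes) before any index rescan can even see them on disk, and a storage layer that can't sustain the configured rate stretches this further.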
@Civil : you're describing the usual go-carbon behavior, and I think @jdblack is aware of it. But we're trying to understand why the possible fix, which was implemented in #338, is not working. I suspect that it works only with the trigram index enabled.
@deniszh oh, I missed that PR. Thanks.
I suspect that it works only with the trigram index enabled.
@deniszh it's a good point; from reading the code, it should work for both index types: https://github.com/go-graphite/go-carbon/blob/master/carbonserver/carbonserver.go#L624-L638 However, it's better to verify it by trying it out.
Just in case, @jdblack , are you using the latest release of go-carbon published by @deniszh some time ago? Do you have the version number or commit checksum?
Wow guys! I really appreciate all of you taking such a close look at this! I'd also be very happy to share a zoom session if any of you would like to get a closer view of how this cluster looks and works.
I switched over to trigram-index and I'm still seeing the same problem. I'm using the Debian version of go-carbon 0.15.0, md5sum 48406db2a1fbb7df73fbd8e5f44b437b. I have embedded my most recent carbon.conf at the bottom of this comment.
A 5-minute scan-frequency should only account for about 5 minutes of delay. In this environment, the fetchable-after-create delay is only 7 minutes, but in some of our production environments, the delay is as long as 40 minutes! As mentioned previously, we detect this delay with an in-house tool that creates a completely new metric and times how long it takes for the metric to become renderable.
Our filesystems are indeed struggling (we're using compression on ZFS on RAIDZ with criminally oversized EBS volumes to buy IOPS). I agree that there would be no observable problem if the filesystem had plenty of breathing room. I'd like to suggest that it's a fair expectation that go-carbon+carbonserver should be able to render as-yet-uncreated metrics that it has in its caches, even if they haven't been persisted yet.
In fact, existing metrics can be pulled from the cache in the expected way with no problem. It's just metrics that have never existed before, such as when an ASG scales up and creates new instances.
Thanks again for all the help!
[common]
user = "_graphite"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 2
[whisper]
data-dir = "/graphite/whisper"
schemas-file = "/etc/carbon/storage-schemas.conf"
aggregation-file = "/etc/carbon/storage-aggregation.conf"
workers = 2
max-updates-per-second = 500
max-creates-per-second = 500
sparse-create = true
flock = true
enabled = true
hash-filenames = false
[cache]
max-size = 5000000
write-strategy = "max"
[tcp]
listen = ":2003"
enabled = true
buffer-size = 5000000
[logging]
logger = ""
file = "/var/log/carbon/carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"
[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"
[receiver.http]
protocol = "http"
listen = ":2006"
[carbonserver]
# Please NOTE: carbonserver is not intended to fully replace graphite-web
# It acts as a "REMOTE_STORAGE" for graphite-web or carbonzipper/carbonapi
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "90s"
write-timeout = "90s"
query-cache-enabled = true
query-cache-size-mb = 1024
find-cache-enabled = true
trigram-index = true
scan-frequency = "5m0s"
cache-scan = true
max-globs = 100
fail-on-max-globs = false
graphite-web-10-strict-mode = false
internal-stats-dir = ""
# Calculate /render request time percentiles for the bucket, '95' means calculate 95th Percentile. To disable this feature, leave the list blank
stats-percentiles = [99, 98, 95, 75, 50]
I also seem to be having a problem since switching to go-carbon 0.15 and enabling carbonserver and the trie options: load can temporarily shoot up from around 5 to over 50 for no explicable reason. Could that be the cache-scan option cruising the entire filesystem?
I'm starting to wonder if a lot of our pain and misery is coming from using ZFS for our backing store.
A 5-minute scan-frequency should only account for about 5 minutes of delay. In this environment, the fetchable-after-create delay is only 7 minutes, but in some of our production environments, the delay is as long as 40 minutes!
@jdblack By your description, it seems to be an issue with those servers? Maybe you can check if there is anything wrong with them.
It isn't that bad to have 7 mins of delay when using a scan frequency of 5 mins. How many metrics do you have?
You can find some metrics exported by go-carbon on indexing and identify the issues on those servers:
carbon.agents.{host}.metrics_known
carbon.agents.{host}.index_build_time_ns
carbon.agents.{host}.file_scan_time_ns
carbon.agents.{host} is the default prefix. These metrics are also logged in /var/log/carbon/carbon.log; you should be able to grep for them.
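For example, something along these lines should surface them (the log path is taken from the [logging] section above; the exact field names in the log output may differ between go-carbon versions):
grep -E 'metrics_known|index_build_time_ns|file_scan_time_ns' /var/log/carbon/carbon.log | tail -n 20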
I also seem to be having a problem since switching to go-carbon 0.15 and enabling carbonserver and the trie options: load can temporarily shoot up from around 5 to over 50 for no explicable reason. Could that be the cache-scan option cruising the entire filesystem?
That's new to me; in our case, using the trie index has lower CPU usage, especially when dealing with a high number of metrics. Both the trigram and trie indexes scan the filesystem.
I'm starting to wonder if a lot of our pain and misery is coming from using ZFS for our backing store.
Our Graphite team is using XFS.
Yes, a delay of 5-7 minutes would be fine. The environment that we're testing in has a delay of ~ 7-11 minutes. The problem that we're really trying to address is a delay of 40 minutes in our big production environments!
Have you tried running a find command in your production environment? How many metrics do you have there? I think you might have an issue at the filesystem or hardware level.
time find /graphite/whisper -type f > /dev/null
Yeah, I suspect those load spikes are coming from ZFS whenever the filesystem is fully scanned.
One thing I'm unclear about: why are the tries required to scan the filesystem when it's already known that the wsp files don't yet exist on disk?
One thing I'm unclear about: why are the tries required to scan the filesystem when it's already known that the wsp files don't yet exist on disk?
That's the way indexing works currently in go-carbon, for both trigram and trie. The program scans the whole data-dir every scan-frequency to figure out what the new metrics are and caches them in memory.
I would also suggest running sudo perf top to see what the CPU is doing.
I'm very sorry, I must be misunderstanding. Our problem is that we're not able to find/render metric series that do not yet exist on disk; in other words, there is no whisper file at all yet. How does scanning the filesystem for a file that we know doesn't exist help make them renderable?
To summarize:
For go-carbon, usually you can only query metrics that are already created on disk.
But with cache-scan, you can query metrics that haven't been created on disk yet; however, it only works when the trigram or trie index is enabled.
So to solve your problem of having newly created metrics available faster, you have to either make sure your go-carbon instance flushes the metrics cache to disk faster or use cache-scan with the trie or trigram index, which requires scanning your filesystem.
How does scanning the filesystem for a file that we know doesn't exist help make them renderable?
cache-scan has a feature of returning metrics that are still in the cache, even though they haven't been created on disk yet.
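As a sketch of those two options against the config posted earlier (values are purely illustrative): either raise the write throughput so the cache drains sooner, assuming the storage can sustain it, e.g.
[whisper]
max-updates-per-second = 1500
max-creates-per-second = 1500
or keep the cache-scan route, which needs an index:
[carbonserver]
trie-index = true
cache-scan = true
scan-frequency = "5m0s"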
Ahh, but I have found that metric series that do not exist yet (as in no .wsp file yet) are -not- renderable even if cache-scan and trie-index / trigram-index are turned on.
@jdblack : v0.15.3 was released, including the concurrent trie index. You can try whether it works for new metrics with
trie-index = true
concurrent-index = true
realtime-index = 100
cache-scan = false
If it works then we probably need to remove cache-scan code, because it's not really working.
@bom-d-van : what's a reasonable value for realtime-index? 10, 100, 1000, more?
what's a reasonable value for realtime-index? 10, 100, 1000, more?
Hi, I think a few thousand should be good enough, unless the go-carbon instance receives new metrics at a very high frequency. The reason it's a buffer is to be able to deal with bursts of new metrics and have them available.
remove cache-scan code, because it's not really working.
Oh, for this: realtime-index reuses some of the cache-scan logic, and realtime-index only supports trie-index at the moment. So we might still have to keep it.
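Putting those two answers together, the suggested fragment from the 0.15.3 comment above might end up looking something like this (the buffer value is only an example of "a few thousand", not a tested recommendation):
[carbonserver]
trie-index = true
concurrent-index = true
realtime-index = 2000
cache-scan = false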
Hi all! Sorry for missing out on the earlier discussion. @bom-d-van, thanks for pitching in with your valuable responses.
remove cache-scan code, because it's not really working.
We have been running the go-carbon instance with the cache-scan code on our production machine for close to 6 months now, where we handle more than 1 million metrics. With this change we were able to bring down our query-response time for new metrics from more than 30 minutes to less than 5 minutes (5m0s is our scan-frequency). And so far, this has been working for us fairly well.
And so far, this has been working for us fairly well.
@saipraneethyss : I can only 🤷 here, because according to @jdblack it is not working. But it's quite hard to reproduce and test, indeed.
I'll give 0.15.3 a try!
Some quick updates:
- Switching from ZFS to XFS brought -drastic- improvements in timings
- We see a sumSeries(perSecond(carbon.relays.*.metricsReceived)) of about 2.3 million in our largest environment. Does this mean our cluster sees 2.3 million data points per second?
Our largest environment has two graphite-cache clusters: a shiny new cluster of 36 instances on XFS, and an old cluster of 36 instances on ZFS that is now read-only while the XFS cluster fills. The new cluster is about 25 days old, while the old cluster goes back about a year.
- Our new XFS-based cluster has 1.19 million wsp files on each cache, and a find /graphite/whisper -type f takes 1m54s on average.
- The old (now read-only) ZFS cluster has 3.99M wsp files on each cache, and the same find usually takes 7m2s.
We ended up having to disable cache-scan, as walking the entire filesystem every few minutes was causing lookups against our cluster to collapse; the recurring scan just sucked away a big chunk of our IOPS.
I think our problem is that the metrics are sitting in go-carbon's input cache, totally hidden until they are processed by go-carbon.
When our developers scale up new instances, those new instances push out new metrics under a new namespace of servicename.${IP}.rest.of.the.namespace. These brand-new metrics are not renderable until after they have been persisted to disk. This differs from new data points for existing metrics, which become renderable -immediately-, even though the data points are still sitting in the memory cache.
To say this again: we have no problem looking up fresh data for wsp files that have been around a long time, but we can't look up metrics in carbonserver for which the wsp file does not yet exist. It's as if go-carbon requires that the wsp file at least be created before a metric is renderable from the cache.
I suspect that our pain comes from the metrics sitting unprocessed in go-carbon's memory cache. Is it possible that there is a bug in which go-carbon is able to render new data points for long-lived metrics from the cache, but data points for entirely new metrics (as in no wsp file previously existed) are not available until after they have been persisted at least once?
Switching from ZFS to XFS gave us a -huge- boost in metric write speed, allowing us to raise max-updates-per-second from 500 to 1500 (with IOPS left over for lookups!). This cut the delay to find new metrics from 30-40 minutes down to 9 minutes. The lower we get carbon.agents.*.cache.size, the lower the delay before new metrics become renderable, which totally makes sense.
I upgraded our dev clusters to v0.15.5 and I'm still seeing the same effect: newly created metrics are not renderable. This problem is easy to reproduce: just set max-updates-per-second to a value low enough to push up the points-per-update. I performed some tests with one of our smaller test clusters, which only gets ~25K metrics/sec.
Adjusting max-updates-per-second had a direct effect on the amount of time it takes for newly created metrics to become renderable. New data points for existing metrics were always immediately available. In other words, we never saw lag in our tests for total relay metrics sent and received, or for metrics received per cache. We only saw lag on "Graphite Create Delays", which comes from an internal tool that creates a metric named carbon.create_delay.$RANDOM (with the timestamp as its value) every few seconds and measures how long it takes to become renderable.
Timings from our test:
- max-updates-per-second: 450, points per update: 2, delay for novel metrics: 4.5 min
- max-updates-per-second: 50, points per update: 17, delay for novel metrics: 30-50 min
- max-updates-per-second: 1000, points per update: 1, delay for novel metrics: 1.25 min
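For reference, a minimal sketch of that kind of probe (the real in-house tool isn't shown in this thread; the host, the plaintext receiver on :2003 and the render endpoint on :8080 are assumptions taken from the configs above, and the nc flags may vary per netcat flavor):
name="carbon.create_delay.$RANDOM"
start=$(date +%s)
echo "$name 1 $start" | nc -w 1 127.0.0.1 2003
until curl -sf "http://127.0.0.1:8080/render?target=$name&format=json" | grep -q "$name"; do sleep 5; done
echo "$name became renderable after $(( $(date +%s) - start ))s"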
Your scan-frequency is "0", try going with "5m".
Are you asking me to set a scan-frequency of "5m" with cache-scan set to disabled?
We had a scan-frequency of 5m0s when cache-scan was turned on (you can see that in my September 28 comment), but it caused our production clusters to fall over: it took 7 minutes to perform a filesystem scan, starving the filesystem of the IOPS it needed to perform reads and writes. There are more details in a previous comment (starting with "I'll give 0.15.3 a try!").