VictoriaMetrics
vmagent panic on remoteWrite.streamAggr.dedupInterval
Describe the bug
vmagent crashes periodically when the -remoteWrite.streamAggr.dedupInterval="0s,120s" flag is set.
To Reproduce
Run vmagent with remoteWrite.streamAggr.dedupInterval set, using this configuration:
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: vmagent-multi-retention
  namespace: victoria-metrics
spec:
  image:
    tag: v1.101.0
  selectAllByDefault: true
  replicaCount: 1
  scrapeInterval: 20s
  scrapeTimeout: 10s
  externalLabels:
    cluster: mycluster
  extraArgs:
    promscrape.streamParse: 'true'
    remoteWrite.streamAggr.dedupInterval: "0s,120s"
  statefulMode: true
  statefulStorage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 20Gi
  remoteWrite:
    - url: "http://vminsert-vmcluster-retention-1m.victoria-metrics.svc:8480/insert/0/prometheus/api/v1/write"
    - url: "http://vminsert-vmcluster-retention-3m.victoria-metrics.svc:8480/insert/0/prometheus/api/v1/write"
Version
./vmagent-prod --version
vmagent-20240425-145801-tags-v1.101.0-0-g5334f0c2c
Logs
panic: runtime error: index out of range [6] with length 0
goroutine 15146 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*writeRequest).copyTimeSeries(0xc000000008, 0xc004a236e0, 0xc000a796e8)
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/pendingseries.go:207 +0x6a9
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*writeRequest).tryPush(0xc000000008, {0xc000a72008, 0x283, 0xc0004f8820?})
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/pendingseries.go:192 +0x6d
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*pendingSeries).TryPush(0xc000000000, {0xc000a72008?, 0x40c025?, 0x10?})
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/pendingseries.go:64 +0x67
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*remoteWriteCtx).tryPushInternal(0x8?, {0xc000a72008?, 0x0?, 0xc00013c510?})
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:1015 +0x1c5
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.(*remoteWriteCtx).TryPush(0xc000099b60, {0xc000a72008?, 0x10a20?, 0xc0000a3950?})
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:957 +0x605
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.tryPushBlockToRemoteStorages.func1(0xc00117aeac?)
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:593 +0x65
created by github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite.tryPushBlockToRemoteStorages in goroutine 49
github.com/VictoriaMetrics/VictoriaMetrics/app/vmagent/remotewrite/remotewrite.go:591 +0xea
Screenshots
No response
Used command-line flags
command-line flags -httpListenAddr=":8429" -promscrape.config="/etc/vmagent/config_out/vmagent.env.yaml" -promscrape.streamParse="true" -remoteWrite.maxDiskUsagePerURL="1073741824" -remoteWrite.streamAggr.dedupInterval="0s,2m0s" -remoteWrite.tmpDataPath="/vmagent_pq/vmagent-remotewrite-data" -remoteWrite.url="secret"
Additional information
No response
Thanks for the report! This looks like a race condition. @AndrewChubatiuk would you mind taking a look?
It only happens when you have multiple remote write targets where:
- some of them run with a deduplicator.
- others don't.
The remote write with a deduplicator pushes data here:
https://github.com/VictoriaMetrics/VictoriaMetrics/blob/5334f0c2ce91d975d22012546d882917c0ff5fcf/app/vmagent/remotewrite/remotewrite.go#L951
and then calls clear(tss), while the remote write without a deduplicator pushes data here:
https://github.com/VictoriaMetrics/VictoriaMetrics/blob/5334f0c2ce91d975d22012546d882917c0ff5fcf/app/vmagent/remotewrite/remotewrite.go#L957
And here's the critical part:
https://github.com/VictoriaMetrics/VictoriaMetrics/blob/5334f0c2ce91d975d22012546d882917c0ff5fcf/app/vmagent/remotewrite/pendingseries.go#L181
The goroutine without a deduplicator refers to the time series data by index (tsSrc := &src[i]), so it reads from memory that the other goroutine might already have cleared.
The goroutine with a deduplicator, by contrast, refers to the time series data through a copy:
for _, ts := range tss {
It can be reproduced whenever you have:
- some remote writes going through the deduplicator path (dedupInterval != 0s)
- some remote writes going through the normal path (dedupInterval = 0s)
Hope this helps.
@alexintech just curious: if you change the order to 120s,0s, will it also cause an error?
It'd be great to build vmagent with the race detector (make vmagent-race) and test it for possible data races.
Note that this significantly reduces the performance of the application and must be used only for testing.
The most obvious cause is this, as mentioned by @jiekun. I've reproduced the issue as well and tested these changes. @alexintech, you can try this if you want.
@alexintech just curious: if you change the order to 120s,0s, will it also cause an error?
Same error, but it crashes sooner, right after startup.
@alexintech you can try this if you want
I'll check
Re-opening the issue since https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6206 isn't released yet. It will be included in the next release.
This issue should be fixed in the v1.102.0-rc1 release.
FYI, see the follow-up commit 4f99799db706790af7fd79a47d0c00ae720af006