A subset of metrics disappears from m3db after several rounds of scale up/down
We recently had an m3db outage where a subset of metrics just "disappeared" for a period of time and couldn't be queried. Here is the series of events that we think caused this to happen (a sketch of the scaling change follows the list):
- We scaled the m3db cluster up from 1 replica per isolation group to 2 replicas per isolation group. We have 3 isolation groups.
- We scaled the cluster back down from 2 replicas per isolation group to 1 replica per isolation group.
- We scaled the cluster up again from 1 replica per isolation group to 2 replicas per isolation group.
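For reference, here is a minimal sketch of the kind of spec change involved in those steps, assuming the scaling was done through the operator's numInstances fields (only the relevant part of the spec is shown; values and comments are illustrative):

isolationGroups:
- name: group1
  numInstances: 2   # steps 1 and 3 scaled this 1 -> 2; step 2 scaled it back 2 -> 1
- name: group2
  numInstances: 2   # the same change was applied to every isolation group
- name: group3
  numInstances: 2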
After step 3, we started to see some metrics disappear from the cluster; they could no longer be queried (there was a gap in those metrics after step 3). All writes and reads to the cluster were successful and there were no failures. One thing worth mentioning: when the new replicas came up from the second scale-up in step 3, they were using the same disks that had been provisioned for the replicas brought up in step 1. A quick look showed those disks had some index data, but we don't think they had any actual metrics data.
To mitigate this, we scaled the cluster down by editing the placement, deleted the old disks, and then scaled the cluster back up with newly provisioned disks. The metrics started working normally again, and they even reappeared for the incident time window, so there was no longer a metric gap.
What service is experiencing the issue? (M3Coordinator, M3DB, M3Aggregator, etc)
M3DB v1.3.0
What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary).
Here is the m3db cluster configuration YAML:
spec:
  annotations:
    ...
  configMapName: m3db-config-map
  containerResources:
    limits:
      cpu: 8.5
      memory: 88Gi
    requests:
      cpu: 8.5
      memory: 88Gi
  dataDirVolumeClaimTemplate:
    metadata:
      creationTimestamp: null
      name: m3db-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 2Ti
      storageClassName: m3db-storage-class
    status: {}
  dnsPolicy: ClusterFirstWithHostNet
  etcdEndpoints:
  - http://etcd-0.etcd.m3.svc.cluster.local:2379
  - http://etcd-1.etcd.m3.svc.cluster.local:2379
  - http://etcd-2.etcd.m3.svc.cluster.local:2379
  hostNetwork: true
  image: ...
  isolationGroups:
  - name: group1
    nodeAffinityTerms:
    - key: pool
      values:
      - m3
    numInstances: 2
  - name: group2
    nodeAffinityTerms:
    - key: pool
      values:
      - m3
    numInstances: 2
  - name: group3
    nodeAffinityTerms:
    - key: pool
      values:
      - m3
    numInstances: 2
  namespaces:
  - name: default
    options:
      bootstrapEnabled: true
      cleanupEnabled: true
      flushEnabled: true
      indexOptions:
        blockSize: 8h
        enabled: true
      repairEnabled: true
      retentionOptions:
        blockDataExpiry: true
        blockDataExpiryAfterNotAccessPeriod: 5m
        blockSize: 8h
        bufferFuture: 10m
        bufferPast: 15m
        retentionPeriod: 2160h
      snapshotEnabled: true
      writesToCommitLog: true
  numberOfShards: 64
  parallelPodManagement: true
  podIdentityConfig:
    sources: []
  podMetadata:
    creationTimestamp: null
  priorityClassName: m3db
  replicationFactor: 3
How are you using the service? For example, are you performing read/writes to the service via Prometheus, or are you using a custom script?
We're performing reads and writes through the M3Coordinators.
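As a minimal sketch (in case it helps), a Prometheus-style remote read/write configuration pointing at the coordinators looks roughly like the following; the service hostname is illustrative, and 7201 is the coordinator's default HTTP port:

# Illustrative Prometheus remote read/write config targeting the coordinators.
remote_write:
  - url: "http://m3coordinator.m3.svc.cluster.local:7201/api/v1/prom/remote/write"
remote_read:
  - url: "http://m3coordinator.m3.svc.cluster.local:7201/api/v1/prom/remote/read"
    read_recent: true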
Is there a reliable way to reproduce the behavior? If so, please provide detailed instructions.
We haven't attempted to reproduce this issue yet, but we wanted to check whether the series of events described above is something that should be avoided in general.
Please let me know if you need any more details/configs. Is the series of events described above not expected to work?