etcd-cloud-operator
How to troubleshoot memory issues in etcd
Hi, I would like to know how to troubleshoot memory issues in etcd and how to mitigate them.
Hey!
As you said, you'd want to look at etcd itself: the operator's own memory usage is minimal, so it's best to refer to the etcd repository, docs, and code. Note that etcd is started as an embedded server inside etcd-cloud-operator, so at first it may look as if the operator is the one taking up memory.
I think the memory spike is due to the S3 backup. How do I disable the S3 backup? Also, how and where do I add profiling (https://github.com/google/pprof) to check the memory profile?
The snapshot providers stream the data from etcd to the snapshot destination, so I'd expect memory to stay flat as long as everything is implemented correctly, unless etcd itself has a memory spike as part of the save. Do you have a memory chart?
Disabling S3 snapshots is not recommended, as it will cripple your ability to do disaster recovery unless you enable the file backup provider with separate, reliable storage. By default, the operator requires a snapshot provider.
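To enable pprof, you'd want to inject it in the main here behind a command-line flag:
import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on http.DefaultServeMux

	"go.uber.org/zap"
)

// Assumes flagPprof is a *string, e.g. from flag.String("pprof", "", "...").
if flagPprof != nil && *flagPprof != "" {
	go func() {
		zap.S().Infof("enabling pprof on %s", *flagPprof)
		if err := http.ListenAndServe(*flagPprof, nil); err != nil {
			zap.S().Errorf("pprof server stopped: %v", err)
		}
	}()
}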
The baseline has shifted and memory keeps piling up, and I can see that these spikes happen during the backup to S3. Can I adjust this configuration?
snapshot:
  provider: s3 # This should be configured to S3 in any real environments.
  interval: 30m
  ttl: 24h
So the backup is not very aggressive? Maybe I should increase the interval or reduce the TTL. If so, what values would you recommend?
Ideally, this backup should run during off-peak hours. How can I schedule it to run once a week during off-peak hours?
Can you please help here?