etcd-cloud-operator

How to troubleshoot memory issues in etcd

Open iamnst19 opened this issue 1 year ago • 6 comments

Hi, I would like to know how we can troubleshoot memory issues in etcd and how to mitigate them.

iamnst19 avatar Jun 26 '24 09:06 iamnst19

Hey!

Like you said, you'd be looking at etcd itself: the operator's own memory usage is going to be very minimal, so it is best to refer to the etcd repository, docs, and code. Etcd is started as an embedded server as part of the etcd-cloud-operator, though, so it may at first seem as if the operator is the one taking up memory.

Quentin-M avatar Jun 26 '24 09:06 Quentin-M

I think the memory spike is due to the S3 backup. How do I disable the S3 backup? Also, how and where do I need to add profiling (https://github.com/google/pprof) to check the memory profile?

iamnst19 avatar Jul 10 '24 18:07 iamnst19

The snapshot provider streams the data from etcd towards the snapshot destination, so I'd think it should be fine if everything is implemented correctly, unless etcd itself has a memory spike as part of the save somehow. Do you have a memory chart?

Disabling S3 snapshots is not recommended as this will cripple your ability to do disaster recovery, unless you enable the file backup provider with a separate and reliable storage to use. By default, the operator requires a snapshot provider.

To enable pprof, you'd want to inject it in the main here behind a command-line flag:

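import (
  "net/http"
  _ "net/http/pprof" // side-effect import: registers /debug/pprof/ handlers on http.DefaultServeMux
)

// Assumes flagPprof is a *string (e.g. from flag.String) holding the listen address.
if flagPprof != nil && len(*flagPprof) > 0 {
  go func() {
    zap.S().Infof("enabling pprof on %s", *flagPprof)
    // A nil handler serves http.DefaultServeMux, where pprof registered itself.
    if err := http.ListenAndServe(*flagPprof, nil); err != nil {
      zap.S().Errorf("pprof server stopped: %v", err)
    }
  }()
}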

Quentin-M avatar Jul 10 '24 22:07 Quentin-M

Screenshot 2024-07-11 at 11 18 51 AM

The baseline has shifted and memory keeps growing, and I can see these spikes happening during the backup to S3. Can I make an adjustment to this:

snapshot:
    provider: s3 # This should be configured to S3 in any real environments.
    interval: 30m
    ttl: 24h

That is, so that the backup is less aggressive? Maybe increase the interval or reduce the TTL? If so, what should the desired values be here?
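
For a rough sense of what changing these two values does, here is a small sketch of the arithmetic. It assumes a snapshot is taken every `interval` and kept until its `ttl` expires, which is my reading of the config rather than something confirmed in this thread; `snapshotsRetained` is a made-up helper name.

```go
package main

import (
	"fmt"
	"time"
)

// snapshotsRetained estimates how many snapshots exist at steady state,
// ASSUMING one snapshot per interval and deletion once the TTL has passed.
func snapshotsRetained(interval, ttl string) (int, error) {
	i, err := time.ParseDuration(interval)
	if err != nil {
		return 0, fmt.Errorf("parsing interval: %w", err)
	}
	t, err := time.ParseDuration(ttl)
	if err != nil {
		return 0, fmt.Errorf("parsing ttl: %w", err)
	}
	if i <= 0 {
		return 0, fmt.Errorf("interval must be positive, got %s", interval)
	}
	return int(t / i), nil
}

func main() {
	// Current values from the config above, then a less frequent alternative.
	for _, interval := range []string{"30m", "1h"} {
		n, err := snapshotsRetained(interval, "24h")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("interval=%s ttl=24h: about %d snapshots retained, %d saves per day\n",
			interval, n, n)
	}
}
```

Under that assumption, raising the interval lowers how often the save runs, while the TTL only changes how many snapshots are kept in the bucket.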

iamnst19 avatar Jul 11 '24 05:07 iamnst19

Ideally, this backup activity should happen during off-peak hours. How can I schedule the backup to run once a week during off-peak hours?

iamnst19 avatar Jul 12 '24 10:07 iamnst19

Can you please help here?

iamnst19 avatar Jul 23 '24 19:07 iamnst19