
Restoration of a larger DB takes more time and etcd restarts 5-6 times

nikhilsap opened this issue 6 years ago • 9 comments

As part of a performance test, we tried to restore a 2 GB backup; the etcd container went through 5-6 restarts before the data was restored. It took around 4-5 minutes to restore the data after those 5-6 restarts. We also checked memory consumption; it is not going OOM.

nikhilsap avatar Jan 29 '19 10:01 nikhilsap

/assign @swapnilgm @amshuman-kr

nikhilsap avatar Jan 29 '19 10:01 nikhilsap

Restoration time depends on the DB/backup size and network speed. And since restoration is an ongoing process, the etcd-backup-restore health status will be false while it runs. For the etcd container, the liveness probe checks the /healthz endpoint of the etcd-backup-restore sidecar. This is by design. So, in short: higher DB size / lower network speed -> restoration takes longer -> the /healthz endpoint returns false -> the liveness probe on the etcd container fails -> etcd restarts.
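
For illustration, a minimal sketch of the setup as described in this comment, with the etcd container's liveness probe pointed at the sidecar's /healthz endpoint (a later comment in this thread corrects which probe is actually involved; the port, image names, and timings here are assumptions for the sketch, not the project's actual manifest):

```yaml
# Sketch only: liveness probe on the etcd container that queries the
# etcd-backup-restore sidecar's /healthz endpoint over the shared pod IP.
# Port 8080, image names and timings are illustrative assumptions.
containers:
- name: etcd
  image: quay.io/coreos/etcd:v3.3.13
  livenessProbe:
    httpGet:
      path: /healthz   # served by the backup-restore sidecar
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 5
    failureThreshold: 3   # keeps failing for as long as restoration is still running
- name: backup-restore
  image: etcdbrctl:latest   # placeholder image name
  ports:
  - containerPort: 8080     # /healthz endpoint
```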

Closing this for now. Feel free to reopen if you have any further queries regarding the same.

swapnilgm avatar Jan 29 '19 19:01 swapnilgm

@swapnilgm This seems to be undesirable behaviour. We can pick it up at a lower priority, but I think it is a genuine issue.

amshuman-kr avatar Jan 30 '19 04:01 amshuman-kr

@amshuman-kr Maybe you had a detailed discussion around this. If my comment above doesn't address the issue, then I probably didn't understand the core issue, and the description provided in the issue probably needs to be elaborated further. Anyway, I'll reopen the issue for further discussion.

swapnilgm avatar Jan 30 '19 06:01 swapnilgm

@swapnilgm There was no separate discussion as such. I just think that the etcd container restarting multiple times while restoration takes time is not nice. If it happens too many times it might even end up going into CrashLoopBackOff. That would not be nice at all.

Why not use readinessProbe instead? WDYT?

amshuman-kr avatar Jan 30 '19 10:01 amshuman-kr

Thanks Amshu for triggering the discussion again. That helped me spot a correction to the explanation above. The livenessProbe is on etcd itself, i.e. etcdctl get foo. So, by design, in any higher DB size / lower network speed scenario -> restoration takes longer -> etcd fails to bootstrap -> the liveness probe on the etcd container fails -> etcd restarts.
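
For illustration, a minimal sketch of that kind of liveness probe, an exec check against etcd itself (the timings, the ETCDCTL_API setting and the shell wrapper are assumptions for the sketch):

```yaml
# Sketch only: liveness probe that exercises etcd directly.
# While restoration is still in progress etcd has not bootstrapped,
# so the command fails and the kubelet restarts the etcd container.
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - ETCDCTL_API=3 etcdctl get foo
  initialDelaySeconds: 15
  periodSeconds: 5
  failureThreshold: 3
```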

Now coming back to the next point regarding CrashLoopBackOff: yes, agreed. But the CrashLoopBackOff logic is under Kubernetes' control. We can't get rid of the liveness probe, since we want Kubernetes to restart etcd if it fails to serve requests.
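
To make the restart timing concrete, a rough sketch with illustrative numbers (the values are assumptions; the kubelet's restart back-off itself is not configurable per pod):

```yaml
# Sketch only: with these illustrative values, a container that never
# becomes healthy is restarted roughly
#   initialDelaySeconds + periodSeconds * failureThreshold = 15s + 5s * 3 = 30s
# after it starts. Repeated failing restarts are then delayed by the
# kubelet's exponential back-off (10s, 20s, 40s, ... capped at 5 minutes),
# which is what shows up as CrashLoopBackOff.
livenessProbe:
  initialDelaySeconds: 15
  periodSeconds: 5
  failureThreshold: 3
```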

Why not use readinessProbe instead? WDYT?

I didn't get that. What are you suggesting as the readiness probe, and how would it work?

swapnilgm avatar Jan 30 '19 12:01 swapnilgm

The livenessProbe is on etcd itself, i.e. etcdctl get foo. So, by design, in any higher DB size / lower network speed scenario -> restoration takes longer -> etcd fails to bootstrap -> the liveness probe on the etcd container fails -> etcd restarts.

Understood. This is all due to the coupling of restoration with restart.

We can't get rid of the liveness probe, since we want Kubernetes to restart etcd if it fails to serve requests.

Agreed. But it is not nice to have the etcd container potentially going into CrashLoopBackOff for slower restores by design. If we decouple restoration from restart then we can have our livenessProbe and eat the CrashLoopBackOff too :-)

Why not use readinessProbe instead? WDYT?

I didn't get that. What are you suggesting as the readiness probe, and how would it work?

My mistake. I did not think it through :-)

Due to the above points, I am fine if you want to keep this issue open until we implement the decoupling of restore and restart, or close it for now and create another issue when we want to pick it up.

BTW, @shreyas-s-rao has a 1.3 GB sample database where verification takes quite a bit of time and causes issues. I am sure restoration would have issues with that database too.

amshuman-kr avatar Jan 31 '19 04:01 amshuman-kr

Yes Amshu, there is a chance for a potential CrashLoopBackOff. We'll see if we can improve our restart-restoration logic to avoid decoupling.

BTW, @shreyas-s-rao has a 1.3 GB sample database where verification takes quite a bit of time and causes issues. I am sure restoration would have issues with that database too.

The issue we faced was not necessarily because verification took too long, but because of a shortage of memory. Nonetheless, the backup itself is 1.5GB, so even if verification happens quickly (given enough memory), the restoration will take time, so we could definitely use it for running tests related to the restart-restore logic.

shreyas-s-rao avatar Jan 31 '19 05:01 shreyas-s-rao

The long duration part is to be addressed via https://github.com/gardener/etcd-druid/issues/88.

amshuman-kr avatar Oct 28 '20 06:10 amshuman-kr

Closing this issue as it is addressed now that compaction is enabled. Please reopen if needed.

abdasgupta avatar Jan 05 '23 06:01 abdasgupta