etcd-backup-restore
Restoration of a larger DB takes more time and etcd restarts 5-6 times
As part of a performance test, we tried to restore a backup of 2 GB. The etcd container went through 5-6 restarts and then restored the data. It took around 4-5 minutes to restore the data after the 5-6 restarts. We also checked the memory consumption; it is not going OOM.
/assign @swapnilgm @amshuman-kr
Restoration time depends on the DB/backup size and the network speed. And since restoration is an ongoing process, the etcd-backup-restore health status will be false while it runs. For the etcd container, the liveness probe checks the /healthz endpoint of the etcd-backup-restore sidecar. This is by design. So, in all cases: higher DB size / lower network speed -> restoration takes longer -> /healthz endpoint returns false -> liveness probe on the etcd container fails -> etcd restarts.
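For illustration, a minimal sketch (not the actual etcd-backup-restore code) of how a sidecar /healthz handler can report failure for as long as a restoration is running; the port, response bodies and in-progress flag are assumptions made only for this sketch:

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// restorationInProgress is set to 1 while the sidecar is still restoring
// the data directory from a snapshot (assumed flag for this sketch).
var restorationInProgress int32

// healthzHandler reports the sidecar as unhealthy for as long as the
// restoration is running, which is what makes a liveness probe pointed at
// /healthz fail during a long restore.
func healthzHandler(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt32(&restorationInProgress) == 1 {
		http.Error(w, `{"health": "false"}`, http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte(`{"health": "true"}`))
}

func main() {
	http.HandleFunc("/healthz", healthzHandler)
	// Port 8080 is an assumption for this sketch.
	http.ListenAndServe(":8080", nil)
}
```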
Closing this for now. Feel free to reopen if you have any further queries regarding the same.
@swapnilgm This seems to be undesirable behaviour. We can pick it up at a lower priority. But it is a genuine issue, I think.
@amshuman-kr Maybe you guys had a detailed discussion around this. If my comment above doesn't address the issue, then I probably didn't understand the core issue, and the description provided in the issue needs to be elaborated further. Anyway, I'll reopen the issue for further discussion.
@swapnilgm There was no separate discussion as such. I just think that the etcd container restarting multiple times if restoration takes time is not nice. If it happens too many times it might even end up going into CrashLoopBackOff. That would not be nice at all.
Why not use readinessProbe instead? WDYT?
Thanks Amshu for triggering the discussion again. That helped me find a correction to the above explanation: the livenessProbe is on etcd itself, i.e. etcdctl get foo.
So, by design, in the higher DB size / lower network speed scenario -> restoration takes longer -> etcd fails to bootstrap -> liveness probe on the etcd container fails -> etcd restarts.
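For illustration, a hedged sketch of what such an exec-style liveness check boils down to; the exact command arguments and timeout here are assumptions, not taken from the actual pod spec:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// probeEtcd mimics an exec-style liveness check: run `etcdctl get foo`
// with a deadline. While etcd is still bootstrapping (for example,
// waiting on a long restoration), the command fails or times out; after
// enough consecutive failures the kubelet restarts the etcd container.
func probeEtcd(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	out, err := exec.CommandContext(ctx, "etcdctl", "get", "foo").CombinedOutput()
	if err != nil {
		return fmt.Errorf("liveness check failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := probeEtcd(5 * time.Second); err != nil {
		fmt.Println(err)
	}
}
```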
Now coming back to the next point regarding CrashLoopBackOff: yes, agreed. But the CrashLoopBackOff logic is under k8s control. We can't get rid of the liveness probe, since we want k8s to restart etcd in case it is failing to serve requests.
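To give a sense of why repeated restarts hurt: the kubelet backs off container restarts exponentially (roughly 10s, doubling on each crash, capped at 5 minutes, per the Kubernetes documentation). A small sketch of that growth, assuming those documented defaults:

```go
package main

import (
	"fmt"
	"time"
)

// crashLoopDelay approximates the kubelet's restart back-off: it starts
// around 10s, doubles on every subsequent crash, and is capped at 5m
// (values per the Kubernetes documentation; the behaviour is internal to
// the kubelet and not tunable per pod).
func crashLoopDelay(restart int) time.Duration {
	delay := 10 * time.Second
	for i := 1; i < restart; i++ {
		delay *= 2
		if delay > 5*time.Minute {
			return 5 * time.Minute
		}
	}
	return delay
}

func main() {
	for r := 1; r <= 6; r++ {
		fmt.Printf("restart %d: back-off before next start ~%v\n", r, crashLoopDelay(r))
	}
}
```

So 5-6 restarts during a 4-5 minute restoration can easily add several more minutes of back-off on top of the restoration time itself.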
Why not use readinessProbe instead? WDYT?
I didn't get it. What are you suggesting should be the readiness probe, and how would it work?
The livenessProbe is on etcd itself, i.e. etcdctl get foo. So, by design, in the higher DB size / lower network speed scenario -> restoration takes longer -> etcd fails to bootstrap -> liveness probe on the etcd container fails -> etcd restarts.
Understood. This is all due to the coupling of restoration with restart.
We can't get rid of the liveness probe, since we want k8s to restart etcd in case it is failing to serve requests.
Agreed. But it is not nice to have the etcd container potentially going into CrashLoopBackOff for slower restores by design. If we decouple the restoration from the restart then we can have our livenessProbe and eat the CrashLoopBackOff too :-)
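One possible shape of such a decoupling, as a sketch only (the status endpoint, port and etcd flags below are illustrative assumptions, not the project's actual API): the etcd container waits for the sidecar to report that restoration has finished and only then starts etcd, so a slow restore never surfaces as a failed liveness check on etcd.

```go
package main

import (
	"fmt"
	"net/http"
	"os/exec"
	"time"
)

// waitForRestoration polls a (hypothetical) sidecar status endpoint until
// the data directory has been restored/validated. Only then is etcd
// started, so the liveness probe on etcd only matters once etcd is
// actually running.
func waitForRestoration(statusURL string) {
	for {
		resp, err := http.Get(statusURL)
		if err == nil {
			ok := resp.StatusCode == http.StatusOK
			resp.Body.Close()
			if ok {
				return
			}
		}
		time.Sleep(5 * time.Second)
	}
}

func main() {
	// The endpoint name and port below are illustrative assumptions.
	waitForRestoration("http://localhost:8080/initialization/status")
	fmt.Println("restoration finished, starting etcd")
	if err := exec.Command("etcd", "--data-dir", "/var/etcd/data").Run(); err != nil {
		fmt.Println("etcd exited:", err)
	}
}
```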
Why not use readinessProbe instead? WDYT?
I didn't get it. What are you suggesting should be the readiness probe, and how would it work?
My mistake. I did not think it through :-)
Due to the above points, I am fine if you want to keep this issue open until we implement the decoupling of restore and restart or close it for now and create another issue when we want to pick it up.
BTW, @shreyas-s-rao has sample data of 1.3 GB where verification takes quite a bit of time and causes issues. I am sure restoration would have issues with that database too.
Yes Amshu, there is a chance of a potential CrashLoopBackOff. We'll see if we can improve our restart-restoration logic to avoid decoupling.
BTW, @shreyas-s-rao has sample data of 1.3 GB where verification takes quite a bit of time and causes issues. I am sure restoration would have issues with that database too.
The issue we faced was not necessarily because verification took too long, but because of a shortage of memory. Nonetheless, the backup itself is 1.5GB, so even if verification happens quickly (given enough memory), the restoration will still take time, which means we could definitely use it for running tests related to the restart-restore logic.
The long duration part is to be addressed via https://github.com/gardener/etcd-druid/issues/88.
Closing this issue as it has been addressed after the compaction enablement. Please reopen if needed.