Avoid unnecessary restic unlocks
This greatly reduces the amount of S3 requests while idling. Fixes #2820
A Restic lock is a "file lock" persisted in the backend storage; its purpose is to gain access to the repository. Every backup tries to acquire the lock before writing any data to the repository. In Velero's scenarios, a stale lock is a Restic lock left behind when the Velero server, the Velero daemonset, or a Restic process crashes; a Restic lock is regarded as stale when it has existed for longer than 30 min.
In the old code, as long as the repository's phase is not nil or New, Velero checks for stale locks.
With the changes in the current PR, the checks are run when:
- The repository's phase is NotReady
- The repository's phase is Ready and maintenance is due (by default, the due time is 7 days)
This means that for a Ready repository, the stale lock check runs every 7 days, which is significantly longer than the lock stale time of 30 min. The consequence is that unexpected stale locks may be left there for a much longer time.
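For illustration, a minimal Go sketch of the new decision (the phase names and the 7-day interval below are stand-ins for the real Velero types and settings, not the actual code):

```go
package main

import "time"

// Sketch only: decide whether a repository should be checked for stale locks.
// "NotReady"/"Ready" and the 7-day interval are hypothetical stand-ins for
// Velero's ResticRepository phases and maintenance-due logic.
func shouldCheckStaleLocks(phase string, lastMaintenance, now time.Time) bool {
	const maintenanceDue = 7 * 24 * time.Hour // default maintenance frequency

	switch phase {
	case "NotReady":
		// A NotReady repository is always checked.
		return true
	case "Ready":
		// A Ready repository is checked only when maintenance is due.
		return now.Sub(lastMaintenance) >= maintenanceDue
	default:
		// New or empty phase: no stale-lock check.
		return false
	}
}
```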
Finally, the impacts are as below:
- Some operations, for example forget and maintenance, create exclusive locks. If an exclusive lock is there, no one else can lock the repo at the same time. Suppose Velero starts a forget CLI and then crashes; the exclusive lock will be left there for at least 7 days. During this time, no backup/restore can run, as these Restic operations create a non-exclusive lock.
- Nearly every Restic operation creates a non-exclusive lock. Restic has a lock refresh mechanism, which recreates the locks every 5 min; if there are too many stale locks, there will be more access to the backend storage.
Thank you for the fast response and detailed explanation.
If I understand the restic code correctly, it also tries to clean up locks when they are created, so backup/restore operations should not be affected. But even if I understood this incorrectly, maybe it would be better to add this cleanup before backup/restore as well? In that scenario, cleaning up locks every 7 days when nothing is using the repository is just a cosmetic issue, as those locks are not valid anyway, and since locks are small it isn't a big deal to keep them in the backing storage, while constantly checking whether they are there generates much higher costs.
I do not understand what you mean by "if there are too many stale locks, there will be more access to the backend storage". Restic refreshes locks only while an operation is running, so those locks are not refreshed the whole time Velero is idling.
maybe it would be better to add this cleanup before backup/restore as well
This looks like a good idea, but instead of doing it for backup/restore only, we need to do it before running every Restic command, because, as mentioned, nearly every Restic operation needs to acquire a non-exclusive lock.
I do not understand what you mean by "if there are too many stale locks, there will be more access to the backend storage"
Indeed, the refreshLocks goroutine is launched whenever an operation tries to acquire a lock. I didn't mean there is storage access during idle time, but if the stale locks are not deleted in time, for each operation, there will be multiple access requests for each stale lock; by that, I meant "more access". However, if we can delete the stale locks in time (for example, delete before every operation or periodically clear them, as the current behavior does), the problem will not happen.
but if the stale locks are not deleted in time, for each operation, there will be multiple access requests for each stale lock
Currently we have thousands of requests just to prevent stale locks, even if there wasn't any crash; those are for sure more expensive in general.
I think the main question is whether restic recognises that stale locks are no longer valid, or whether they prevent any operations. Because if they don't prevent operations, then IMO clearing them every 7 days might not be that bad.
Because stale locks can happen only after a crash (right?), maybe it would be enough to only clear them on start: if there was a crash or an abnormal abort (like a reboot/power failure), Velero will eventually be restarted, and while it is running successfully there shouldn't be any new stale locks.
I think the main question is whether restic recognises that stale locks are no longer valid
I don't think so; otherwise, Velero would not need the stale lock logic.
or whether they prevent any operations
Yes, they do. If there is already an exclusive lock, no other lock can be created, which means no other operations can be done.
But what do you think about the second part?
Because stale locks can happen only after a crash (right?), maybe it would be enough to only clear them on start: if there was a crash or an abnormal abort (like a reboot/power failure), Velero will eventually be restarted, and while it is running successfully there shouldn't be any new stale locks.
Could that be an alternative solution?
There are several cases in which stale locks can be left behind; the known ones are:
- The Velero server abnormally quits for any reason
- The Velero daemonset abnormally quits for any reason
- A Restic process abnormally quits for any reason
What you have mentioned cannot solve case 3. Eventually, safely doing the stale lock checks comes down to one place: before/after each Restic operation, that is, before/after each Restic command. This way is not complex, because Velero has one central place where it launches the Restic commands.
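As a rough sketch of that idea, assuming a hypothetical wrapper around the restic CLI (not Velero's actual exec code), the cleanup could be attached to the one place that launches the commands:

```go
package main

import (
	"context"
	"os/exec"
)

// Sketch: clear stale locks right before launching any restic command.
// repoFlags carries the repository/credential flags; both names are
// hypothetical and only illustrate the "one central place" idea.
func runRestic(ctx context.Context, repoFlags []string, args ...string) ([]byte, error) {
	// Best-effort cleanup: "restic unlock" removes stale locks only.
	// Errors are ignored so a failed cleanup does not block the operation.
	unlock := exec.CommandContext(ctx, "restic", append([]string{"unlock"}, repoFlags...)...)
	_ = unlock.Run()

	// Run the actual command (backup, restore, forget, prune, ...).
	cmd := exec.CommandContext(ctx, "restic", append(args, repoFlags...)...)
	return cmd.CombinedOutput()
}
```

Note that a plain `restic unlock` removes only locks restic considers stale, so in principle it should be safe to run before each command even while other operations hold valid locks.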
What do you think now?
I couldn't add this to exec, because it would cause an infinite loop, as UnlockRepo also uses that.
I also included a throttle to only check after a minimum of 30 min since the last successful cleanup, so that multiple operations on a single repository, like backup/restore, are not slowed down.
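A minimal sketch of such a throttle (per repository, in memory; the type and method names are made up for illustration):

```go
package main

import (
	"sync"
	"time"
)

// Sketch: skip the stale-lock cleanup if it ran successfully for this
// repository less than 30 minutes ago, so back-to-back operations
// (e.g. backup followed by restore) don't repeat it.
type unlockThrottle struct {
	mu      sync.Mutex
	lastRun map[string]time.Time // keyed by repository identifier
}

func (t *unlockThrottle) shouldUnlock(repo string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	// Zero time for an unknown repo makes this true, so the first call always runs.
	return time.Since(t.lastRun[repo]) >= 30*time.Minute
}

func (t *unlockThrottle) markSuccess(repo string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.lastRun == nil {
		t.lastRun = map[string]time.Time{}
	}
	t.lastRun[repo] = time.Now()
}
```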
@Szpadel Thanks for the quick action.
I think it is possible to add the check in the exec function; it would be like calling veleroexec.RunCommand one more time, for the unlock command. And I believe this will make the code cleaner.
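One way to read that suggestion while avoiding the infinite loop mentioned above is to skip the extra cleanup when the command being executed is already unlock. A rough sketch, with a plain os/exec call standing in for veleroexec.RunCommand:

```go
package main

import (
	"context"
	"os/exec"
)

// Sketch: the cleanup lives inside the central helper itself; the guard on
// the subcommand prevents "unlock" from recursively triggering another
// cleanup and looping forever. This is an illustration, not Velero's code.
func execRestic(ctx context.Context, repoFlags []string, subcommand string, args ...string) (string, error) {
	if subcommand != "unlock" {
		// Recursive call that clears stale locks; its result is ignored on purpose.
		_, _ = execRestic(ctx, repoFlags, "unlock")
	}
	fullArgs := append(append([]string{subcommand}, args...), repoFlags...)
	out, err := exec.CommandContext(ctx, "restic", fullArgs...).CombinedOutput()
	return string(out), err
}
```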
Give us some time to discuss this with the entire team; until then, you can keep the code as is, and we may add more suggestions later.
Again, thank you for your contribution to Velero.
@Szpadel Thanks for this PR! I noticed that issue #2820 was closed before we started working on the velero project, and the comment seems to remain valid as we continue moving away from restic: https://github.com/vmware-tanzu/velero/issues/2820#issuecomment-839196376
Hence, I wish to suggest that we do not merge it into the main branch currently; you may keep this PR for yourself if it's helpful for your use case.
@reasonerjt Unfortunately, maintaining a velero fork isn't an option for me.
Can you at least share the real reason why this cannot be merged?
The restic integration isn't documented as deprecated anywhere, and other PRs for restic are still being merged, even after this PR was created (https://github.com/vmware-tanzu/velero/pulls?q=is%3Apr+restic+is%3Aclosed). The linked issue is 2 years old, and if you are moving away from restic, what has been done in the meantime? What is the alternative?
Sorry, but lying to a community that provides free work for the project isn't great for its future.
If any of what you are suggesting is true, please at least document the restic integration as deprecated/unsupported on the project page, so that people starting new integrations are aware they should look for an alternative before they spend time on the integration.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing the stale issue.