argo-cd
Timeout for sync operations
Summary
At the moment, if for whatever reason the sync process gets stuck (e.g. because some resource fails to start up properly and keeps on retrying), the sync will never complete and will keep on "Syncing".
There should be an option to add a timeout, after which the sync process would terminate. Depending on selfHeal rules, etc., there may be a need to automatically retry, or alternatively, the application should just stay in the failed state until manually resolved.
Did my best to search for similar requests; aside from a brief note in #1886, I couldn't find anything. Sorry if I missed it.
Motivation
At the moment, we've set up alerting for sync operations that are taking too long, which at least notifies someone to look at things and usually means a manual intervention.
When an application is in a "Syncing" state, manual intervention becomes rather tricky: one cannot delete resources to get them recreated (esp. when things are stuck in some sync wave), perform a partial sync, etc.
Moreover, simply hitting "Terminate" is not always sufficient if the application has autosync enabled, as it would just retry, putting it into a forever "Syncing" state. Disabling autosync in some cases might also be problematic and require multiple steps, because it might be set from a parent application - which means that the parent application autosync also needs to be disabled (so that it does not just resync and re-enable the autosync).
Proposal
syncPolicy:
  syncTimeout: 600 # seconds, default: unlimited
  onSyncTimeout: "fail" # or "retry" (?), or "waitForUpdate" (?)
Some of the things that might need consideration:
- Should selfHeal just retry? Or should that be configurable? The previous sync might not have completed in full, so hooks/postsync actions might not have executed.
- Should new commits result in a new sync operation? Same as above, essentially. Arguably, new commits could be the fix.
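To make the proposal a bit more concrete, here is a minimal sketch of how the two proposed fields could be evaluated. Everything in it is hypothetical: `on_timeout_action`, its parameters, and the `"continue"` return value are names made up for illustration, and `syncTimeout` / `onSyncTimeout` are only the fields proposed above, not existing Argo CD options.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Values proposed above for onSyncTimeout (none of them exist in Argo CD today).
ALLOWED_ACTIONS = ("fail", "retry", "waitForUpdate")


def on_timeout_action(
    started_at: datetime,
    now: datetime,
    sync_timeout: Optional[int],
    on_sync_timeout: str = "fail",
) -> str:
    """Return what a controller could do with a running sync (hypothetical model)."""
    if on_sync_timeout not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown onSyncTimeout value: {on_sync_timeout!r}")
    # No timeout configured: the proposed default is "unlimited".
    if sync_timeout is None:
        return "continue"
    # Still within the allowed window: leave the sync alone.
    if now - started_at <= timedelta(seconds=sync_timeout):
        return "continue"
    # Timed out: apply the configured policy.
    return on_sync_timeout


started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(on_timeout_action(started, started + timedelta(seconds=601), 600))
```

Whether "retry" should re-run hooks, and whether "waitForUpdate" should react only to new commits, are exactly the open questions listed above; the sketch does not answer them.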
I would like to work on this issue.
Is there any update on that?
@RaviHari is there any update on this? We've run into this issue because a pre-hook failed and the app is now locked in an always-syncing state.
@prima101112 and @hanzala1234 sorry for the delay. I will get started on this and keep you posted in this thread.
@RaviHari Did you get round to starting on this?
+1
+1
> Moreover, simply hitting "Terminate" is not always sufficient
I've also seen "Terminate" simply cause the sync operation to get stuck in "Terminating." This was in an app with ~1k resources.
If Ravi or anyone else puts up a PR, I'd be happy to review.
+1
Looking forward to this feature too. I have a lot of applications getting stuck, and a timeout would be great so that they don't block other, unrelated resources.
It seems like @RaviHari has lost interest in this, at least he stopped responding. We'd still appreciate that feature very much (we're using the app-of-apps pattern and sometimes it just gets stuck, and a timeout would really help). Any chance someone else can implement this?
To work around sync being stuck due to hooks or operations taking too long, I've implemented the following:
https://github.com/Sayrus/argo-cd/commit/817bc3449768021d0d5ad7f1ce7510bcd9d2f486
It's equivalent to clicking Terminate after reaching the timeout. This will end up as a Sync Failed, thus blocking self healing from auto syncing the application (Skipping auto-sync: failed previous sync attempt to xxxx). This is probably not the best way to do it but it works.
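As a rough illustration of why a terminated sync stays put, here is a simplified model of the check behind the "Skipping auto-sync" message. It is not Argo CD's actual code: `blocked_by_failed_attempt`, its parameters, and the set of phases treated as failed are assumptions for illustration, and it ignores selfHeal and parameter changes.

```python
from typing import Optional

# Operation phases treated as "failed" in this sketch (an assumption).
FAILED_PHASES = frozenset({"Failed", "Error"})


def blocked_by_failed_attempt(
    last_phase: Optional[str],
    last_revision: Optional[str],
    target_revision: str,
) -> bool:
    """True if auto-sync would be skipped because the same revision already failed."""
    return last_phase in FAILED_PHASES and last_revision == target_revision


# A sync terminated at the timeout ends up failed for the current revision,
# so auto-sync does not immediately retry it.
print(blocked_by_failed_attempt("Failed", "abc123", "abc123"))
# A new commit changes the target revision, so the block no longer applies.
print(blocked_by_failed_attempt("Failed", "abc123", "def456"))
```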
Another way to work around it is to run the following as a CronJob.
from datetime import datetime, timedelta
import logging
import os
import sys

from kubernetes import client, config
import requests

logging.basicConfig(level=os.environ.get('LOG_LEVEL', 'info').upper(), format='[%(levelname)s] %(message)s')

# Timeout in minutes, taken from the first CLI argument (default: 60).
try:
    timeout_minutes = int(sys.argv[1])
except IndexError:
    timeout_minutes = 60

argocd_server = os.environ['ARGOCD_SERVER']
argocd_token = os.environ['ARGOCD_TOKEN']

# List all Application resources in the 'argocd' namespace using in-cluster credentials.
config.load_incluster_config()
api = client.CustomObjectsApi()
apps = api.list_namespaced_custom_object(
    group='argoproj.io',
    version='v1alpha1',
    namespace='argocd',
    plural='applications'
)['items']

# Applications with an operation currently in progress.
syncing_apps = [app for app in apps if app.get('status', {}).get('operationState', {}).get('phase') == 'Running']


def apps_to_timeout():
    """Yield the names of applications that have been syncing longer than the timeout."""
    now = datetime.utcnow()
    logging.debug(f"Time now {now.isoformat()}")
    for app in syncing_apps:
        app_name = app['metadata']['name']
        # startedAt is a UTC timestamp ending in 'Z'; strip it so fromisoformat() accepts it.
        sync_started = datetime.fromisoformat(app['status']['operationState']['startedAt'].removesuffix('Z'))
        logging.debug(f"App '{app_name}' started syncing at {sync_started.isoformat()}")
        if now - sync_started > timedelta(minutes=timeout_minutes):
            yield app_name


apps = list(apps_to_timeout())
logging.info(f"Number of apps syncing longer than timeout of {timeout_minutes} minutes: {len(apps)}")

# Terminate the running operation of each timed-out app through the Argo CD API.
session = requests.session()
session.cookies.set('argocd.token', argocd_token)
for app_name in apps:
    response = session.delete(f"https://{argocd_server}/api/v1/applications/{app_name}/operation")
    if response.ok:
        logging.info(f"Terminated sync operation for '{app_name}'")
    else:
        logging.error(f"Failed to terminate sync operation for '{app_name}': HTTP {response.status_code}")
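If you want to check the selection logic of the script above without a cluster, it can be pulled out into pure functions that take the application dicts, the current time, and the timeout as arguments. This is only a sketch: `sync_started_at` and `timed_out_apps` are hypothetical helper names, and it assumes `startedAt` is always a second-precision UTC timestamp ending in `Z`.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def sync_started_at(app: dict) -> Optional[datetime]:
    """Return the start time of a running operation, or None if none is running."""
    state = app.get("status", {}).get("operationState") or {}
    if state.get("phase") != "Running":
        return None
    started = state.get("startedAt")
    if not started:
        return None
    # Assumed format: "2024-01-01T12:00:00Z" (UTC, second precision).
    parsed = datetime.strptime(started, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def timed_out_apps(apps: List[dict], now: datetime, timeout: timedelta) -> List[str]:
    """Return names of apps whose running operation started more than `timeout` ago."""
    names = []
    for app in apps:
        started = sync_started_at(app)
        if started is not None and now - started > timeout:
            names.append(app["metadata"]["name"])
    return names


sample = [
    {"metadata": {"name": "stuck"},
     "status": {"operationState": {"phase": "Running", "startedAt": "2024-01-01T10:00:00Z"}}},
    {"metadata": {"name": "idle"}, "status": {}},
]
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(timed_out_apps(sample, now, timedelta(minutes=60)))
```

Passing `now` in explicitly keeps the comparison timezone-aware and makes the cut-off easy to test.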
@alexec would you mind taking a look at https://github.com/argoproj/argo-cd/pull/15603?