Timeout for sync operations

Open dominykas opened this issue 3 years ago • 6 comments

Summary

At the moment, if for whatever reason the sync process gets stuck (e.g. because some resource fails to start up properly and keeps on retrying), the sync will never complete and will keep on "Syncing".

There should be an option to add a timeout, after which the sync process would terminate. Depending on selfHeal rules, etc, there may be a need to automatically retry, or alternatively, the application should just stay in the failed state until manually resolved.

I did my best to search for similar requests; aside from a brief note in #1886, I couldn't find anything. Sorry if I missed it.

Motivation

At the moment, we've set up alerting for sync operations that are taking too long, which at least notifies someone to look at things and usually means a manual intervention.

When an application is in a "Syncing" state, manual intervention becomes rather tricky: one cannot delete resources to get them recreated (especially when things are stuck in some sync wave), perform a partial sync, etc.

Moreover, simply hitting "Terminate" is not always sufficient if the application has autosync enabled, as it would just retry, putting it into a forever "Syncing" state. Disabling autosync in some cases might also be problematic and require multiple steps, because it might be set from a parent application - which means that the parent application autosync also needs to be disabled (so that it does not just resync and re-enable the autosync).

Proposal

syncPolicy:
    syncTimeout: 600 # seconds, default: unlimited
    onSyncTimeout: "fail" # or "retry" (?), or "waitForUpdate" (?)

Some of the things that might need consideration:

  • Should selfHeal just retry? Or should that be configurable? The previous sync might not have completed in full, so hooks/postsync actions might not have executed.
  • Should new commits result in a new sync operation? Same as above, essentially. Arguably, new commits could be the fix.
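
To make the intended semantics concrete, here is a minimal sketch of how a controller could evaluate the two proposed fields. Everything in it is hypothetical: `syncTimeout` and `onSyncTimeout` are only the names proposed above, and `timeout_action` is an illustrative helper, not anything that exists in Argo CD.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical values for the proposed onSyncTimeout field (not an Argo CD API).
ON_TIMEOUT_ACTIONS = {"fail", "retry", "waitForUpdate"}


def timeout_action(started_at: datetime, now: datetime,
                   sync_timeout: Optional[int],
                   on_sync_timeout: str = "fail") -> Optional[str]:
    """Return the action to take, or None if the sync may keep running."""
    if on_sync_timeout not in ON_TIMEOUT_ACTIONS:
        raise ValueError(f"unknown onSyncTimeout value: {on_sync_timeout!r}")
    # An omitted syncTimeout means "unlimited", matching the proposed default.
    if sync_timeout is None:
        return None
    if now - started_at > timedelta(seconds=sync_timeout):
        return on_sync_timeout
    return None


started = datetime(2021, 4, 19, 8, 0, tzinfo=timezone.utc)
print(timeout_action(started, started + timedelta(seconds=601), 600))
print(timeout_action(started, started + timedelta(seconds=30), 600))
print(timeout_action(started, started + timedelta(hours=5), None))
```

The open questions above (selfHeal, new commits) would decide what "retry" and "waitForUpdate" actually do once this function returns them; the sketch only covers the decision of when the timeout fires.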

dominykas avatar Apr 19 '21 08:04 dominykas

I would like to work on this issue.

RaviHari avatar Oct 20 '21 17:10 RaviHari

Is there any update on that?

hanzala1234 avatar Apr 13 '22 10:04 hanzala1234

@RaviHari is there any update on this? I've run into this issue because a pre-hook failed and the app is now locked in a permanent "Syncing" state.

prima101112 avatar Apr 20 '22 09:04 prima101112

@prima101112 and @hanzala1234 sorry for the delay. I will get started on this and keep you posted in this thread.

RaviHari avatar Apr 20 '22 10:04 RaviHari

@RaviHari Did you get round to starting on this?

LS80 avatar Jun 09 '22 14:06 LS80

+1

grezar avatar Jul 29 '22 08:07 grezar

+1

yabeenico avatar Aug 12 '22 04:08 yabeenico

Moreover, simply hitting "Terminate" is not always sufficient

I've also seen "Terminate" simply cause the sync operation to get stuck in "Terminating." This was in an app with ~1k resources.

If Ravi or anyone else puts up a PR, I'd be happy to review.

crenshaw-dev avatar Aug 12 '22 14:08 crenshaw-dev

+1

pritam-acquia avatar Oct 28 '22 11:10 pritam-acquia

Looking forward to this feature too. I have a lot of applications getting stuck, and a timeout would be great so that they don't block other, unrelated resources.

mhonorio avatar Nov 29 '22 15:11 mhonorio

It seems like @RaviHari has lost interest in this, or at least has stopped responding. We'd still appreciate that feature very much (we're using the app-of-apps pattern and sometimes it just gets stuck, and a timeout would really help). Any chance someone else can implement this?

neiser avatar Jan 03 '23 10:01 neiser

To work around sync being stuck due to hooks or operations taking too long, I've implemented the following:

https://github.com/Sayrus/argo-cd/commit/817bc3449768021d0d5ad7f1ce7510bcd9d2f486

It's equivalent to clicking Terminate after reaching the timeout. The operation ends up as Sync Failed, which blocks self-healing from auto-syncing the application (Skipping auto-sync: failed previous sync attempt to xxxx). This is probably not the best way to do it, but it works.

Sayrus avatar Feb 21 '23 15:02 Sayrus

Another way to work around it is to run the following as a CronJob.

from datetime import datetime, timedelta, timezone
import logging
import os
import sys

from kubernetes import client, config
import requests

logging.basicConfig(level=os.environ.get('LOG_LEVEL', 'info').upper(), format='[%(levelname)s] %(message)s')

# Timeout in minutes, taken from the first CLI argument (default: 60).
try:
    timeout_minutes = int(sys.argv[1])
except IndexError:
    timeout_minutes = 60

argocd_server = os.environ['ARGOCD_SERVER']
argocd_token = os.environ['ARGOCD_TOKEN']

# Runs inside the cluster, using the pod's service account.
config.load_incluster_config()

api = client.CustomObjectsApi()

# List all Applications in the 'argocd' namespace.
apps = api.list_namespaced_custom_object(
    group='argoproj.io',
    version='v1alpha1',
    namespace='argocd',
    plural='applications'
)['items']

# Keep only the apps whose current operation is still in phase 'Running'.
syncing_apps = [app for app in apps if app.get('status', {}).get('operationState', {}).get('phase') == 'Running']

def apps_to_timeout():
    """Yield the names of apps that have been syncing longer than the timeout."""
    now = datetime.now(timezone.utc)
    logging.debug(f"Time now {now.isoformat()}")

    for app in syncing_apps:
        app_name = app['metadata']['name']
        # startedAt is a UTC timestamp ending in 'Z'; make the offset explicit for fromisoformat().
        sync_started = datetime.fromisoformat(app['status']['operationState']['startedAt'].replace('Z', '+00:00'))
        logging.debug(f"App '{app_name}' started syncing at {sync_started.isoformat()}")

        if now - sync_started > timedelta(minutes=timeout_minutes):
            yield app_name

timed_out_apps = list(apps_to_timeout())
logging.info(f"Number of apps syncing longer than timeout of {timeout_minutes} minutes: {len(timed_out_apps)}")

# Authenticate against the Argo CD API with the token cookie.
session = requests.Session()
session.cookies.set('argocd.token', argocd_token)

for app_name in timed_out_apps:
    # Same effect as clicking "Terminate" on the running operation.
    response = session.delete(f"https://{argocd_server}/api/v1/applications/{app_name}/operation", timeout=30)
    if response.ok:
        logging.info(f"Terminated sync operation for '{app_name}'")
    else:
        logging.error(f"Failed to terminate sync operation for '{app_name}': HTTP {response.status_code}")

LS80 avatar Oct 09 '23 18:10 LS80

@alexec would you mind taking a look at https://github.com/argoproj/argo-cd/pull/15603?

aslafy-z avatar Oct 09 '23 21:10 aslafy-z