
Workflow deadlocks on mutex in steps template if controller is restarted

Open Gibstick opened this issue 2 years ago • 28 comments

Checklist

  • [x] Double-checked my configuration
  • [x] Tested using the latest version.
  • [x] Used the Emissary executor.

Summary

What happened/what you expected to happen?

I expect my workflow to complete successfully even if the workflow controller is restarted; instead, the workflow deadlocks on the mutex of its steps template. The same issue does not occur if the mutex is on the container template or at the workflow level.

What version are you running? v3.3.4 is what I've been using for this reproduction, but I've also been able to reproduce it in v3.3.3, v3.3.2, and v3.2.7 with a more complicated workflow.

Diagnostics

This is a workflow that reliably reproduces the issue. The steps I use are:

argo submit deadlock.yaml
sleep 5
kubectl delete pod -l app.kubernetes.io/component=workflow-controller
# deadlock.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadlock-test-
spec:
  entrypoint: main-steps

  templates:
    - name: main-steps
      synchronization:
        mutex:
          name: "mutex-{{workflow.name}}"
      steps:
        - - name: main
            template: main-container

    - name: main-container
      container:
        image: "busybox:latest"
        command: ["sh"]
        args: ["-c", "sleep 10; exit 0"]

Pod logs

No pod logs from the main container since it just sleeps and exits.

Wait container logs

time="2022-05-09T12:57:42.216Z" level=info msg="Starting Workflow Executor" executorType=docker version=v3.3.2
time="2022-05-09T12:57:42.217Z" level=info msg="Creating a docker executor"
time="2022-05-09T12:57:42.217Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2022-05-09T12:57:42.217Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=kube-system podName=deadlock-test-bnbjc-938207750 template="{\"name\":\"main-container\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"busybox:latest\",\"command\":[\"sh\"],\"args\":[\"-c\",\"sleep 10; exit 0\"],\"resources\":{}}}" version="&Version{Version:v3.3.2,BuildDate:2022-04-20T21:41:09Z,GitCommit:35492a1700a0f279694cac874b6d9c07a08265d1,GitTag:v3.3.2,GitTreeState:clean,GoVersion:go1.17.9,Compiler:gc,Platform:linux/arm64,}"
time="2022-05-09T12:57:42.217Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:42.218Z" level=info msg="Starting deadline monitor"
time="2022-05-09T12:57:42.231Z" level=info msg="listed containers" containers="map[wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:42.231Z" level=info msg="mapped container name \"wait\" to container ID \"a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522\" (created at 2022-05-09 12:57:42 +0000 UTC, status Up)"
time="2022-05-09T12:57:43.232Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:43.248Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:43.248Z" level=info msg="mapped container name \"main\" to container ID \"0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862\" (created at 2022-05-09 12:57:43 +0000 UTC, status Up)"
time="2022-05-09T12:57:44.220Z" level=info msg="docker wait 0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862"
time="2022-05-09T12:57:44.248Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:44.269Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:45.270Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:45.287Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:46.288Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:46.323Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:47.325Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:47.350Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:48.351Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:48.380Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:49.380Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:49.409Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:50.411Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:50.449Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:51.450Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:51.490Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:52.491Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:52.536Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.289Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:53.310Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Exited {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.310Z" level=info msg="Main container completed"
time="2022-05-09T12:57:53.310Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2022-05-09T12:57:53.310Z" level=info msg="No output parameters"
time="2022-05-09T12:57:53.310Z" level=info msg="No output artifacts"
time="2022-05-09T12:57:53.310Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:53.327Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Exited {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.327Z" level=info msg="Killing sidecars []"
time="2022-05-09T12:57:53.327Z" level=info msg="Alloc=6106 TotalAlloc=10797 Sys=19410 NumGC=3 Goroutines=8"

Workflow controller logs

Before killing (old workflow controller pod): logs-before.txt

After killing (new workflow controller pod): logs-after.txt

Workflow resource

Looking at the workflow after a while, it shows a deadlock in the synchronization section (sorry, the workflow name doesn't match the logs above, but the structure is the same every time). It's holding and waiting on the same lock.

  synchronization:
    mutex:
      holding:
      - holder: deadlock-test-4lp66
        mutex: kube-system/Mutex/mutex-deadlock-test-4lp66
      waiting:
      - holder: kube-system/deadlock-test-4lp66
        mutex: kube-system/Mutex/mutex-deadlock-test-4lp66

workflow-resource.txt (uploaded as .txt because GitHub doesn't allow .yaml attachments)


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

Gibstick avatar May 09 '22 13:05 Gibstick

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jun 12 '22 12:06 stale[bot]

I can still reproduce this issue using the same steps on v3.3.6

Gibstick avatar Jun 14 '22 15:06 Gibstick

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jul 10 '22 06:07 stale[bot]

Still happening on v3.3.8 with the same steps

Gibstick avatar Jul 16 '22 22:07 Gibstick

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jul 31 '22 04:07 stale[bot]

v3.3.8 is still the latest release so I think it's fair to say this bug is still relevant

Gibstick avatar Jul 31 '22 22:07 Gibstick

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Oct 01 '22 06:10 stale[bot]

I'm still able to reproduce this issue on v3.4.1 with the same steps

Gibstick avatar Oct 03 '22 14:10 Gibstick

I had a look at this, and I think this behaviour is fixable if the observation that "the same issue does not occur if the mutex is on the container template or at the workflow level" is true.

@Gibstick would you mind providing an example where it does succeed? I am not entirely familiar with using mutexes in Argo myself.

@sarabala1979 could you please assign this to me?

Thanks

isubasinghe avatar Oct 19 '22 07:10 isubasinghe

Okay, this was quite difficult to debug, but I think I finally found the reason why this doesn't work.

Let's imagine we have a workflow named "xyz". We launch "xyz" and then take a lock on "argo/xyz". Now the controller crashes. On restart, instead of looking for a lock named "argo/xyz", we look for a lock named "argo/xyz/xyz".

There is something going on with how the holderKey is generated.

I believe that fixing this should resolve the bug.

isubasinghe avatar Oct 20 '22 09:10 isubasinghe

In workflow/sync/sync_manager.go, the way mutexes are implemented looks wrong to me. I believe this is what is causing #8684.

This is the issue: on restart, the key used to re-acquire locks is the output of getResourceKey. Please check func (cm *Manager) Initialize(wfs []wfv1.Workflow).

However, if we look at how the Manager attempts to acquire a lock in func (cm *Manager) TryAcquire(wf *wfv1.Workflow, nodeName string, syncLockRef *wfv1.Synchronization) (bool, bool, string, error), we can see that it uses getHolderKey:

This results in the mismatched names described above, which is why I believe this deadlock occurs. Luckily, this is fairly easy to fix.

Could one of the maintainers please confirm which is the desired behaviour here? My belief is that getHolderKey is the correct one.

If they are both correct, it could be the encoding and decoding from the configmap that is wrong.

func getHolderKey(wf *wfv1.Workflow, nodeName string) string {
	if wf == nil {
		return ""
	}
	key := fmt.Sprintf("%s/%s", wf.Namespace, wf.Name)
	if nodeName != "" {
		key = fmt.Sprintf("%s/%s", key, nodeName)
	}
	return key
}

func getResourceKey(namespace, wfName, resourceName string) string {
	resourceKey := fmt.Sprintf("%s/%s", namespace, wfName)
	if resourceName != wfName {
		resourceKey = fmt.Sprintf("%s/%s", resourceKey, resourceName)
	}
	return resourceKey
}
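
To make the mismatch concrete, here is a minimal, self-contained sketch. It is not the controller code itself; it just re-implements the string logic of the two functions above with plain strings and a hypothetical namespace and workflow name, to show how the keys diverge when the mutex sits on a steps entrypoint whose root node name equals the workflow name:

package main

import "fmt"

// holderKey mirrors getHolderKey above: namespace/name, plus the node name when it is set.
func holderKey(namespace, wfName, nodeName string) string {
	key := fmt.Sprintf("%s/%s", namespace, wfName)
	if nodeName != "" {
		key = fmt.Sprintf("%s/%s", key, nodeName)
	}
	return key
}

// resourceKey mirrors getResourceKey above: the resource name is only appended
// when it differs from the workflow name.
func resourceKey(namespace, wfName, resourceName string) string {
	key := fmt.Sprintf("%s/%s", namespace, wfName)
	if resourceName != wfName {
		key = fmt.Sprintf("%s/%s", key, resourceName)
	}
	return key
}

func main() {
	ns, wf := "argo", "xyz" // hypothetical namespace and workflow name

	// Mutex on the entrypoint steps template: the holding node is the root
	// node, whose name equals the workflow name, so the two forms diverge.
	fmt.Println(holderKey(ns, wf, wf))   // argo/xyz/xyz (form used via getHolderKey in TryAcquire)
	fmt.Println(resourceKey(ns, wf, wf)) // argo/xyz     (form used via getResourceKey in Initialize on restart, per the comment above)

	// Workflow-level mutex: there is no node name, so the holder key collapses
	// to namespace/name and matches the restart key, which would explain why
	// the workflow-level workaround reported below does not deadlock.
	fmt.Println(holderKey(ns, wf, "")) // argo/xyz
}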

isubasinghe avatar Oct 21 '22 03:10 isubasinghe

@Gibstick would you mind providing an example where it does succeed? I am not entirely familiar with using mutexes in argo my self.

@isubasinghe I didn't test every scenario but the workaround we adopted was to move the synchronization to the top-level of the workflow spec:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadlock-test-
spec:
  entrypoint: main-steps
  # mutex moved up here
  synchronization:
    mutex:
      name: "mutex-{{workflow.name}}"

  templates:
    - name: main-steps
      steps:
        - - name: main
            template: main-container

    - name: main-container
      container:
        image: "busybox:latest"
        command: ["sh"]
        args: ["-c", "sleep 10; exit 0"]

Gibstick avatar Oct 22 '22 01:10 Gibstick

Thanks for that @Gibstick I will check why this works as well

isubasinghe avatar Oct 24 '22 23:10 isubasinghe

@Gibstick can you try on v3.4.3?

sarabala1979 avatar Oct 31 '22 17:10 sarabala1979

@sarabala1979 mentioned today that latest should fix this.

@isubasinghe and @Gibstick - can you please try the original workflow again on latest?

caelan-io avatar Oct 31 '22 17:10 caelan-io

I still see the deadlock on v3.4.3. I'm running image quay.io/argoproj/workflow-controller@sha256:156ed07e2b1ac9f23252fc4bc69a81daf88312dc3dfadb7c072980fcffdbf7f4 and I've deployed Argo Workflows with helm install my-argo-workflows argo/argo-workflows --version 0.20.5 --set images.tag="v3.4.3". Using the same steps but with a sleep of 5s, I still observe the same behaviour. The workflow completes the step but gets stuck, deadlocked on the mutex:

  synchronization:
    mutex:
      holding:
      - holder: deadlock-test-kch7v
        mutex: default/Mutex/mutex-deadlock-test-kch7v
      waiting:
      - holder: default/deadlock-test-kch7v
        mutex: default/Mutex/mutex-deadlock-test-kch7v

Gibstick avatar Oct 31 '22 21:10 Gibstick

@caelan-io I can confirm this issue is still a problem unfortunately.

I believe ~~my original comment is still correct~~ something else is going on. The issue I described previously is now fixed in :latest; something else is causing the deadlock now. Looking into it again, will update when done.

isubasinghe avatar Nov 01 '22 01:11 isubasinghe

https://github.com/isubasinghe/argo-workflows/tree/fix-deadlock-sync-manager

Please check before.txt and after.txt to see the logs after the deadlock.

After the controller restart, the mutex is acquired by argo/deadlock-test-sn8p5, as seen here: https://github.com/isubasinghe/argo-workflows/blob/fix-deadlock-sync-manager/after.txt#LL122C101-L122C125. Yet the holderKey is argo/deadlock-test-sn8p5/deadlock-test-sn8p5, as seen here: https://github.com/isubasinghe/argo-workflows/blob/fix-deadlock-sync-manager/after.txt#LL181C22-L181C67. This is what causes the deadlock.

isubasinghe avatar Nov 10 '22 04:11 isubasinghe

@isubasinghe @Gibstick Can you try it on v3.4.4?

sarabala1979 avatar Jan 05 '23 18:01 sarabala1979

Unfortunately, I'm still able to reproduce it with the same steps. I made sure I was running the correct image version by running kubectl describe on the pod and verifying that the image quay.io/argoproj/workflow-controller@sha256:d06860f1394a94ac3ff8401126ef32ba28915aa6c3c982c7e607ea0b4dadb696 was indeed v3.4.4. The workflow YAML still shows the same deadlock.

Gibstick avatar Jan 05 '23 22:01 Gibstick

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jan 21 '23 20:01 stale[bot]

As far as I can tell this bug is still relevant

Gibstick avatar Jan 22 '23 13:01 Gibstick

Update: we are working on a larger code refactor (#10191) that will resolve this issue, but it was delayed by the holiday season. We will update again asap.

caelan-io avatar Jan 24 '23 00:01 caelan-io

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Mar 25 '23 11:03 stale[bot]

I think we are also hitting this issue - we have just upgraded to the latest Argo, v3.4.9, if that's helpful. Right now our synchronization semaphore is configured like this:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: template-name
spec:
  templates:
    - name: template-name
      synchronization:
        semaphore:
          configMapKeyRef:
            name: template-synchronization-config
            key: template-key
      container: # rest of the template config here.

I'm trying to decipher the comments in this issue to see if there's a different way we can set up this synchronization so that it isn't affected while the refactor is going on.

Is there an easy way to clear out the semaphore locks on Argo's side? Restarting the workflow controller doesn't seem to be a viable way to recover.

Bwvolleyball avatar Aug 24 '23 17:08 Bwvolleyball

We also had this on v3.5.2... the solution was to restart the controller. It does mean we have a major issue with workflows that use the mutex feature. Would a semaphore with a value of 1 mitigate that?

yossicohn avatar Nov 30 '23 06:11 yossicohn

@Bwvolleyball and @yossicohn

This is a pretty bad issue caused by the key used for the lock being different on acquire and release.

We are still wondering which options offer the fewest compromises. If you have opinions on this, feel free to share them here: #10267, because IMO that is the core issue.

I am thinking that a potential solution is to just dump this kind of data to the SQL database when that is available.

isubasinghe avatar Nov 30 '23 08:11 isubasinghe