argo-workflows
Workflow deadlocks on mutex in steps template if controller is restarted
Checklist
- [x] Double-checked my configuration
- [x] Tested using the latest version.
- [x] Used the Emissary executor.
Summary
What happened/what you expected to happen?
I expect my workflow to complete successfully even if the workflows controller is restarted. The same issue does not occur if the mutex is on the container template or at the workflow level.
What version are you running? v3.3.4 is what I've been using for this reproduction, but I've also been able to reproduce it in v3.3.3, v3.3.2, and v3.2.7 with a more complicated workflow.
Diagnostics
This is a workflow that reliably reproduces the issue. The steps I use are:
```shell
argo submit deadlock.yaml
sleep 5
kubectl delete pod -l app.kubernetes.io/component=workflow-controller
```
```yaml
# deadlock.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadlock-test-
spec:
  entrypoint: main-steps
  templates:
    - name: main-steps
      synchronization:
        mutex:
          name: "mutex-{{workflow.name}}"
      steps:
        - - name: main
            template: main-container
    - name: main-container
      container:
        image: "busybox:latest"
        command: ["sh"]
        args: ["-c", "sleep 10; exit 0"]
```
Pod logs
No pod logs from the main container, since it just sleeps and exits.
Wait container logs
time="2022-05-09T12:57:42.216Z" level=info msg="Starting Workflow Executor" executorType=docker version=v3.3.2
time="2022-05-09T12:57:42.217Z" level=info msg="Creating a docker executor"
time="2022-05-09T12:57:42.217Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2022-05-09T12:57:42.217Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=kube-system podName=deadlock-test-bnbjc-938207750 template="{\"name\":\"main-container\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"busybox:latest\",\"command\":[\"sh\"],\"args\":[\"-c\",\"sleep 10; exit 0\"],\"resources\":{}}}" version="&Version{Version:v3.3.2,BuildDate:2022-04-20T21:41:09Z,GitCommit:35492a1700a0f279694cac874b6d9c07a08265d1,GitTag:v3.3.2,GitTreeState:clean,GoVersion:go1.17.9,Compiler:gc,Platform:linux/arm64,}"
time="2022-05-09T12:57:42.217Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:42.218Z" level=info msg="Starting deadline monitor"
time="2022-05-09T12:57:42.231Z" level=info msg="listed containers" containers="map[wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:42.231Z" level=info msg="mapped container name \"wait\" to container ID \"a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522\" (created at 2022-05-09 12:57:42 +0000 UTC, status Up)"
time="2022-05-09T12:57:43.232Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:43.248Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:43.248Z" level=info msg="mapped container name \"main\" to container ID \"0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862\" (created at 2022-05-09 12:57:43 +0000 UTC, status Up)"
time="2022-05-09T12:57:44.220Z" level=info msg="docker wait 0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862"
time="2022-05-09T12:57:44.248Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:44.269Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:45.270Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:45.287Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:46.288Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:46.323Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:47.325Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:47.350Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:48.351Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:48.380Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:49.380Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:49.409Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:50.411Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:50.449Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:51.450Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:51.490Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:52.491Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:52.536Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Up {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.289Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:53.310Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Exited {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.310Z" level=info msg="Main container completed"
time="2022-05-09T12:57:53.310Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2022-05-09T12:57:53.310Z" level=info msg="No output parameters"
time="2022-05-09T12:57:53.310Z" level=info msg="No output artifacts"
time="2022-05-09T12:57:53.310Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kube-system --filter=label=io.kubernetes.pod.name=deadlock-test-bnbjc-938207750"
time="2022-05-09T12:57:53.327Z" level=info msg="listed containers" containers="map[main:{0d69cf26ba8b09e5eec34de3f40471e175ade02a064113bdabfb44e85d063862 Exited {0 63787697863 <nil>}} wait:{a468147328ce593a751c8d3d6fb67c1b104090c2231e5690b843387b48c4c522 Up {0 63787697862 <nil>}}]"
time="2022-05-09T12:57:53.327Z" level=info msg="Killing sidecars []"
time="2022-05-09T12:57:53.327Z" level=info msg="Alloc=6106 TotalAlloc=10797 Sys=19410 NumGC=3 Goroutines=8"
Workflow controller logs
Before killing (old workflow controller pod): logs-before.txt
After killing (new workflow controller pod): logs-after.txt
Workflow resource
Looking at the workflow after a while, it shows a deadlock in the synchronization section (sorry this doesn't match the logs above, but it's the same structure every time). It's holding and waiting on the same lock.
```yaml
synchronization:
  mutex:
    holding:
      - holder: deadlock-test-4lp66
        mutex: kube-system/Mutex/mutex-deadlock-test-4lp66
    waiting:
      - holder: kube-system/deadlock-test-4lp66
        mutex: kube-system/Mutex/mutex-deadlock-test-4lp66
```
workflow-resource.txt (uploaded as .txt because GitHub doesn't allow .yaml attachments)
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
I can still reproduce this issue using the same steps on v3.3.6
Still happening on v3.3.8 with the same steps
v3.3.8 is still the latest release so I think it's fair to say this bug is still relevant
I'm still able to reproduce this issue on v3.4.1 with the same steps
I had a look at this, and I think this behaviour is fixable, assuming the statement "The same issue does not occur if the mutex is on the container template or at the workflow level." is true.
@Gibstick would you mind providing an example where it does succeed? I am not entirely familiar with using mutexes in Argo myself.
@sarabala1979 could you please assign this to me?
Thanks
Okay, this was quite difficult to debug, but I think I finally found the reason why this doesn't work.
Let's imagine we have a workflow named "xyz". We launch "xyz", which takes a lock on "argo/xyz". Now the controller crashes. On restart, instead of looking for a lock named "argo/xyz", we look for a lock named "argo/xyz/xyz".
Something is going wrong in how the holderKey is generated.
I believe that fixing this should fix this bug.
In workflow/sync/sync_manager.go, the way mutexes are implemented looks wrong to me. I believe this is what is causing #8684.
This is the issue:
On restart, locks are re-acquired using the key produced by getResourceKey; see `func (cm *Manager) Initialize(wfs []wfv1.Workflow)`.
However, when the Manager later attempts to acquire a lock in `func (cm *Manager) TryAcquire(wf *wfv1.Workflow, nodeName string, syncLockRef *wfv1.Synchronization) (bool, bool, string, error)`, it uses getHolderKey.
This produces the mismatched names described above, which is why I believe this deadlock occurs. Luckily, this should be fairly easy to fix.
Could one of the maintainers please confirm which is the desired behaviour here? My belief is that getHolderKey is the correct one.
If both are correct, the bug could instead be in the encoding and decoding of this state to and from the ConfigMap.
```go
// getHolderKey identifies the lock holder: namespace/workflowName, plus the
// node name when the lock is held by a specific node.
func getHolderKey(wf *wfv1.Workflow, nodeName string) string {
	if wf == nil {
		return ""
	}
	key := fmt.Sprintf("%s/%s", wf.Namespace, wf.Name)
	if nodeName != "" {
		key = fmt.Sprintf("%s/%s", key, nodeName)
	}
	return key
}

// getResourceKey builds namespace/workflowName, appending resourceName only
// when it differs from the workflow name.
func getResourceKey(namespace, wfName, resourceName string) string {
	resourceKey := fmt.Sprintf("%s/%s", namespace, wfName)
	if resourceName != wfName {
		resourceKey = fmt.Sprintf("%s/%s", resourceKey, resourceName)
	}
	return resourceKey
}
```
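To make the mismatch concrete, here is a minimal, self-contained sketch. The helpers below are simplified, string-only stand-ins for the two functions above (the real getHolderKey takes a *wfv1.Workflow), so this is illustrative only, not the actual Argo code:

```go
package main

import "fmt"

// holderKey mirrors getHolderKey: namespace/wfName, plus nodeName if set.
func holderKey(namespace, wfName, nodeName string) string {
	key := fmt.Sprintf("%s/%s", namespace, wfName)
	if nodeName != "" {
		key = fmt.Sprintf("%s/%s", key, nodeName)
	}
	return key
}

// resourceKey mirrors getResourceKey: namespace/wfName, plus resourceName
// only when it differs from the workflow name.
func resourceKey(namespace, wfName, resourceName string) string {
	key := fmt.Sprintf("%s/%s", namespace, wfName)
	if resourceName != wfName {
		key = fmt.Sprintf("%s/%s", key, resourceName)
	}
	return key
}

func main() {
	// Entrypoint template mutex: the root node shares the workflow's name,
	// so the two builders disagree -- the "argo/xyz" vs "argo/xyz/xyz"
	// mismatch described above.
	fmt.Println(holderKey("argo", "xyz", "xyz"))   // argo/xyz/xyz
	fmt.Println(resourceKey("argo", "xyz", "xyz")) // argo/xyz

	// Workflow-level mutex: no node name, and the two keys agree, which may
	// be why the workflow-level workaround is unaffected.
	fmt.Println(holderKey("argo", "xyz", "")) // argo/xyz
}
```

If this reading is right, it would also explain why moving the mutex to the workflow level avoids the deadlock: with an empty node name the two key builders produce identical strings.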
> @Gibstick would you mind providing an example where it does succeed? I am not entirely familiar with using mutexes in Argo myself.

@isubasinghe I didn't test every scenario, but the workaround we adopted was to move the synchronization to the top level of the workflow spec:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadlock-test-
spec:
  entrypoint: main-steps
  # mutex moved up here
  synchronization:
    mutex:
      name: "mutex-{{workflow.name}}"
  templates:
    - name: main-steps
      steps:
        - - name: main
            template: main-container
    - name: main-container
      container:
        image: "busybox:latest"
        command: ["sh"]
        args: ["-c", "sleep 10; exit 0"]
```
Thanks for that @Gibstick, I will check why this works as well.
@Gibstick can you try on v3.4.3?
@sarabala1979 mentioned today that `latest` should fix this.
@isubasinghe and @Gibstick - can you please try the original workflow again on `latest`?
I still see the deadlock on v3.4.3. I'm running image `quay.io/argoproj/workflow-controller@sha256:156ed07e2b1ac9f23252fc4bc69a81daf88312dc3dfadb7c072980fcffdbf7f4`, deployed with `helm install my-argo-workflows argo/argo-workflows --version 0.20.5 --set images.tag="v3.4.3"`. Using the same steps but with a sleep of 5s, I still observe the same behaviour. The workflow completes the step but gets stuck, deadlocked on the mutex:
```yaml
synchronization:
  mutex:
    holding:
      - holder: deadlock-test-kch7v
        mutex: default/Mutex/mutex-deadlock-test-kch7v
    waiting:
      - holder: default/deadlock-test-kch7v
        mutex: default/Mutex/mutex-deadlock-test-kch7v
```
@caelan-io I can confirm this issue is still a problem unfortunately.
I believe ~~my original comment is still correct~~ something else is going on. The issue I described previously is now fixed in :latest, so something else must be causing it now. Looking into it again; will update when done.
https://github.com/isubasinghe/argo-workflows/tree/fix-deadlock-sync-manager
Please check before.txt and after.txt in that branch to see the logs around the deadlock. After the controller restart, the mutex is acquired by argo/deadlock-test-sn8p5, as seen here: https://github.com/isubasinghe/argo-workflows/blob/fix-deadlock-sync-manager/after.txt#LL122C101-L122C125. Yet the holderKey is argo/deadlock-test-sn8p5/deadlock-test-sn8p5, as seen here: https://github.com/isubasinghe/argo-workflows/blob/fix-deadlock-sync-manager/after.txt#LL181C22-L181C67. This is what causes the deadlock.
@isubasinghe @Gibstick Can you try it on v3.4.4?
Unfortunately, I'm still able to reproduce it with the same steps. I made sure I was running the correct image version by `kubectl describe`-ing the pod and verifying that the image `quay.io/argoproj/workflow-controller@sha256:d06860f1394a94ac3ff8401126ef32ba28915aa6c3c982c7e607ea0b4dadb696` was indeed v3.4.4. The workflow YAML still shows the same deadlock.
As far as I can tell this bug is still relevant
Update: we are working on a larger code refactor (#10191) that will resolve this issue, but it was delayed by the holiday season. We will update again asap.
I think we are also hitting this issue - we have just upgraded to the latest Argo, 3.4.9, if that's helpful. Right now our synchronization semaphore is configured like this:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: template-name
spec:
  templates:
    - name: template-name
      synchronization:
        semaphore:
          configMapKeyRef:
            name: template-synchronization-config
            key: template-key
      container: # rest of the template config here
```
I'm trying to decipher the comments in this issue to see if there's a different way we can have this synchronization in place that won't be affected while the refactor is going on.
Is there an easy way to clear out the semaphore locks on Argo's side? Restarting the workflow-controller doesn't seem to be a viable way to recover.
We also hit this on v3.5.2; the solution was to restart the controller.
It does mean we have a major issue with workflows using the Mutex feature.
Would a Semaphore with a value of 1 mitigate this?
@Bwvolleyball and @yossicohn
This is a pretty bad issue, caused by the key used for the lock being different on acquire and release.
We are still wondering which options offer the fewest compromises. If you have opinions on this, feel free to raise them here: #10267, because IMO that is the core issue.
I am thinking that a potential solution is to just dump this kind of data to the SQL database once that is available.