hash/fnv collision causes the wait container to misbehave when workflows are created with the same name
Checklist
- [ ] Double-checked my configuration.
- [ ] Tested using the latest version.
- [ ] Used the Emissary executor.
Summary
What happened/what you expected to happen?
The wait container should wait on the correct main container.

What version are you running?
v3.3.8
Diagnostics
Every day we process more than 1,000,000 data tasks with Argo Workflows, using a WorkflowTemplate with generateName to create workflows with dynamically generated names. Occasionally we find that a workflow step's pod is NotReady. Checking the wait container's log, we found the node still had a previous pod with the same name whose wait and main containers were both in Exited(0) status. When a new pod with that same name is scheduled onto the node and its wait container starts up, it runs

docker ps --all --no-trunc --format='{{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}}' --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=PodName

to find the main container. Because the new pod's main container has not been created yet, the command matches the old pod's main container, which has already exited with code 0. The new pod's wait container therefore finishes the rest of its logic quickly and exits 0, even though the new pod's main container is created and comes up shortly afterwards.
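The race can be sketched as follows. This is a simplified simulation of the matching logic, not the actual executor code; the container records and pod UIDs are hypothetical, shaped after the logs below. Matching on the pod name alone lets the wait container pick up the previous pod's exited main container.

```go
package main

import "fmt"

// container models the labels the wait container reads from `docker ps`.
type container struct {
	name    string // io.kubernetes.container.name label
	podName string // io.kubernetes.pod.name label
	podUID  string // io.kubernetes.pod.uid label
	status  string // "Up" or "Exited"
}

// findMainByPodName mimics filtering only on io.kubernetes.pod.name:
// the first matching "main" container wins, even if it belongs to a
// previous pod that happened to have the same name.
func findMainByPodName(cs []container, podName string) *container {
	for i := range cs {
		if cs[i].podName == podName && cs[i].name == "main" {
			return &cs[i]
		}
	}
	return nil
}

func main() {
	// Old pod's main container is still on the node; the new pod's main
	// container has not been created yet.
	onNode := []container{
		{name: "main", podName: "step-abc", podUID: "uid-old", status: "Exited"},
		{name: "wait", podName: "step-abc", podUID: "uid-new", status: "Up"},
	}
	m := findMainByPodName(onNode, "step-abc")
	// The new wait container sees an Exited "main" and finishes early.
	fmt.Printf("matched main from pod %s, status %s\n", m.podUID, m.status)
}
```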
# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
# If the workflow's pods have not been created, you can skip the rest of the diagnostics.
# The workflow's pods that are problematic:
kubectl get pod -o yaml -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
# Logs from in your workflow's wait container, something like:
time="2022-07-08T07:44:38.641Z" level=info msg="Starting deadline monitor"
time="2022-07-08T07:44:38.670Z" level=info msg="mapped container name "main" to container ID "798a0c237789c0efc2c214a859d53195b55682c4645cf709ac8cb8f679ee6f9e" (created at 2022-07-06 02:00:25 +0000 UTC, status Exited)"
time="2022-07-08T07:44:38.670Z" level=info msg="mapped container name "wait" to container ID "4c9e4492375a2418e282ec49c5f0889fa4d42d1073f3fb22d74ce442fe2689b2" (created at 2022-07-08 07:44:38 +0000 UTC, status Up)"
time="2022-07-08T07:44:38.670Z" level=info msg="mapped container name "wait" to container ID "237b4d73b9059ed1f2b30192c24b1c3062b70d5d2b3cf90e6c70542d09ea760b" (created at 2022-07-06 02:00:25 +0000 UTC, status Exited)"
time="2022-07-08T07:44:39.665Z" level=info msg="Main container completed"
time="2022-07-08T07:44:39.665Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2022-07-08T07:44:39.665Z" level=info msg="Capturing script exit code"
time="2022-07-08T07:44:39.686Z" level=info msg="Saving logs"
time="2022-07-08T07:44:39.686Z" level=info msg="[docker logs 798a0c237789c0efc2c214a859d53195b55682c4645cf709ac8cb8f679ee6f9e]"
time="2022-07-08T07:44:39.693Z" level=info msg="mapped container name "main" to container ID "48b5a6924c96c6cc42e043535b5fa0d3081135f77615efd98a8aa46c8f45a802" (created at 2022-07-08 07:44:38 +0000 UTC, status Up)"
time="2022-07-08T07:44:39.693Z" level=info msg="mapped container name "wait" to container ID "4c9e4492375a2418e282ec49c5f0889fa4d42d1073f3fb22d74ce442fe2689b2" (created at 2022-07-08 07:44:38 +0000 UTC, status Up)"
time="2022-07-08T07:44:39.992Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: /*/main.log"
time="2022-07-08T07:44:39.992Z" level=info msg="Creating minio client ******** using static credentials"
time="2022-07-08T07:44:39.992Z" level=info msg="Saving from ****************
time="2022-07-08T07:44:40.475Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2022-07-08T07:44:40.475Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2022-07-08T07:44:40.475Z" level=info msg="No output parameters"
time="2022-07-08T07:44:40.475Z" level=info msg="No output artifacts"
time="2022-07-08T07:44:40.475Z" level=info msg="Annotating pod with output"
time="2022-07-08T07:44:40.584Z" level=info msg="Patch pods 200"
time="2022-07-08T07:44:40.619Z" level=info msg="Killing sidecars []"
time="2022-07-08T07:44:40.619Z" level=info msg="Alloc=5666 TotalAlloc=12662 Sys=76625 NumGC=5 Goroutines=12"
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
@alexec the docker ps command should filter on the pod UID rather than the pod name: --filter=label=io.kubernetes.pod.uid=podUid
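The suggestion above can be sketched in the same simplified model (hypothetical records, not the executor's real code): because the pod UID is unique per pod instance even when the pod name is reused, matching on it never resolves to the previous pod's exited main container.

```go
package main

import "fmt"

// container models the labels the wait container reads from `docker ps`.
type container struct {
	name   string // io.kubernetes.container.name label
	podUID string // io.kubernetes.pod.uid label
	status string // "Up" or "Exited"
}

// findMainByPodUID matches on io.kubernetes.pod.uid, which differs for
// every pod instance even when the pod name is reused.
func findMainByPodUID(cs []container, podUID string) *container {
	for i := range cs {
		if cs[i].podUID == podUID && cs[i].name == "main" {
			return &cs[i]
		}
	}
	return nil
}

func main() {
	// Old pod's main container lingers on the node; the new pod's main
	// container has not been created yet.
	onNode := []container{
		{name: "main", podUID: "uid-old", status: "Exited"},
		{name: "wait", podUID: "uid-new", status: "Up"},
	}
	if m := findMainByPodUID(onNode, "uid-new"); m == nil {
		// Instead of matching the stale container, the wait container
		// keeps polling until the new main container appears.
		fmt.Println("main not created yet; keep polling")
	}
}
```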
The Docker executor has been deprecated and is no longer supported. Please try the Emissary executor.
@huangjiasingle let us know if the Emissary executor works for you
@sarabala1979 I can't switch to the Emissary executor in our production environment: every day we process more than 1,000,000 production data tasks with Argo Workflows, and I haven't tested the Emissary executor at that scale. First of all, I need to ensure the stability of production task processing.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
This issue has been closed due to inactivity. Feel free to re-open if you still encounter this issue.