numaflow
No "Create Job" executed - Possible Race Condition
**Describe the bug**
Normally, Numaplane's e2e test passes, but this issue captures an instance in which it did not.
The sequence of events that Numaplane performed:
1. Create ISBService
2. Create Pipeline
3. After the Pipeline is running, update it in a trivial way that doesn't require pausing
4. After the Pipeline is updated, pause it so it can be updated
5. Update the Pipeline's topology from `in->out` to `in->cat->out` and keep its `desiredPhase=Paused`
6. Once the topology change is reconciled, update `desiredPhase=Running`
Normally, this works fine: step 5 causes both a Creation Job and a Deletion Job to be created:
{"level":"info","ts":"2024-09-23T20:08:28.842186617Z","logger":"numaflow.controller-manager","caller":"pipeline/controller.go:337","msg":"Created a job successfully for ISB creating","namespace":"numaplane-system","pipeline":"test-pipeline-rollout","buffers":["numaplane-system-test-pipeline-rollout-cat-0"],"buckets":["numaplane-system-test-pipeline-rollout-cat-out","numaplane-system-test-pipeline-rollout-in-cat"],"servingStreams":[]}
{"level":"info","ts":"2024-09-23T20:08:28.850466284Z","logger":"numaflow.controller-manager","caller":"pipeline/controller.go:356","msg":"Created ISB Svc deleting job successfully","namespace":"numaplane-system","pipeline":"test-pipeline-rollout","buffers":[],"buckets":["numaplane-system-test-pipeline-rollout-in-out"]}
(above is extracted from a good run)
However, in this run the Creation Job was not executed, which left the Daemon Pods unable to get past the init container `isbsvc-validate`'s check for buffers and buckets.
**What I suspect**
In the log I see that the `in` Vertex was successfully updated and the `cat` Vertex was created, but when the `out` Vertex was supposed to be created, a Resource Version conflict occurred. This may have happened before the Creation Job was created. The conflict should cause the Numaflow Controller to return and then re-reconcile idempotently.
However, note that the Creation Job depends on `newBuffers` and `newBuckets`, which in turn depend on the existing Vertex values. In this bug, the `in` and `cat` Vertices were updated successfully during the previous reconciliation. So, on the retry, the current state of the Vertices no longer reflects the new buffers and new buckets that still need to be added.
I will add logs.
**Environment (please complete the following information):**
- Numaflow: quay.io/numaio/numaflow-rc:v0.0.12