No "Create Job" executed - Possible Race Condition

Open juliev0 opened this issue 5 months ago • 9 comments

Describe the bug

Normally, Numaplane's e2e test passes, but this issue captures an instance in which it did not.

The sequence of steps that Numaplane performed:

  1. Create ISBService
  2. Create Pipeline
  3. After Pipeline is running, update it in a trivial way that doesn't require pausing
  4. After Pipeline is updated, pause it so it can be updated
  5. Update Pipeline's topology from in->out to in->cat->out and keep its desiredPhase=Paused
  6. Once topology change is reconciled, update desiredPhase=Running

Normally, this works fine: step 5 causes both a Creation Job and a Deletion Job to be created:

{"level":"info","ts":"2024-09-23T20:08:28.842186617Z","logger":"numaflow.controller-manager","caller":"pipeline/controller.go:337","msg":"Created a job successfully for ISB creating","namespace":"numaplane-system","pipeline":"test-pipeline-rollout","buffers":["numaplane-system-test-pipeline-rollout-cat-0"],"buckets":["numaplane-system-test-pipeline-rollout-cat-out","numaplane-system-test-pipeline-rollout-in-cat"],"servingStreams":[]}
{"level":"info","ts":"2024-09-23T20:08:28.850466284Z","logger":"numaflow.controller-manager","caller":"pipeline/controller.go:356","msg":"Created ISB Svc deleting job successfully","namespace":"numaplane-system","pipeline":"test-pipeline-rollout","buffers":[],"buckets":["numaplane-system-test-pipeline-rollout-in-out"]}

(above is extracted from a good run)

However, in this run the Creation Job was never created, which left the Daemon Pods unable to get past the init container isbsvc-validatecheck for buffers and buckets.

What I suspect

In the log I see that the in Vertex was successfully updated and the cat Vertex was created, but when the out Vertex was supposed to be updated or created, a Resource Version conflict occurred here. This may have happened before the Creation Job was created. A conflict like this should cause the Numaflow Controller to return and then re-reconcile idempotently.

However, note that the Creation Job depends on newBuffers and newBuckets, which in turn depend on these Vertex values. In this bug, the in and cat Vertices had already been updated successfully on the previous reconciliation. So, on the re-reconcile, the current state of the Vertices already includes the new buffers and buckets, they no longer show up as needing to be added, and the Creation Job is skipped.

I will add logs.

Environment (please complete the following information):

  • Numaflow: quay.io/numaio/numaflow-rc:v0.0.12


juliev0, Sep 23 '24 21:09