numaflow
numaflow copied to clipboard
Pipeline stuck on init container unable to create any buckets/buffers on nats ISB service
Describe the bug
Pipeline stuck on init container unable to create any buckets/buffers on nats ISB service.
There are other pipelines that are working fine on the same cluster.
There are no error logs in the ISBSVC pods
To Reproduce
- Haven't tried replicating it yet, a delete and re-create of the pipeline fixes the issue.
Expected behavior
Pipeline's should be able to create the bucket/buffers on Nats reliably.
Screenshots If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
- Kubernetes: v1.24
- Numaflow: v1.1.3
Additional context
Logs from source kafka vertex
{"level":"error","ts":"2024-01-17T08:54:57.957685911Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T08:54:57.957746396Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T08:54:57.957705403Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}
{"level":"error","ts":"2024-01-17T08:57:46.780122031Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T08:57:46.780179312Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T08:57:46.780142889Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}
{"level":"error","ts":"2024-01-17T09:03:17.725616679Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T09:03:17.725680124Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T09:03:17.725633988Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}
Logs from daemon pod of pipeline
"level":"error","ts":"2024-01-17T09:02:36.536576383Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T09:02:36.53660976Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T09:03:17.584331681Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}
{"level":"error","ts":"2024-01-17T09:03:17.584352748Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T09:03:17.584399933Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"error","ts":"2024-01-17T09:04:39.871876453Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T09:04:39.871931651Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T09:04:39.871895049Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}
{"level":"error","ts":"2024-01-17T09:07:23.67368068Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T09:07:23.673737597Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T09:07:23.673696963Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}
Every pod of the pipeline is stuck in the init container step
Message from the maintainers:
Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.
Could you please share the ISB (Jetstream) pod logs?
Please share the log in the isb creating job pod.
I managed to recreate a very similar error with the ISB service being temporarily down while creating the pipeline. I brought up the ISB service after I installed the pipeline
Edit: I haven't been able to replicate it since, it recovers properly. Will add more details as I find it.
Here's are the logs of the ISB service, the pipeline that I installed/experimented on has the following prefix atlan-nf-pipe-workflow-alerts
if the ISB is down, this behavior is expected.
Just FYI, we hadn't seen this with 0.8.1 but we're seeing this more regularly after moving to 1.1.3
The nats server is up without any errros but a random pipeline ends up in this state whenever we update the definition of the pipeline.
The errors are in the init container of every pod in the pipeline
Jan 19, 2024 @ 09:12:08.517 {"level":"error","ts":"2024-01-19T03:42:08.51677933Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-email-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
Jan 19, 2024 @ 09:12:08.517 {"level":"info","ts":"2024-01-19T03:42:08.517018894Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-email-alerts"}
Jan 19, 2024 @ 09:12:08.517 {"level":"info","ts":"2024-01-19T03:42:08.516815616Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-email-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-email-alerts-filter-0\", nats: stream not found"}
Jan 19, 2024 @ 09:12:07.374 panic: failed to query information of stream "default-atlan-nf-pipe-email-alerts-filter-0", nats: stream not found
Jan 19, 2024 @ 09:12:07.374 -
Jan 19, 2024 @ 09:12:07.374 goroutine 1 [running]:
Jan 19, 2024 @ 09:12:07.374 github.com/numaproj/numaflow/cmd/commands.Execute()
Jan 19, 2024 @ 09:12:07.374 /home/runner/work/numaflow/numaflow/cmd/commands/root.go:33 +0x45
Jan 19, 2024 @ 09:12:07.374 main.main()
I've combed through the ISB logs and there are no errors but also no mention of the affected pipelines either
@louisnow - It's expected to see more init container error after version v0.8.x - we update the buffer creating logic to wait a bit between each buffer creation to make sure the buffers are more evenly distributed in the jetstream nodes, thus the one time buffer creation job takes longer than before to finish. During that period time, the init container is expected to see those errors if the buffer has not been created. However it will keep retrying, and it retry over the limit, it crashes - again, k8s will restart the container and continue retrying.
In general, the init container errors do not matter, unless the buffers could not be created for different reasons.