numaflow icon indicating copy to clipboard operation
numaflow copied to clipboard

Pipeline stuck on init container unable to create any buckets/buffers on nats ISB service

Open louisnow opened this issue 1 year ago • 6 comments

Describe the bug

Pipeline stuck on init container unable to create any buckets/buffers on nats ISB service.

There are other pipelines that are working fine on the same cluster.

There are no error logs in the ISBSVC pods

To Reproduce

  1. Haven't tried replicating it yet, a delete and re-create of the pipeline fixes the issue.

Expected behavior

Pipeline's should be able to create the bucket/buffers on Nats reliably.

Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • Kubernetes: v1.24
  • Numaflow: v1.1.3

Additional context

Logs from source kafka vertex

{"level":"error","ts":"2024-01-17T08:54:57.957685911Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T08:54:57.957746396Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T08:54:57.957705403Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}
{"level":"error","ts":"2024-01-17T08:57:46.780122031Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T08:57:46.780179312Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T08:57:46.780142889Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}
{"level":"error","ts":"2024-01-17T09:03:17.725616679Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T09:03:17.725680124Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T09:03:17.725633988Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}

Logs from daemon pod of pipeline

"level":"error","ts":"2024-01-17T09:02:36.536576383Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T09:02:36.53660976Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T09:03:17.584331681Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}
{"level":"error","ts":"2024-01-17T09:03:17.584352748Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T09:03:17.584399933Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"error","ts":"2024-01-17T09:04:39.871876453Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T09:04:39.871931651Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T09:04:39.871895049Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}
{"level":"error","ts":"2024-01-17T09:07:23.67368068Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-workflow-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
{"level":"info","ts":"2024-01-17T09:07:23.673737597Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-workflow-alerts"}
{"level":"info","ts":"2024-01-17T09:07:23.673696963Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-workflow-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-workflow-alerts-filter-0\", nats: stream not found"}

Every pod of the pipeline is stuck in the init container step

Message from the maintainers:

Impacted by this bug? Give it a 👍. We often sort issues this way to know what to prioritize.

louisnow avatar Jan 17 '24 10:01 louisnow

Could you please share the ISB (Jetstream) pod logs?

vigith avatar Jan 17 '24 16:01 vigith

Please share the log in the isb creating job pod.

whynowy avatar Jan 17 '24 16:01 whynowy

I managed to recreate a very similar error with the ISB service being temporarily down while creating the pipeline. I brought up the ISB service after I installed the pipeline

Edit: I haven't been able to replicate it since, it recovers properly. Will add more details as I find it.

Here's are the logs of the ISB service, the pipeline that I installed/experimented on has the following prefix atlan-nf-pipe-workflow-alerts

isb_1.log isb_2.log isb_3.log

louisnow avatar Jan 17 '24 16:01 louisnow

if the ISB is down, this behavior is expected.

vigith avatar Jan 17 '24 16:01 vigith

Just FYI, we hadn't seen this with 0.8.1 but we're seeing this more regularly after moving to 1.1.3

The nats server is up without any errros but a random pipeline ends up in this state whenever we update the definition of the pipeline.

The errors are in the init container of every pod in the pipeline

Jan 19, 2024 @ 09:12:08.517	{"level":"error","ts":"2024-01-19T03:42:08.51677933Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:68","msg":"Nats default: disconnected","pipeline":"atlan-nf-pipe-email-alerts","stacktrace":"github.com/numaproj/numaflow/pkg/shared/clients/nats.NewNATSClient.func3\n\t/home/runner/work/numaflow/numaflow/pkg/shared/clients/nats/nats_client.go:68\ngithub.com/nats-io/nats%2ego.(*Conn).close.func1\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:5116\ngithub.com/nats-io/nats%2ego.(*asyncCallbacksHandler).asyncCBDispatcher\n\t/home/runner/go/pkg/mod/github.com/nats-io/[email protected]/nats.go:2901"}
Jan 19, 2024 @ 09:12:08.517	{"level":"info","ts":"2024-01-19T03:42:08.517018894Z","logger":"numaflow.isbsvc-validate","caller":"nats/nats_client.go:62","msg":"Nats default: connection closed","pipeline":"atlan-nf-pipe-email-alerts"}
Jan 19, 2024 @ 09:12:08.517	{"level":"info","ts":"2024-01-19T03:42:08.516815616Z","logger":"numaflow.isbsvc-validate","caller":"commands/isbsvc_validate.go:71","msg":"Buffers, buckets and side inputs store might have not been created yet, will retry if the limit is not reached","pipeline":"atlan-nf-pipe-email-alerts","error":"failed to query information of stream \"default-atlan-nf-pipe-email-alerts-filter-0\", nats: stream not found"}
Jan 19, 2024 @ 09:12:07.374	panic: failed to query information of stream "default-atlan-nf-pipe-email-alerts-filter-0", nats: stream not found
Jan 19, 2024 @ 09:12:07.374	 - 
Jan 19, 2024 @ 09:12:07.374	goroutine 1 [running]:
Jan 19, 2024 @ 09:12:07.374	github.com/numaproj/numaflow/cmd/commands.Execute()
Jan 19, 2024 @ 09:12:07.374		/home/runner/work/numaflow/numaflow/cmd/commands/root.go:33 +0x45
Jan 19, 2024 @ 09:12:07.374	main.main()

I've combed through the ISB logs and there are no errors but also no mention of the affected pipelines either

louisnow avatar Jan 19 '24 04:01 louisnow

@louisnow - It's expected to see more init container error after version v0.8.x - we update the buffer creating logic to wait a bit between each buffer creation to make sure the buffers are more evenly distributed in the jetstream nodes, thus the one time buffer creation job takes longer than before to finish. During that period time, the init container is expected to see those errors if the buffer has not been created. However it will keep retrying, and it retry over the limit, it crashes - again, k8s will restart the container and continue retrying.

In general, the init container errors do not matter, unless the buffers could not be created for different reasons.

whynowy avatar Jan 19 '24 19:01 whynowy