skaffold
Statefulsets need a moment to stabilize / not abort skaffold dev during launch
Expected behavior
skaffold dev starts and tolerates transient failures while the workloads stabilize
Actual behavior
Skaffold sees an error and terminates everything
Information
- Skaffold version: 1.35
Steps to reproduce the behavior
- Have a statefulset (I can paste the contents of nats/stan if requested)
- skaffold dev
- statefulset/stan: container stan is backing off waiting to restart
- pod/stan-0: container stan is backing off waiting to restart
[stan-0 stan] [1] 2021/11/26 09:11:31.878330 [INF] STREAM: Starting nats-streaming-server[stan] version 0.16.2
[stan-0 stan] [1] 2021/11/26 09:11:31.878355 [INF] STREAM: ServerID: DDROjri7DXdxYBnHLqvfWF
[stan-0 stan] [1] 2021/11/26 09:11:31.878356 [INF] STREAM: Go version: go1.11.13
[stan-0 stan] [1] 2021/11/26 09:11:31.878357 [INF] STREAM: Git commit: [910d6e1]
[stan-0 stan] [1] 2021/11/26 09:11:31.881090 [INF] STREAM: Shutting down.
[stan-0 stan] [1] 2021/11/26 09:11:31.881121 [FTL] STREAM: Failed to start: nats: no servers available for connection
- pod/stan-0: container stan is backing off waiting to restart
- statefulset/stan failed. Error: container stan is backing off waiting to restart.
Because stan depends on nats, this is normal behaviour: the container simply dies and tries again until nats is reachable, so it is 'impossible' to fix on the workload side.
Downgrading to a version below 1.35 fixes the issue immediately.
Related to #4158, #6205 and in particular #6828
@gsquared94 can you add any information here on whether this is intended behaviour from https://github.com/GoogleContainerTools/skaffold/pull/6828, and what short-term/long-term fixes there might be for this issue?
@DGollings thanks for the issue. We added a status check for StatefulSets recently.
Looks like the ask is to ignore this failure.
Does skaffold dev exit on the first occurrence of this failure?
If not, have you tried bumping the value of the statusCheckDeadlineSeconds config field?
Looks like the ask is to ignore this failure. Does skaffold dev exit on the first occurrence of this failure?
yes
If not, have you tried bumping the value of the statusCheckDeadlineSeconds config field?
It was already set to 600 seconds, but skaffold dev still dies instantly.
Assigning this to @aaron-prindle. They are looking into it.
We made a fix for Autopilot clusters, which was released in v2.0.0-beta2. Note: it is not available in Cloud Code.
The issue can be fixed by adding the --tolerate-failures-until-deadline flag when running skaffold dev; see the implementation in #8047.
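For anyone landing here later, a minimal sketch of how the flag is passed. The flag name comes from this thread. The SKAFFOLD_ARGS variable and the PATH guard are only illustrative assumptions, added so the snippet is safe to paste into a setup script. It assumes an installed Skaffold release that already contains #8047.

```shell
#!/bin/sh
# Arguments for the dev loop. The flag is the one named in this thread;
# it needs a Skaffold release that includes #8047.
SKAFFOLD_ARGS="dev --tolerate-failures-until-deadline"

if command -v skaffold >/dev/null 2>&1; then
  # Left unquoted on purpose so the shell splits the string into two arguments.
  skaffold $SKAFFOLD_ARGS
else
  # Hypothetical fallback: only report what would have been run.
  echo "skaffold not found on PATH; would run: skaffold $SKAFFOLD_ARGS" >&2
fi
```

With the flag set, a container that is backing off (such as stan waiting for nats) should no longer abort the session on the first failed status check. As far as I understand it, failures are tolerated until the status check deadline, so a sufficiently large statusCheckDeadlineSeconds (the reporter had 600) still matters.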