Statefulsets need a moment to stabilize / not abort skaffold dev during launch

Open DGollings opened this issue 3 years ago • 5 comments

Expected behavior

skaffold dev starts and tolerates a few initial failures while the workloads stabilize

Actual behavior

Skaffold sees an error and terminates everything

Information

  • Skaffold version: 1.35

Steps to reproduce the behavior

  1. Have a statefulset (I can paste the contents of nats/stan if requested)
  2. skaffold dev
  • statefulset/stan: container stan is backing off waiting to restart
    • pod/stan-0: container stan is backing off waiting to restart

      [stan-0 stan] [1] 2021/11/26 09:11:31.878330 [INF] STREAM: Starting nats-streaming-server[stan] version 0.16.2
      [stan-0 stan] [1] 2021/11/26 09:11:31.878355 [INF] STREAM: ServerID: DDROjri7DXdxYBnHLqvfWF
      [stan-0 stan] [1] 2021/11/26 09:11:31.878356 [INF] STREAM: Go version: go1.11.13
      [stan-0 stan] [1] 2021/11/26 09:11:31.878357 [INF] STREAM: Git commit: [910d6e1]
      [stan-0 stan] [1] 2021/11/26 09:11:31.881090 [INF] STREAM: Shutting down.
      [stan-0 stan] [1] 2021/11/26 09:11:31.881121 [FTL] STREAM: Failed to start: nats: no servers available for connection

  • statefulset/stan failed. Error: container stan is backing off waiting to restart.

Since stan depends on nats, this is perfectly normal behaviour: the container simply dies and tries again. That makes it 'impossible' to fix on my side.

Downgrading to < 1.35 instantly fixes the issue

Related to #4158, #6205 and in particular #6828

DGollings avatar Nov 26 '21 12:11 DGollings

@gsquared94 can you add any information here on whether this is intended behaviour from https://github.com/GoogleContainerTools/skaffold/pull/6828 and what possible short-term/long-term fixes there might be for this issue?

aaron-prindle avatar Nov 29 '21 17:11 aaron-prindle

@DGollings thanks for the issue. We recently added a status check for StatefulSets. It looks like the ask is to ignore this failure.
Does skaffold dev exit on the first occurrence of this failure? If not, have you tried using the statusCheckDeadlineSeconds config field and bumping the value?

tejal29 avatar Jan 10 '22 19:01 tejal29

Looks like the ask is to ignore this failure. Does skaffold dev exit on the first occurrence of this failure?

yes

If not, have you tried using the statusCheckDeadlineSeconds config field and bumping the value?

was already 600 secs, but instantly dies

DGollings avatar Jan 17 '22 14:01 DGollings

Assigning this to @aaron-prindle. They are looking into it.

tejal29 avatar May 09 '22 18:05 tejal29

We made a fix for Autopilot clusters, which was released in v2.0.0-beta2. Note: it is not available in Cloud Code.

tejal29 avatar Sep 02 '22 16:09 tejal29

The issue can be fixed by adding the --tolerate-failures-until-deadline flag when running skaffold dev; implemented in #8047.

ericzzzzzzz avatar Nov 09 '22 19:11 ericzzzzzzz
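
For readers comparing the two behaviours discussed in this thread, here is a rough Python sketch of the difference between a fail-fast status check and one that tolerates failures until the deadline. It is only a conceptual model, not Skaffold's actual implementation: the function names, the "ready"/"pending"/"failed" status strings, and the simulated StatefulSet are all assumptions made up for illustration.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], str],
    deadline_seconds: float,
    tolerate_failures: bool,
    poll_interval: float = 0.0,
) -> bool:
    """Poll `check` until it reports "ready" or the deadline passes.

    `check` returns "ready", "pending", or "failed". These status names
    are hypothetical and exist only for this sketch.
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline_seconds:
        status = check()
        if status == "ready":
            return True
        if status == "failed" and not tolerate_failures:
            # Fail-fast: the first reported failure aborts the wait,
            # even though the deadline has not been reached.
            return False
        # Either still pending, or a failure we chose to tolerate.
        time.sleep(poll_interval)
    # Deadline reached without the workload becoming ready.
    return False


def flaky_statefulset(failures_before_ready: int) -> Callable[[], str]:
    """Simulate a pod that crashes a few times before its dependency is up."""
    remaining = [failures_before_ready]

    def check() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            return "failed"
        return "ready"

    return check


if __name__ == "__main__":
    # A pod that fails twice, like stan restarting until nats is reachable.
    print(wait_until_ready(flaky_statefulset(2), 5.0, tolerate_failures=False))
    print(wait_until_ready(flaky_statefulset(2), 5.0, tolerate_failures=True))
```

In this model the first call gives up on the first failure no matter how long the deadline is, which matches the "was already 600 secs, but instantly dies" report above. The second call keeps polling and succeeds once the simulated pod stops crashing.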