
skaffold dev should not fail on first failed build / deploy

Open balopat opened this issue 5 years ago • 8 comments

I propose that starting skaffold dev should start the dev loop, even if the first build or deployment fails.

I see no additional value in stopping at the first failed build. In the end, what will the developer do to improve the situation and "finally start 'dev'-ing"? They will fix the errors and rerun skaffold dev. Why can't they do that while it's still running, and get feedback faster?

I argue that we can remove this artificial need for an extra skaffold dev command. If the user wants to stop skaffold dev-ing because they realize it will take longer to fix things up, they can always Ctrl+C out of skaffold dev.

balopat avatar May 12 '20 18:05 balopat

I intend to work on this this week for the fixit effort - please comment if you feel strongly against it.

balopat avatar May 12 '20 18:05 balopat

More context on this: #516. Also, failure might be due to infrastructure issues, not just application-related ones - for example, Docker is not installed, or the user is pointed at the wrong kubecontext. The original motivation was around that: we didn't stop, just warned on build errors, and that created all sorts of messes, but today we do stop at failed builds.

balopat avatar May 12 '20 20:05 balopat

As I'm thinking about this, maybe it is a good idea to fail the first iteration on infrastructure errors. This would require us to start differentiating more intelligently between error types, namely infra vs application errors - which is actually something @tejal29 started already.

balopat avatar May 13 '20 23:05 balopat

I realized that this is a bit more subtle. One issue we'll have to resolve if we go down this route: currently skaffold assumes that everything was built & deployed at least once as a baseline, and every subsequent file change triggers only an incremental change on top of that.

Now, if we want to change this behavior, that means that we'll need to keep track of which artifacts haven't been built yet. Skaffold shouldn't deploy before all artifacts are built.
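The bookkeeping described above could be sketched as follows. This is an illustrative Go snippet, not Skaffold's implementation - the tracker type and method names are invented for the example:

```go
package main

import "fmt"

// artifactTracker records which artifacts have been built at least once,
// so the dev loop never deploys before a full baseline exists.
type artifactTracker struct {
	built map[string]bool
}

func newTracker(artifacts []string) *artifactTracker {
	t := &artifactTracker{built: make(map[string]bool)}
	for _, a := range artifacts {
		t.built[a] = false
	}
	return t
}

// markBuilt records a successful build of one artifact.
func (t *artifactTracker) markBuilt(name string) { t.built[name] = true }

// readyToDeploy is true only once every artifact has been built.
func (t *artifactTracker) readyToDeploy() bool {
	for _, ok := range t.built {
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	t := newTracker([]string{"frontend", "backend"})
	t.markBuilt("frontend")        // say backend's first build failed
	fmt.Println(t.readyToDeploy()) // false: backend was never built
	t.markBuilt("backend")         // a file change retriggers it; it succeeds
	fmt.Println(t.readyToDeploy()) // true: safe to deploy the baseline
}
```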

balopat avatar May 14 '20 23:05 balopat

@balopat there might be ImagePullBackOff errors too.

I have to run skaffold dev --no-prune=true --cleanup=false when developing locally.

In the prototype stage I intend to keep images local and not push to any registry, so there's no need to prune or clean up.

Maybe just set the default values for --no-prune and --cleanup to true and false respectively,

or save these configs to ~/.skaffold/config like this:

global:
  survey:
    last-prompted: "2020-10-19T20:10:03+08:00"
kubeContexts: []
dev:
  no-prune: true
  cleanup: false

dfang avatar Oct 27 '20 01:10 dfang

From #4953, we should provide a flag to configure this behavior.

tejal29 avatar Oct 27 '20 21:10 tejal29

+1 to configurable flag

We are currently running into this issue as well. Many of our pods depend on Postgres DB being ready, and were designed to fail so they can be restarted. The current skaffold dev behavior of terminating all pods on failure makes it unusable for us.

legopin avatar Mar 10 '21 06:03 legopin

We're also in strong need of this feature. It would make test-driven development much easier.

fabifrank avatar Oct 14 '21 18:10 fabifrank

Recently, a new feature enabled via the --tolerate-failures-until-deadline flag (or deploy.tolerateFailuresUntilDeadline: true in skaffold.yaml) allows dev (as well as run, apply, and deploy) to not fail when a deployment encounters an error, and instead keep polling for success until statusCheckDeadlineSeconds or the k8s object controller's own timeout (eg: deployments -> progressDeadlineSeconds) is reached. Using this flag should at least help solve the first-failed-deploy part of this issue for users here.
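As a sketch, the skaffold.yaml side of this might look like the following. The tolerateFailuresUntilDeadline field is the one named in the comment above; the apiVersion, artifact name, and deadline value are illustrative assumptions, not taken from this thread:

```yaml
apiVersion: skaffold/v4beta1   # illustrative; use the version your skaffold supports
kind: Config
build:
  artifacts:
    - image: my-app            # hypothetical image name
deploy:
  kubectl: {}
  # Keep polling for deployment success instead of failing dev on the first error:
  tolerateFailuresUntilDeadline: true
  # Upper bound on how long the status check keeps polling (illustrative value):
  statusCheckDeadlineSeconds: 300
```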

aaron-prindle avatar Nov 09 '22 19:11 aaron-prindle