Implement atomic deployments for Nulecule application. Fixes #421

Open rtnpro opened this issue 10 years ago • 21 comments

When there's an error while running a Nulecule application, roll back the changes by stopping the application.

This pull request only takes care of invoking stop on the main Nulecule application; it does not refactor the internal implementation of stop.

rtnpro avatar Dec 15 '15 07:12 rtnpro
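A minimal sketch of the approach described above, assuming hypothetical `NuleculeManager`/`Nulecule` method names (the actual atomicapp internals may differ): wrap the run in a try/except and stop the whole application if anything fails.

```python
# Sketch only: class and method names are illustrative, not the real atomicapp API.

class NuleculeManager(object):
    def run(self, provider, dryrun=False):
        try:
            # Deploy every component of the Nulecule application.
            self.nulecule.run(provider, dryrun)
        except Exception:
            # A component failed part-way through: roll back by stopping
            # the whole application, then re-raise so the caller still
            # sees the failure.
            self.nulecule.stop(provider, dryrun)
            raise
```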

Is this approach better than doing the undeploy inside of the provider files? So inside of the deploy() function for each provider, if error, then call undeploy?

Right now you have it all the way up at the NuleculeManager level.. that is ok, just wondering where the best place to do this is.

dustymabe avatar Dec 15 '15 14:12 dustymabe

Also, if we go with this approach then it affects all providers. Do you mind testing it with all providers and check to make sure it works?

dustymabe avatar Dec 15 '15 14:12 dustymabe

@dustymabe

Is this approach better than doing the undeploy inside of the provider files? So inside of the deploy() function for each provider, if error, then call undeploy?

I had thought of that as well. The reason I did not go ahead with it is that we want to undeploy the entire Nulecule application, not just the component that failed to deploy.

Right now you have it all the way up at the NuleculeManager level.. that is ok, just wondering where the best place to do this is.

Because of the above explanation, this is intended.

Also, if we go with this approach then it affects all providers. Do you mind testing it with all providers and check to make sure it works?

:+1: with this. I need to test it on the other providers as well.

rtnpro avatar Dec 15 '15 15:12 rtnpro

Let me know when other providers have been tested. Also we need to fix unit tests.

We will postpone the merge of this until after tomorrow's release.

dustymabe avatar Dec 15 '15 17:12 dustymabe

So.. the provider undeploy() functions really need to have "ignore_errors" set as well. Essentially, where we have it now is outside of undeploy(). Take Kubernetes for example: a "component" could have multiple artifacts, which means multiple calls to the Kubernetes API. If the first call fails then subsequent artifacts won't get removed.

undeploy() needs to know to ignore errors and attempt to remove each artifact.

dustymabe avatar Dec 18 '15 15:12 dustymabe
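A hedged sketch of what an error-tolerant undeploy() could look like; the provider class, `self.artifacts` list, and `_delete_artifact()` helper are hypothetical, not the actual atomicapp provider code. The point is simply that one failed deletion should not prevent the remaining artifacts from being removed.

```python
import logging

logger = logging.getLogger(__name__)

class KubernetesProvider(object):
    def undeploy(self, ignore_errors=True):
        # Attempt to remove every artifact, even if an earlier one fails.
        for artifact in self.artifacts:
            try:
                self._delete_artifact(artifact)  # one API call per artifact
            except Exception as err:
                if not ignore_errors:
                    raise
                logger.warning("Failed to remove %s: %s; continuing",
                               artifact, err)
```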

@dustymabe

undeploy() needs to know to ignore errors and attempt to remove each artifact.

Good point :+1:

rtnpro avatar Dec 18 '15 16:12 rtnpro

@rtnpro now that stop for openshift is in can you update undeploy() for it as well?

dustymabe avatar Jan 04 '16 20:01 dustymabe

ping @rtnpro ^^

dustymabe avatar Jan 06 '16 03:01 dustymabe

@dustymabe aye!

rtnpro avatar Jan 06 '16 04:01 rtnpro

@dustymabe Pushed changes for atomic deployment on Openshift provider as well.

rtnpro avatar Jan 11 '16 10:01 rtnpro

@rtnpro I believe everything LGTM. Since we do have 4 providers and there probably isn't much code, can you go ahead and do this for marathon as well?

I should be able to test this stuff out tomorrow on docker/kubernetes. I hopefully will get an openshift setup working again tomorrow to test on that as well. In the meantime, can you confirm that you have tested that this works (a failure on run yields a stop)?

dustymabe avatar Jan 12 '16 05:01 dustymabe

OK, after running through some testing on this I'm not convinced this is the best path to take. We might be able to go with this, but I think we can do better.

The sticking point I am on now is the state of the system when we start to deploy; after a failed deployment the system should be in the same state it was in when we started. One example of how to make our code "fail" and roll back is to have a half-deployed application: part of the application will successfully deploy until it gets to the point where it tries to deploy an artifact that already exists. That will fail and then we will "roll back" by undeploying the application.

The problem with this is that the undeploy will remove the service that existed before we ran our code, since it removes all artifacts. Is this OK?

I think a better approach may be to, on deploy, run through all artifacts first to see if they already exist. If all artifacts pass the "exists" test then we can run through them all and create them.

Considering everything I have written, the change I am proposing (don't start the deploy until we've checked that no artifacts already exist) could actually be done in a separate PR. That PR would take care of the failure case.

dustymabe avatar Jan 12 '16 20:01 dustymabe
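A minimal sketch of the pre-check idea above, assuming a hypothetical `artifact_exists()` helper (not an existing atomicapp function): verify that none of the artifacts exist before creating any of them, so a rollback never has to remove resources that predate this run.

```python
class Provider(object):
    def deploy(self):
        # Refuse to start if any artifact is already present, so a
        # later rollback cannot remove pre-existing resources.
        existing = [a for a in self.artifacts if self.artifact_exists(a)]
        if existing:
            raise RuntimeError(
                "Refusing to deploy, artifacts already exist: %s" % existing)
        for artifact in self.artifacts:
            self._create_artifact(artifact)
```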

I think a better approach may be to, on deploy, run through all artifacts first to see if they already exist. If all artifacts pass the "exists" test then we can run through them all and create them.

Considering everything I have written, the change I am proposing (don't start the deploy until we've checked that no artifacts already exist) could actually be done in a separate PR. That PR would take care of the failure case.

Opened https://github.com/projectatomic/atomicapp/issues/501 for this.

dustymabe avatar Jan 12 '16 21:01 dustymabe

@rtnpro can you address the remaining items in this PR? Don't worry about #501 for now.

dustymabe avatar Jan 15 '16 05:01 dustymabe

I tested this with a couple of my examples on Marathon and OpenShift. Everything looks fine :+1:

~~The only thing missing is returning an error after the rollback, as @dustymabe mentioned; otherwise~~ LGTM

kadel avatar Jan 19 '16 15:01 kadel

Let's postpone merging this PR until after tomorrow's release.

dustymabe avatar Jan 19 '16 16:01 dustymabe

heads up that all tests pass for this :+1:

although this PR needs rebasing now due to the time that has passed

cdrage avatar Feb 08 '16 15:02 cdrage

@cdrage @dustymabe rebased!

rtnpro avatar Feb 09 '16 15:02 rtnpro

Other than my one comment, LGTM and tests have passed!

cdrage avatar Feb 10 '16 13:02 cdrage

Postponing until after GA as we are going to limit our changes to bugfixes.

dustymabe avatar Feb 11 '16 16:02 dustymabe

Just an update: we are still planning on implementing this, although the focus at the moment has been on converting the docker and k8s providers to their respective API implementations.

cdrage avatar Apr 19 '16 20:04 cdrage