Switch the deployment strategy based on external condition (PV type)

Open marekjelen opened this issue 8 years ago • 20 comments

The Rolling strategy is not useful for deployments with RWO PVs.

Version
oc v1.5.1+7b451fc
kubernetes v1.5.2+43a9be4
features: Basic-Auth
Steps To Reproduce
  1. Create RWO PV
  2. Assign the PV to a deployment with Rolling strategy
Current Result

When new deployment gets triggered, the deployment gets stuck.

Expected Result

The deployment strategy could be switched to Recreate to save the user from having to figure out the problem and then change the strategy manually.

Additional Information

N/A

marekjelen avatar Jul 12 '17 16:07 marekjelen

@smarterclayton is it reasonable to emit a warning (event/condition/etc) saying that rolling with RWO will fail to roll? I don't think we should decide the strategy for the user automatically based on "external" inputs (like PVC type).

Also we can maybe fail the rollout before we actually create the deployer pod when we know in advance the rollout will fail (rolling + rwo).

@kargakis @tnozicka FYI
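
A minimal Python sketch of the pre-rollout check suggested above. It assumes the DeploymentConfig and its PVCs are available as plain dicts shaped like their API objects (`spec.strategy.type`, `spec.template.spec.volumes[].persistentVolumeClaim.claimName`, `spec.accessModes`) with defaults already filled in by the server. The function name `rolling_rwo_warnings` is hypothetical; it is not part of `oc` or of the origin code base.

```python
from typing import Dict, List


def rolling_rwo_warnings(dc: dict, pvcs_by_name: Dict[str, dict]) -> List[str]:
    """Return one warning per ReadWriteOnce-only claim used by a Rolling DC.

    Hypothetical helper for illustration only; this is not how
    `oc status` or the deployer is actually implemented.
    """
    spec = dc.get("spec", {})
    # Only the Rolling strategy runs old and new pods side by side.
    if spec.get("strategy", {}).get("type") != "Rolling":
        return []

    warnings = []
    volumes = spec.get("template", {}).get("spec", {}).get("volumes", [])
    for volume in volumes:
        claim = volume.get("persistentVolumeClaim")
        if not claim:
            continue  # not a PVC-backed volume
        name = claim["claimName"]
        pvc = pvcs_by_name.get(name)
        if pvc is None:
            continue  # unknown claim: a different problem, not reported here
        modes = pvc.get("spec", {}).get("accessModes", [])
        if modes == ["ReadWriteOnce"]:
            warnings.append(
                f"claim {name!r} is ReadWriteOnce but the strategy is Rolling; "
                "consider Recreate"
            )
    return warnings
```

Whether such a result should surface as an event, a condition, or a hard failure before the deployer pod is created is the open question in this thread; the sketch only shows the detection step.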

mfojtik avatar Jul 13 '17 10:07 mfojtik

Agreed with @mfojtik - we already do a lot of magic with triggers in the spec. I thought oc status would already emit a warning for rolling deployments with RWO volumes, @marekjelen isn't that the case?

0xmichalis avatar Jul 13 '17 13:07 0xmichalis

@kargakis how about web console? //cc @jwforres

mfojtik avatar Jul 13 '17 13:07 mfojtik

@mfojtik you meant to ask @jwforres @spadgett ;)

0xmichalis avatar Jul 13 '17 13:07 0xmichalis

@kargakis i corrected myself ;P

mfojtik avatar Jul 13 '17 13:07 mfojtik

I don't think the console is showing a special warning for this today, but it sounds like something to consider if we know it's always going to fail.

jwforres avatar Jul 13 '17 14:07 jwforres

The problem from the perspective of the Overview is that we don't get PVC details at all today. PVCs are relatively stable, might be something we could just list, or slow poll. @spadgett probably other things we could be showing relative to PVCs used by deployments, like this deployment config references PVCs that are not bound?

jwforres avatar Jul 13 '17 14:07 jwforres

@jwforres as far as i remember when the RWO volumes are bound to a DC with rolling strategy we fail but the error is hidden in events and it is not really clear ;-) (you get some nasty storage error)...

Maybe time for: gsmarena_001

:-) "Looks like you have RWO volume with rolling strategy, do you want to change it?"

mfojtik avatar Jul 13 '17 14:07 mfojtik

I don't know that a warning is necessary - it's totally valid to do this for a deployment. In fact, this is the correct way on OpenShift today to run a DB at scale 1 on AWS or GCE. So a warning is a bit much. But it's probably something we should "inform" them of if they have scale > 1, and they might be better off with Recreate for scale 1 (the advantage of Rolling is that the new pod will complete the pull prior to the old pod going down).
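
The distinction drawn in this comment (inform at scale > 1, only hint at Recreate for scale 1) could be sketched in Python as below. The function name `rwo_rolling_notice`, the level names, and the message wording are all hypothetical; nothing here reflects existing `oc` or console behaviour.

```python
from typing import Optional, Tuple


def rwo_rolling_notice(
    strategy: str, replicas: int, uses_rwo: bool
) -> Optional[Tuple[str, str]]:
    """Return (level, message) for a Rolling DC with an RWO volume, else None.

    Hypothetical illustration of the policy described in the comment above.
    """
    if strategy != "Rolling" or not uses_rwo:
        return None
    if replicas > 1:
        # More than one replica has to share a single-writer volume.
        return ("info", "Rolling with an RWO volume and more than one replica")
    # Scale 1 is treated as valid here, so this is a hint, not a warning.
    return ("hint", "Recreate may be a better fit for a single replica with an RWO volume")
```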

smarterclayton avatar Jul 13 '17 14:07 smarterclayton

@smarterclayton if I have an RWO PV together with the Rolling strategy, the deployment always gets stuck, even with replicas=1. E.g. in Online we deploy persistent DBs with the Recreate strategy by default, so I went to Online Starter and took these screenshots after switching from Recreate to Rolling.

screen shot 2017-07-13 at 16 49 23

screen shot 2017-07-13 at 16 49 38

screen shot 2017-07-13 at 16 55 48

screen shot 2017-07-13 at 16 56 02

screen shot 2017-07-13 at 16 59 46

Amazon EBS and GCE based PVs only allow RWO mode, so if you set Rolling on a database deployment with a PV backed by these technologies, you will never be able to trigger a new deployment.

marekjelen avatar Jul 13 '17 15:07 marekjelen

Something else is wrong, that's not how the system should behave. Rolling deployment marks the old pod as deleted, which allows the cluster to detach the volume. You're likely hitting a bug you should be reporting to @bchilds

smarterclayton avatar Jul 13 '17 15:07 smarterclayton

@smarterclayton that is interesting :) With the Rolling strategy there have to be two pods (for replicas=1), and these two pods are very likely running on two different machines. An RWO volume can be attached to only one pod, and usually the underlying technology can be attached to only one machine. If I trigger a redeploy and it behaves as you describe, I will lose the PV from the original pod; however, the application in that pod is not aware of that and can still write to the PV, which should be there but is not, since per your description it has been detached from the pod.

If Rolling is used with an RWO volume, I have to run into at least one of these two scenarios:

  • my original pod does not have PV anymore and so any writes into that PV are inconsistent, however the app is not aware of that
  • two pods need to write to single PV that is not designed for multiple concurrent writes, which could lead into FS/storage corruption

marekjelen avatar Jul 13 '17 15:07 marekjelen

Something else is wrong, that's not how the system should behave. Rolling deployment marks the old pod as deleted, which allows the cluster to detach the volume. You're likely hitting a bug you should be reporting to @bchilds

@smarterclayton when is the old pod marked as deleted? AFAIU, until the new version is live and ready we cannot mark the old pod as deleted (and detach the persistent storage), as it will still receive traffic, since the endpoint will be listed in the service. Once the new pod is ready, the old pod is marked as terminating and the endpoint is removed from the service, but we still cannot detach the storage since we need to wait for the graceful shutdown; otherwise we could be introducing a lot of application errors. And I hope we're not.

jorgemoralespou avatar Jul 13 '17 15:07 jorgemoralespou

@smarterclayton can you please follow up on the issue? thanks

marekjelen avatar Jul 20 '17 09:07 marekjelen

It's unlikely that we will automate any sort of spec mutation to handle this case. oc status should already warn in case you are running a Rolling deployment with an RWO volume. The only thing missing is a console warning?

0xmichalis avatar Jul 20 '17 09:07 0xmichalis

@kargakis yes

mfojtik avatar Jul 20 '17 09:07 mfojtik

@kargakis @mfojtik could the warning also be shown directly in oc deploy/rollout instead of being hidden in oc status?

Plus, I would like some clarification on what @smarterclayton says regarding the behaviour of RWO volumes. That is still confusing to me, and I am not the only one who thinks the behaviour is supposed to be different than what @smarterclayton says.

marekjelen avatar Jul 20 '17 10:07 marekjelen

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Feb 15 '18 04:02 openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot avatar Mar 18 '18 05:03 openshift-bot

/lifecycle frozen

jorgemoralespou avatar Mar 21 '18 14:03 jorgemoralespou