Switch the deployment strategy based on external condition (PV type)
Rolling strategy is not useful for deployments with RWO PVs.
Version
oc v1.5.1+7b451fc
kubernetes v1.5.2+43a9be4
features: Basic-Auth
Steps To Reproduce
- Create RWO PV
- Assign the PV to a deployment with the Rolling strategy (see the sketch below)
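For reference, a minimal reproduction sketch with placeholder names (example-app, example-data), assuming a default storage class or a pre-provisioned RWO PV is available:

```
# Create a RWO PVC (on AWS/GCE the bound PV will be ReadWriteOnce-only,
# e.g. an EBS volume or a GCE persistent disk).
oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF

# Mount the claim into a DC that uses the Rolling strategy (the default
# for DCs created by oc new-app).
oc set volume dc/example-app --add --name=data \
  --type=persistentVolumeClaim --claim-name=example-data \
  --mount-path=/var/lib/data

# Trigger a new rollout; the new pod cannot attach the RWO volume while
# the old pod still holds it, so the rollout hangs.
oc rollout latest dc/example-app
```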
Current Result
When new deployment gets triggered, the deployment gets stuck.
Expected Result
The deployment strategy could be switched to Recreate automatically, to save the user from having to figure out the problem and then change the strategy manually.
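For context, the manual workaround today is a one-line strategy switch (the DC name below is a placeholder):

```
# Switch the DC from the Rolling to the Recreate strategy by hand.
oc patch dc/example-app -p '{"spec":{"strategy":{"type":"Recreate"}}}'
```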
Additional Information
N/A
@smarterclayton is it reasonable to emit a warning (event/condition/etc) saying that rolling with RWO will fail to roll? I don't think we should decide the strategy for the user automatically based on "external" inputs (like PVC type).
Also, we could maybe fail the rollout before we actually create the deployer pod, when we know in advance that the rollout will fail (rolling + RWO).
@kargakis @tnozicka FYI
Agreed with @mfojtik - we already do a lot of magic with triggers in the spec. I thought oc status would already emit a warning for rolling deployments with RWO volumes, @marekjelen isn't that the case?
@kargakis how about web console? //cc @jwforres
@mfojtik you meant to ask @jwforres @spadgett ;)
@kargakis i corrected myself ;P
I don't think the console is showing a special warning for this today, but it sounds like something to consider if we know it's always going to fail.
The problem from the perspective of the Overview is that we don't get PVC details at all today. PVCs are relatively stable, so they might be something we could just list, or slow-poll. @spadgett there are probably other things we could be showing relative to PVCs used by deployments, like "this deployment config references PVCs that are not bound"?
@jwforres as far as I remember, when a RWO volume is bound to a DC with the rolling strategy we fail, but the error is hidden in events and it is not really clear ;-) (you get some nasty storage error)...
Maybe time for: [image: gsmarena_001]

:-) "Looks like you have RWO volume with rolling strategy, do you want to change it?"
I don't know that a warning is necessary - it's totally valid to do this for a deployment. In fact, this is the correct way on OpenShift today to run a DB at scale 1 on AWS or GCE. So a warning is a bit much. But it's probably something we should "inform" them of if they have scale > 1, and they might be better off with Recreate for scale 1 (the advantage of Rolling is that the new pod will complete the pull prior to the old pod going down).
@smarterclayton if I have a RWO PV and at the same time Rolling, the deployment always gets stuck, even with replicas=1. E.g. in Online we use the Recreate strategy by default for persistent DBs, so I went to Online Starter and took these screenshots after switching from Recreate to Rolling.
[screenshots: web console views of the stuck deployment after switching from Recreate to Rolling]
Amazon EBS and GCE based PVs only allow RWO mode, so if you set Rolling on a database deployment backed by a PV from one of these technologies, you will never be able to trigger a new deployment (as shown below).
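A rough way to observe the stuck rollout from the CLI (DC name is a placeholder; the attach error only shows up in events):

```
# Trigger a new deployment against a DC that mounts an EBS/GCE-backed RWO PVC.
oc rollout latest dc/example-db

# The replacement pod stays Pending/ContainerCreating; the volume attach
# failure is only visible in the event stream.
oc get pods
oc get events --sort-by='.lastTimestamp'
```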
Something else is wrong, that's not how the system should behave. Rolling deployment marks the old pod as deleted, which allows the cluster to detach the volume. You're likely hitting a bug you should be reporting to @bchilds
@smarterclayton that is interesting :) During a rolling deployment there have to be two pods (even for replicas=1), and these two pods are very likely running on two different machines. A RWO volume can be attached to only one pod, and usually the underlying technology can be attached to only one machine. If I trigger a redeploy and it behaves as you describe, I would lose the PV from the original pod; however, the application in that pod is not aware of that and can still write into the PV that should be there but, per your description, has been detached from the pod.
If Rolling is used with a RWO volume, I have to run into at least one of these two scenarios:
- my original pod does not have the PV anymore, so any writes into that PV are inconsistent, yet the app is not aware of that
- two pods need to write to a single PV that is not designed for multiple concurrent writers, which could lead to FS/storage corruption
> Something else is wrong, that's not how the system should behave. Rolling deployment marks the old pod as deleted, which allows the cluster to detach the volume. You're likely hitting a bug you should be reporting to @bchilds
@smarterclayton when is the old pod marked as deleted? AFAIU until the new version is live and ready we cannot mark the old pod as deleted (and detach the persistent storage), as it will still receive traffic since its endpoint is still listed in the service. Once the new pod is ready, the old pod is marked as terminating and the endpoint is removed from the service, but we still cannot detach the storage because we need to wait for the graceful shutdown; otherwise we could introduce a lot of application errors. And I hope we're not.
@smarterclayton can you please follow up on the issue? thanks
It's unlikely that we will automate any sort of spec mutation to handle this case. oc status should already warn in case you are running a Rolling deployment with a RWO volume. The only thing missing is a console warning?
@kargakis yes
@kargakis @mfojtik could the warning also be shown directly in oc deploy/rollout instead of being hidden in oc status?
Plus I would like to get some clarification on what @smarterclayton says regarding the behaviour of RWO volumes; that is still confusing to me, and I am not the only one who thinks the behaviour is supposed to be different than what @smarterclayton says.
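For reference, a quick way to check what the CLI already reports (a sketch; the exact warning text, if any, depends on the oc version, and the object names are placeholders):

```
# Print warnings/suggestions for the current project; this is where the
# Rolling + RWO hint is expected to surface, per the discussion above.
oc status

# Cross-check manually: the DC's strategy type and the claim's access modes.
oc get dc/example-app -o jsonpath='{.spec.strategy.type}{"\n"}'
oc get pvc/example-data -o jsonpath='{.spec.accessModes}{"\n"}'
```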
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten /remove-lifecycle stale
/lifecycle frozen