controller
Scaling an app down while a build is running leads to unpredictable results
From @jeff-lee on November 5, 2015 22:47
I'm running into an issue in v1.12.0 where scaling down an app while a build is running can result in either:
a) The new containers getting shut down and the build hanging
b) Zero running containers
I started a new cluster and scaled the example-go app up to 3.
$ fleetctl list-units|grep jefftest
jefftest_v74.web.1.service a5ea5dc1.../10.10.17.144 active running
jefftest_v74.web.2.service 6b548706.../10.10.19.9 active running
jefftest_v74.web.3.service 6b548706.../10.10.19.9 active running
I then started a build (v75) and scaled the app down from 3 to 2 when the node started pulling the new containers down.
$ deis ps:scale web=2 -a jefftest
Scaling processes... but first, coffee!
done in 5s
=== jefftest Processes
--- web:
web.1 up (v74)
web.2 up (v74)
At this point, the v75 container gets stopped and the build (with HEALTHCHECK_URL set) hangs.
Thu Nov 5 22:06:45 UTC 2015
cda30e1fda8e 10.10.16.243:5000/jefftest:v75 "/runner/init start 1 seconds ago Up Less than a second 0.0.0.0:32901->5000/tcp jefftest_v75.web.1
2598c80e0985 10.10.16.243:5000/jefftest:v74 "/runner/init start About a minute ago Up About a minute 0.0.0.0:32900->5000/tcp jefftest_v74.web.3
9d4614e6fb3f 10.10.16.243:5000/jefftest:v74 "/runner/init start 2 minutes ago Up 2 minutes 0.0.0.0:32899->5000/tcp jefftest_v74.web.2
Thu Nov 5 22:06:46 UTC 2015
9d4614e6fb3f 10.10.16.243:5000/jefftest:v74 "/runner/init start 2 minutes ago Up 2 minutes 0.0.0.0:32899->5000/tcp jefftest_v74.web.2
Thu Nov 5 22:06:47 UTC 2015
9d4614e6fb3f 10.10.16.243:5000/jefftest:v74 "/runner/init start 2 minutes ago Up 2 minutes 0.0.0.0:32899->5000/tcp jefftest_v74.web.2
Thu Nov 5 22:06:48 UTC 2015
9d4614e6fb3f 10.10.16.243:5000/jefftest:v74 "/runner/init start 2 minutes ago Up 2 minutes 0.0.0.0:32899->5000/tcp jefftest_v74.web.2
Thu Nov 5 22:06:49 UTC 2015
9d4614e6fb3f 10.10.16.243:5000/jefftest:v74 "/runner/init start 2 minutes ago Up 2 minutes 0.0.0.0:32899->5000/tcp jefftest_v74.web.2
I have also seen all of the containers get stopped when scaling from 3 to 2, though so far I have only been able to reproduce this when HEALTHCHECK_URL is not set.
14:42:50 [ds12] - /Users/jefflee $ deis ps:scale web=2 -a jefftest
Scaling processes... but first, coffee!
done in 6s
=== jefftest Processes

14:44:49 [ds12] - /Users/jefflee $ deis info -a jefftest
=== jefftest Application
updated: 2015-11-05T22:44:49UTC
uuid: 20949ab0-ffbd-4442-b490-f7b01951976b
created: 2015-11-05T18:43:43UTC
url: jefftest.ds12.therealreal.com
owner: jefflee
id: jefftest
=== jefftest Processes
=== jefftest Domains
Copied from original issue: deis/deis#4719
From @carmstrong on November 5, 2015 23:37
> Though I have only been able to reproduce this when HEALTHCHECK_URL is not set so far.
Have you seen any issues with HEALTHCHECK_URL set? We strongly recommend using this as a best practice for app deploys, since by default we will consider all running containers to be live and healthy.
From @jeff-lee on November 6, 2015 0:9
I haven't been able to reproduce the container=0 issue yet when HEALTHCHECK_URL is set, but the shutdown of the new containers and hang of the builder does still happen.
I have also seen 502 Bad Gateway and 404's when it's set if I scale down late enough in the deploy process.
From @carmstrong on November 6, 2015 0:11
> I have also seen 502 Bad Gateway and 404's when it's set if I scale down late enough in the deploy process.
In general, I don't know if we make any guarantees when scaling an app up/down while a deploy is already running - the controller is executing logic as to how many containers to scale up/down based on the current number it sees.
Is there a use case for this, @jeff-lee, or are you just doing resiliency testing?
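To make the concern above concrete, here is a toy sketch of how a scale-down that acts on the container count it currently sees could stop a container the deploy is still waiting on. It is not the Deis controller's code: `containers_to_stop` and the "keep the first N" rule are assumptions made only for illustration, chosen because they happen to reproduce the outcome in the `docker ps` output earlier in the thread.

```python
# Toy model only. `containers_to_stop` is a hypothetical helper and the
# "keep the first `target` containers" rule is an assumption, not Deis logic.

def containers_to_stop(running, target):
    """Naive scale-down: keep the first `target` containers, stop the rest."""
    return running[target:]


# Snapshot taken mid-deploy: v74 is still up and the first v75 container
# has just started.
running = ["v74.web.1", "v74.web.2", "v74.web.3", "v75.web.1"]

stopped = containers_to_stop(running, target=2)
print(stopped)  # ['v74.web.3', 'v75.web.1']

# The deploy is still waiting for v75.web.1 to pass its health check,
# but the scale operation has just stopped it.
deploy_waiting_on = "v75.web.1"
deploy_would_hang = deploy_waiting_on in stopped
print(deploy_would_hang)  # True
```

Because the scale decision ignores which release a container belongs to, the result depends entirely on when the snapshot is taken, which would fit the "unpredictable results" in the title.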
From @jeff-lee on November 6, 2015 0:56
@carmstrong I was doing resiliency testing of the build process and this popped up.
Having said that, our CI is pushing builds to staging and qa throughout the day so I don't think it would be unusual for someone to try to scale an app without knowing that a build might be in progress.
It would be less of an issue in production since that's a more controlled process.
From @mboersma on November 11, 2015 15:54
> I don't think it would be unusual for someone to try to scale an app without knowing that a build might be in progress
Sounds like a common case that Deis should handle gracefully.
From @bacongobbler on January 22, 2016 0:50
I'm not sure if there's an easy way to resolve this reliably. There are a lot of concurrency issues related to Deis. This is one of them. Perhaps at some point we could use something that acts as a single source of truth to tell us when the builder is performing a build, but I don't see an easy solution to this problem that we could tackle for the LTS release.
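As a rough illustration of the "single source of truth" idea in this comment, the sketch below records which operation is in progress for each app and refuses a conflicting one. Everything here is hypothetical: `AppOperationRegistry` and its `operation` method are invented names, and the state lives in process memory, whereas a real cluster would need it in a store shared by every component.

```python
import threading
from contextlib import contextmanager


class AppOperationRegistry:
    """Hypothetical in-memory record of the operation running for each app."""

    def __init__(self):
        self._guard = threading.Lock()
        self._active = {}  # app name -> name of the operation in progress

    @contextmanager
    def operation(self, app, name):
        # Claim the app, or refuse if another operation already holds it.
        with self._guard:
            current = self._active.get(app)
            if current is not None:
                raise RuntimeError(
                    f"{app}: {current} already in progress, refusing {name}"
                )
            self._active[app] = name
        try:
            yield
        finally:
            # Release the claim even if the operation failed.
            with self._guard:
                del self._active[app]


registry = AppOperationRegistry()

# A scale request that arrives while a build holds the app is rejected.
with registry.operation("jefftest", "build"):
    try:
        with registry.operation("jefftest", "scale"):
            pass
    except RuntimeError as err:
        message = str(err)

print(message)  # jefftest: build already in progress, refusing scale
```

Rejecting the second operation is only one possible policy. Queuing the scale request until the build finishes would be another, and either one still leaves open what to do if the process that claimed the app dies without releasing it.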
From @bacongobbler on January 22, 2016 0:52
see also https://github.com/deis/deis/issues/4746
This issue was moved to teamhephy/controller#34