Repeated rollback deployments can create excess processes
While trying to create a large number of revisions on an app, I repeatedly rolled back and forth between two revisions. Eventually my deployments started to fail. When @cwlbraa and I investigated further, we found:
- There were 100 Deployments associated with the app
- There were ~3100 "web" processes associated with the app
- The deployment updater logs showed no obvious failures
- There were 100 Revisions associated with the app
We suspect PruneExcessAppRevisions eventually deleted revisions 1 and 2 (each app can have at most 100 revisions), but that doesn't explain how we got thousands of web processes.
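The counts above can be confirmed directly against the v3 API. A minimal sketch, assuming `jq` is installed and the app is named `dora` (`cf curl` and the endpoints below are standard CAPI v3):

```zsh
# Count processes, revisions, and deployments for the app (sketch; assumes jq)
guid=$(cf app dora --guid)
cf curl "/v3/apps/${guid}/processes?per_page=1" | jq .pagination.total_results   # ~3100
cf curl "/v3/apps/${guid}/revisions?per_page=1" | jq .pagination.total_results   # 100
cf curl "/v3/deployments?app_guids=${guid}&per_page=1" | jq .pagination.total_results  # 100
```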
Context
I was using "dora" as the app in a single-instance configuration. Then executed rollbacks in rapid succession Rollbacks, pushes, app summary requests failed
Steps to Reproduce
- push dora 3x
- run a script to roll back repeatedly (the zsh script I used is below)
```zsh
# Alternate `cf rollback` between revisions 1 and 2, 10,000 times
function ten-thousand-revisions-dora() {
  i=0
  while [ $i -lt 10000 ]; do
    if (( i % 2 == 0 )); then
      cf rollback dora --revision 2 -f
    else
      cf rollback dora --revision 1 -f
    fi
    i=$((i + 1))
  done
}
```
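Going from zero to the failure state looked roughly like this (a sketch; it assumes the dora app source is in the working directory and that each push creates a new revision):

```zsh
cf push dora -i 1   # repeat 3x to seed revisions 1-3
ten-thousand-revisions-dora
```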
After running the script again (slightly modified from the one above, since revisions 1 and 2 had been pruned; a sketch of the modification follows the output), we see the following error states:
```
This command is in EXPERIMENTAL stage and may change without notice
Rolling back to revision 3175 for app dora in org o / space s as admin...
OK
This command is in EXPERIMENTAL stage and may change without notice
Rolling back to revision 3176 for app dora in org o / space s as admin...
OK
This command is in EXPERIMENTAL stage and may change without notice
Rolling back to revision 3175 for app dora in org o / space s as admin...
memory quota_exceeded
FAILED
This command is in EXPERIMENTAL stage and may change without notice
Rolling back to revision 3176 for app dora in org o / space s as admin...
Unable to rollback. The code and configuration you are rolling back to is the same as the deployed revision.
FAILED
This command is in EXPERIMENTAL stage and may change without notice
Rolling back to revision 3175 for app dora in org o / space s as admin...
memory quota_exceeded
FAILED
```
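The modification was just swapping in revision numbers that survived pruning; a sketch, using the revisions from the output above:

```zsh
# Same alternating loop, targeting revisions that survived pruning (sketch)
i=0
while [ $i -lt 10000 ]; do
  if (( i % 2 == 0 )); then
    cf rollback dora --revision 3176 -f
  else
    cf rollback dora --revision 3175 -f
  fi
  i=$((i + 1))
done
```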
Expected result
Either:
- My deployments fail "gracefully" when the cluster runs out of resources
- I should hit a limit on deployments/app
And:
- There should never be more Processes than Deployments on the app (see the sketch below).
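A hedged sketch of the invariant check this implies, using standard v3 endpoints (the check itself is illustrative, not an existing Cloud Controller feature):

```zsh
# Illustrative check: web process count should never exceed deployment count
guid=$(cf app dora --guid)
procs=$(cf curl "/v3/processes?app_guids=${guid}&types=web&per_page=1" | jq .pagination.total_results)
deps=$(cf curl "/v3/deployments?app_guids=${guid}&per_page=1" | jq .pagination.total_results)
(( procs <= deps )) || echo "violation: ${procs} web processes > ${deps} deployments"
```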
Current result
- ~3000 revisions were created before the script started failing on the client side
- Subsequent attempts to push different apps failed with "Insufficient Resources: insufficient resources" errors
- cf apps and cf app dora took hours to return results
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/174940465
The labels on this GitHub issue will be updated when the story is started.