cortex
cortex copied to clipboard
Prevent or warn about deadlocked rolling updates
Description
When running cortex deploy
in a situation when a rolling update cannot be performed (based on max_surge
/ max_unavailable
), respond with an error rather than reaching a deadlocked state. In the error, mention how to resolve it (increase max_instances
or update update_strategy
)
Delete the relevant section in the "stuck updating" guide
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment
I have turned off rolling updates with documentation recommendation of max_surge: 0
and I seem to keep getting into compute unavailable
statuses. The only way it seems to resolve this is having to go through cortex cluster down
then cortex cluster up
then it seems to start working again... This is obviously not ideal for deploying some updates I'm unsure why this is happening with my system, I'm using python, distilgpt-2 with m5.large, min/max instances set to 1...
@ariel-frischer Do you see any useful info in cortex logs <api_name>
? If not, the next time this happens, do you mind running cortex cluster info --debug
, and sending the resulting zip file (which contains the full cluster state) to [email protected]? We'd be happy to take a look to see what's going on!
@ariel-frischer Do you see any useful info in
cortex logs <api_name>
? If not, the next time this happens, do you mind runningcortex cluster info --debug
, and sending the resulting zip file (which contains the full cluster state) to [email protected]? We'd be happy to take a look to see what's going on!
@deliahu Usually the logs just stop updating, or don't show anything at all. I will send you guys the zip file when I come across this again. Thank you for the support!