galaxy-server
Timeout during upgrade
The first upgrade command timed out:
[23:00 ubuntu@i-64580004:~ prod] galaxy upgrade -b discovery-elb discovery-elb:1.2 @discovery-elb:1
uuid host machine status binary config
0561 10.94.13.48 i-4c451d2c STOPPED discovery-elb:1.1 @discovery-elb:general:1.0
Are you sure you would like to UPGRADE these servers? [y/N] y
java.net.SocketTimeoutException: Read timed out
The second upgrade returned a weird error:
[23:00 ubuntu@i-64580004:~ prod] galaxy upgrade -b discovery-elb discovery-elb:1.2 @discovery-elb:1
uuid host machine status binary config
0561 10.94.13.48 i-4c451d2c STOPPED discovery-elb:1.1 @discovery-elb:general:1.0
Are you sure you would like to UPGRADE these servers? [y/N] y
uuid host machine status binary config
0561 10.94.13.48 i-4c451d2c UNKNOWN discovery-elb:1.1 @discovery-elb:general:1.0 UnexpectedResponseException{request=Request{uri=http://10.94.13.48:65000/v1/agent/slot/0561a95c-8c22-417e-963a-981b2ff9b3fb/assignment, method='PUT', headers={x-galaxy-agent-version=[b9bcdfa080fe634c57f41dd88c09542e], x-galaxy-slot-version=[21e7371f3c7e9c64628d44b964c456e2], Content-Type=[application/json]}, bodyGenerator=com.proofpoint.http.client.JsonBodyGenerator@15fd3c35}, statusCode=500, statusMessage='Could not obtain slot lock within 1000.00ms held by null thread is at com.proofpoint.galaxy.agent.DeploymentSlot.lock(DeploymentSlot.java:346) at com.proofpoint.galaxy.agent.DeploymentSlot.assign(DeploymentSlot.java:163) at com.proofpoint.galaxy.agent.AssignmentResource.assign(AssignmentResource.java:70) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jers', headers={Content-Length=[10834], Content-Type=[text/html;charset=ISO-8859-1], Cache-Control=[must-revalidate,no-cache,no-store]}}
The third succeeded:
[23:00 ubuntu@i-64580004:~ prod] galaxy upgrade -b discovery-elb discovery-elb:1.2 @discovery-elb:1
uuid host machine status binary config
0561 10.94.13.48 i-4c451d2c STOPPED discovery-elb:1.2 @discovery-elb:1
Are you sure you would like to UPGRADE these servers? [y/N] y
uuid host machine status binary config
0561 10.94.13.48 i-4c451d2c STOPPED discovery-elb:1.2 @discovery-elb:1
The timeout might be caused by the Nexus proxy being slow. This was the first access for that artifact.
For the first one, the request timed out in the client. For the second one, the agent timed out waiting for the slot lock, because it was still running the first upgrade request. If you look closely at the third request, the server was already at version 1.2 and you simply upgraded it to 1.2 again.
So all of the problems were caused by the first request taking a long time, most likely because the binary was being downloaded into your Nexus repo. The third command was fast because the binary was already in your Nexus repo.
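The sequence described above can be reproduced with a toy model. This is only a sketch of the locking behavior, not the agent's actual code: the `DeploymentSlot` class here is a hypothetical Python stand-in, the timings are shortened, and the slow download is simulated with a sleep.

```python
import threading
import time


class SlotLockTimeout(Exception):
    """Raised when the slot lock cannot be obtained in time."""


class DeploymentSlot:
    """Toy stand-in for an agent slot; not the real galaxy class."""

    def __init__(self, lock_timeout_s=1.0):
        self._lock = threading.Lock()
        self._lock_timeout_s = lock_timeout_s
        self.binary = "discovery-elb:1.1"

    def assign(self, binary, download_s):
        # Only one assignment may run at a time; give up if the lock is busy.
        if not self._lock.acquire(timeout=self._lock_timeout_s):
            raise SlotLockTimeout(
                "Could not obtain slot lock within %.2fms"
                % (self._lock_timeout_s * 1000)
            )
        try:
            time.sleep(download_s)  # simulated artifact download
            self.binary = binary
        finally:
            self._lock.release()


def run_demo():
    slot = DeploymentSlot(lock_timeout_s=0.1)
    results = []

    # First request: slow, because the artifact is not cached yet.
    first = threading.Thread(target=slot.assign, args=("discovery-elb:1.2", 0.5))
    first.start()
    time.sleep(0.05)  # let the first request take the lock

    # Second request arrives while the first still holds the lock.
    try:
        slot.assign("discovery-elb:1.2", 0.0)
        results.append("ok")
    except SlotLockTimeout:
        results.append("lock timeout")

    first.join()
    # Third request: the first has finished, so the lock is free.
    slot.assign("discovery-elb:1.2", 0.0)
    results.append("ok")
    return results, slot.binary
```

In this model the second call fails only because the first is still in progress, and the third call is a no-op upgrade to a version the slot already has.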
The real problem here is that we time out too aggressively for long-running commands like install and stop. We also need transient states like "installing", "restarting" and "stopping", so the user knows what is going on.
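One way the transient states could look is sketched below. Everything here is hypothetical: the `SlotStatus` and `Slot` names, the `RUNNING` value, and the begin/finish split are assumptions made for illustration and do not reflect the existing agent API.

```python
from enum import Enum


class SlotStatus(Enum):
    STOPPED = "STOPPED"
    RUNNING = "RUNNING"  # assumed; not shown in the output above
    UNKNOWN = "UNKNOWN"
    # Hypothetical transient states, as proposed above.
    INSTALLING = "INSTALLING"
    RESTARTING = "RESTARTING"
    STOPPING = "STOPPING"


TRANSIENT = {SlotStatus.INSTALLING, SlotStatus.RESTARTING, SlotStatus.STOPPING}


class Slot:
    def __init__(self, binary):
        self.status = SlotStatus.STOPPED
        self.binary = binary
        self._pending = None

    def begin_install(self, binary):
        # Report what is in progress instead of a bare lock timeout.
        if self.status in TRANSIENT:
            raise RuntimeError("slot is busy: %s" % self.status.value)
        self.status = SlotStatus.INSTALLING
        self._pending = binary

    def finish_install(self):
        # Called once the download and unpack have completed.
        self.binary = self._pending
        self._pending = None
        self.status = SlotStatus.STOPPED
```

With something like this, a client that polls the slot during a slow install would see `INSTALLING` in the status column, and a second upgrade attempt would be rejected with a message that names the operation in progress.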