Killing a service from AWS fails with RPC error
Steps to reproduce:
- Open the administrator interface and navigate to Resource Usage -> All.
- Try to kill a Worker.
Actual result in AWS logs:
2019-02-24 17:34:29,727 - ERROR [Admin,0 20 rpc::process_incoming_response] ResourceService,0 signaled RPC for method kill_service was unsuccessful: RPCError: Write failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/cms-1.5.dev0-py3.6.egg/cms/io/rpc.py", line 402, in process_incoming_request
    response["__data"] = method(**request["__data"])
  File "/usr/local/lib/python3.6/dist-packages/cms-1.5.dev0-py3.6.egg/cms/service/ResourceService.py", line 444, in kill_service
    return result.get()
  File "/usr/lib/python3/dist-packages/gevent/event.py", line 375, in get
    return self._raise_exception()
  File "/usr/lib/python3/dist-packages/gevent/event.py", line 355, in _raise_exception
    reraise(*self.exc_info)
  File "/usr/lib/python3/dist-packages/gevent/_compat.py", line 34, in reraise
    raise value
cms.io.rpc.RPCError: Write failed.
Meanwhile, ResourceService reports that everything is fine:
2019-02-24 17:34:29,724 - INFO [Resource,0 7 ResourceService::kill_service] Killing Worker,0 as asked.
2019-02-24 17:34:29,727 - INFO [Resource,0 8 rpc::initialize] Established connection with localhost:26000 (Worker,0) (local address: 127.0.0.1:58716).
The underlying error is "OSError: Not connected.", coming from cms/io/rpc.py: https://github.com/cms-dev/cms/blob/d4c9e926bd52d8022069c417b206b0882ef4d1ba/cms/io/rpc.py#L263-L264
In kill_service(), service.connected is still false at the moment the RPC call is executed here: https://github.com/cms-dev/cms/blob/d4c9e926bd52d8022069c417b206b0882ef4d1ba/cms/service/ResourceService.py#L442-L443
I believe this is a race condition between the above code and the connection loop that is started in https://github.com/cms-dev/cms/blob/d4c9e926bd52d8022069c417b206b0882ef4d1ba/cms/io/rpc.py#L503-L509
service.connect() just starts the loop and returns without waiting until the connection is established.
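To illustrate the suspected race, here is a minimal sketch that does not use CMS or gevent at all. ToyRemoteService, kill_immediately, and kill_after_pause are hypothetical names made up for this example, and a plain thread stands in for the connection loop; the real code in cms/io/rpc.py may behave differently in its details.

```python
import threading
import time


class ToyRemoteService:
    """Hypothetical stand-in for a remote service proxy; not the CMS class."""

    def __init__(self, connect_delay=0.1):
        self.connected = False
        self._connect_delay = connect_delay

    def connect(self):
        # Start the connection loop in the background and return at once,
        # without waiting for the connection to be established.
        threading.Thread(target=self._connect_loop, daemon=True).start()

    def _connect_loop(self):
        time.sleep(self._connect_delay)  # simulated connection setup time
        self.connected = True

    def quit(self):
        # Mirrors the observed failure: writing before the connection is up.
        if not self.connected:
            raise OSError("Not connected.")
        return True


def kill_immediately(service):
    service.connect()
    return service.quit()  # races with _connect_loop and loses


def kill_after_pause(service, pause=0.5):
    service.connect()
    time.sleep(pause)  # crude workaround: give the loop time to connect
    return service.quit()
```

In this toy model, kill_immediately() raises OSError because quit() runs before the background loop has flipped the connected flag, while kill_after_pause() succeeds only because the pause happens to be longer than the connection setup.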
Inserting time.sleep(1) before remote_service.quit() makes the call succeed. (EDIT: I should have used gevent.sleep() instead; that works too.)
Perhaps the connect() function should be made synchronous, or RPC calls should check and wait for a "connected" semaphore before actually sending data.
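A rough sketch of the second option, again outside CMS: WaitingRemoteService is a hypothetical class, the timeout parameter is an assumption, and the standard library's threading.Event is used in place of a gevent primitive (gevent.event.Event offers a similar set()/wait(timeout) interface, so the same shape should carry over).

```python
import threading
import time


class WaitingRemoteService:
    """Hypothetical proxy whose calls wait for a 'connected' event."""

    def __init__(self, connect_delay=0.1):
        self._connected = threading.Event()
        self._connect_delay = connect_delay

    def connect(self):
        # Still asynchronous: only starts the background connection loop.
        threading.Thread(target=self._connect_loop, daemon=True).start()

    def _connect_loop(self):
        time.sleep(self._connect_delay)  # simulated connection setup time
        self._connected.set()  # signal waiters that the link is up

    def quit(self, timeout=2.0):
        # Block until connected, or give up after `timeout` seconds.
        if not self._connected.wait(timeout):
            raise OSError("Not connected.")
        return True
```

With this shape the caller can invoke connect() and quit() back to back without a sleep, and a service that never comes up still fails with an error once the timeout expires instead of blocking forever.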
Looks like this is a regression from f954c2f7a89bfb47f60e0e1fcac025768341246b.