Killing a service from AWS fails with RPC error

Open andreyv opened this issue 5 years ago • 2 comments

Steps to reproduce:

  1. Open the administrator interface and navigate to Resource Usage -> All.
  2. Try to kill a Worker.

Actual result in the AWS (AdminWebServer) logs:

2019-02-24 17:34:29,727 - ERROR [Admin,0 20 rpc::process_incoming_response] ResourceService,0 signaled RPC for method kill_service was unsuccessful: RPCError: Write failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/cms-1.5.dev0-py3.6.egg/cms/io/rpc.py", line 402, in process_incoming_request
    response["__data"] = method(**request["__data"])
  File "/usr/local/lib/python3.6/dist-packages/cms-1.5.dev0-py3.6.egg/cms/service/ResourceService.py", line 444, in kill_service
    return result.get()
  File "/usr/lib/python3/dist-packages/gevent/event.py", line 375, in get
    return self._raise_exception()
  File "/usr/lib/python3/dist-packages/gevent/event.py", line 355, in _raise_exception
    reraise(*self.exc_info)
  File "/usr/lib/python3/dist-packages/gevent/_compat.py", line 34, in reraise
    raise value
cms.io.rpc.RPCError: Write failed.
.

Meanwhile, ResourceService reports that everything is fine:

2019-02-24 17:34:29,724 - INFO [Resource,0 7 ResourceService::kill_service] Killing Worker,0 as asked.
2019-02-24 17:34:29,727 - INFO [Resource,0 8 rpc::initialize] Established connection with localhost:26000 (Worker,0) (local address: 127.0.0.1:58716).

andreyv, Feb 24 '19 15:02

The underlying error is "OSError: Not connected.", raised in cms/io/rpc.py: https://github.com/cms-dev/cms/blob/d4c9e926bd52d8022069c417b206b0882ef4d1ba/cms/io/rpc.py#L263-L264

In kill_service(), service.connected is still False at the moment the RPC call is executed here: https://github.com/cms-dev/cms/blob/d4c9e926bd52d8022069c417b206b0882ef4d1ba/cms/service/ResourceService.py#L442-L443

I believe this is a race condition between the above code and the connection loop that is started in https://github.com/cms-dev/cms/blob/d4c9e926bd52d8022069c417b206b0882ef4d1ba/cms/io/rpc.py#L503-L509

service.connect() just starts the loop and returns without waiting until the connection is established.

Inserting time.sleep(1) before remote_service.quit() makes the call succeed. (EDIT: Should've used gevent.sleep() instead, that works too.)

Perhaps the connect() function should be made synchronous, or RPC calls should check and wait for a "connected" semaphore before actually sending data.
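
To illustrate the second option, here is a minimal, self-contained sketch of the race and of a "wait until connected" guard. It is not the CMS code: FakeRemoteService, wait_connected(), kill_racy() and kill_waiting() are hypothetical names invented for this example, and it uses the standard library's threading.Event rather than gevent so that it runs without extra dependencies.

```python
import threading
import time


class FakeRemoteService:
    """Hypothetical stand-in for a remote service proxy (not the CMS class)."""

    def __init__(self, connect_delay=0.05):
        self._connect_delay = connect_delay
        self._connected = threading.Event()

    @property
    def connected(self):
        return self._connected.is_set()

    def connect(self):
        # Mirrors the behaviour described above: start the connection
        # loop in the background and return immediately.
        threading.Thread(target=self._connect_loop, daemon=True).start()

    def _connect_loop(self):
        time.sleep(self._connect_delay)  # simulated connection setup
        self._connected.set()

    def wait_connected(self, timeout=None):
        # Block until the connection is up; returns False on timeout.
        return self._connected.wait(timeout)

    def quit(self):
        # Simulated RPC: writing fails if the socket is not connected yet.
        if not self.connected:
            raise OSError("Not connected.")
        return True


def kill_racy(service):
    # Calls the RPC right after connect(), before the loop has finished.
    service.connect()
    return service.quit()


def kill_waiting(service, timeout=1.0):
    # Waits for the "connected" flag before sending any data.
    service.connect()
    if not service.wait_connected(timeout):
        raise TimeoutError("connection not established in time")
    return service.quit()
```

With this sketch, kill_racy(FakeRemoteService()) raises OSError because quit() runs before the background loop sets the flag, while kill_waiting(FakeRemoteService()) succeeds. In a gevent-based code base the same idea would presumably be built on gevent.event.Event, whose wait(timeout) yields to other greenlets instead of blocking a thread.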

andreyv, Aug 19 '19 12:08

Looks like this is a regression from f954c2f7a89bfb47f60e0e1fcac025768341246b.

andreyv, Aug 28 '19 19:08