calamari
Move run_mon_job to cthulhu
This patchset fixes
https://bugzilla.redhat.com/show_bug.cgi?id=1273559
for 1.3 and
https://bugzilla.redhat.com/show_bug.cgi?id=1347137
for 1.4 (once "backported" for 1.4).
I've tested this on my local cluster and it fixed both bugs for me (on the 1.3 branch).
This PR has since dropped the patch for the 1.4 issue; it now contains only the patch for the 1.3 issue.
I'm so happy to see this PR; I had the same idea recently.
But if we move run_mon_job to cthulhu, what should we do about the following error when the remote job really needs more than 10s to run?
"detail": "RPC error ('Lost remote after 10s heartbeat')"
@syf-zsxm FWIW: we are gonna move the function only for 1.3; the 1.4 branch does not need this change since it does not exhibit this issue. The 10s timeout seems like a short one, maybe we should look at a way to make it 30s? (or configurable maybe?)
> The 10s timeout seems like a short one, maybe we should look at a way to make it 30s? (or configurable maybe?)
@b-ranto Good idea. We can specify the heartbeat value when constructing zerorpc.Client and zerorpc.Server.
In rpc.py, modify:
self._server = zerorpc.Server(RpcInterface(manager))
And in rpc_views.py, modify:
class ProfiledRpcClient(zerorpc.Client):
    # Finger in the air, over 100ms is too long
    SLOW_THRESHOLD = 0.2

    def __init__(self, *args, **kwargs):
        super(ProfiledRpcClient, self).__init__(*args, **kwargs)
But how long is suitable?
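While that question is open, here is a rough sketch of how the value could be made configurable instead of hard-coded. It relies on zerorpc's `heartbeat` keyword argument, which both `zerorpc.Server` and `zerorpc.Client` accept (default 5 seconds); zerorpc reports the remote as lost after two missed intervals, which is where the 10s in the error above comes from. Everything else is an assumption for illustration and not part of this patchset: the environment variable `CALAMARI_RPC_LOST_REMOTE`, the 30s default, and the helpers `heartbeat_from_env`, `make_server` and `make_client`.

```python
import os

# Hypothetical default: the 30s figure floated in this thread, not a tested value.
DEFAULT_LOST_REMOTE_SECONDS = 30


def heartbeat_from_env(env=os.environ, var="CALAMARI_RPC_LOST_REMOTE"):
    """Return the zerorpc heartbeat interval (seconds) for a desired
    "lost remote" window read from a hypothetical environment variable.

    zerorpc raises LostRemote after 2 * heartbeat seconds without a
    heartbeat, so the interval is half of the desired window.
    """
    lost_after = float(env.get(var, DEFAULT_LOST_REMOTE_SECONDS))
    if lost_after <= 0:
        raise ValueError("lost-remote window must be positive")
    return lost_after / 2.0


def make_server(methods, heartbeat):
    """Build the RPC server with an explicit heartbeat interval."""
    import zerorpc  # imported lazily so the helper above works without zerorpc
    return zerorpc.Server(methods, heartbeat=heartbeat)


def make_client(heartbeat):
    """Build an RPC client with the same heartbeat interval as the server."""
    import zerorpc
    return zerorpc.Client(heartbeat=heartbeat)
```

In rpc.py this would presumably become something like `self._server = make_server(RpcInterface(manager), heartbeat_from_env())`, and `ProfiledRpcClient` could forward a `heartbeat` keyword to `zerorpc.Client.__init__` in the same way. Both ends should use the same interval.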