calamari Move run_mon_job to cthulhu

Open b-ranto opened this issue 8 years ago • 4 comments

This patchset fixes

https://bugzilla.redhat.com/show_bug.cgi?id=1273559

for 1.3 and

https://bugzilla.redhat.com/show_bug.cgi?id=1347137

for 1.4 (once "backported" for 1.4).

I've tested this on my local cluster and it fixed both bugs for me (on the 1.3 branch).

Oct 11 '16 20:10 b-ranto

This PR has dropped the patch for the 1.4 issue; it now contains only the patch for the 1.3 issue.

Nov 04 '16 00:11 b-ranto

I'm so happy to see this PR; I had the same idea recently.
But if we move run_mon_job to cthulhu, what should we do about the following error when the remote job really needs more than 10s to run?

"detail": "RPC error ('Lost remote after 10s heartbeat')"

Nov 04 '16 01:11 syf-zsxm

@syf-zsxm FWIW: we are gonna move the function only for 1.3; the 1.4 branch does not need this change since it does not exhibit this issue. The 10s timeout seems like a short one, maybe we should look at a way to make it 30s? (or configurable maybe?)

Nov 09 '16 09:11 b-ranto

> The 10s timeout seems like a short one, maybe we should look at a way to make it 30s? (or configurable maybe?)

@b-ranto Good idea. We can specify the heartbeat value when constructing zerorpc.Client and zerorpc.Server.
In rpc.py, modify:

    self._server = zerorpc.Server(RpcInterface(manager))

And in rpc_views.py, modify:

    class ProfiledRpcClient(zerorpc.Client):
        # Finger in the air, over 100ms is too long
        SLOW_THRESHOLD = 0.2

        def __init__(self, *args, **kwargs):
            super(ProfiledRpcClient, self).__init__(*args, **kwargs)

But how long is suitable?

Nov 10 '16 01:11 syf-zsxm
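
To make the suggestion above concrete, here is a minimal sketch of a configurable heartbeat. It rests on a few assumptions: the environment variable name `CALAMARI_RPC_HEARTBEAT` and the helpers `read_heartbeat`, `lost_remote_after`, `make_server` and `make_client` are hypothetical and not part of Calamari. The `heartbeat` keyword on `zerorpc.Server` and `zerorpc.Client` is real. The "lost after twice the heartbeat" rule is my reading of zerorpc's behavior (a default heartbeat of 5s matches the 10s in the error above) and should be checked against the zerorpc version in use.

```python
import os

# zerorpc's default heartbeat interval in seconds (per its Client/Server signatures).
ZERORPC_DEFAULT_HEARTBEAT = 5

# Hypothetical environment variable; Calamari does not define it.
HEARTBEAT_ENV_VAR = "CALAMARI_RPC_HEARTBEAT"


def read_heartbeat(environ=None):
    """Return the heartbeat interval in seconds, falling back to the default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(HEARTBEAT_ENV_VAR)
    if raw is None:
        return ZERORPC_DEFAULT_HEARTBEAT
    value = float(raw)
    if value <= 0:
        raise ValueError("heartbeat must be positive, got %r" % raw)
    return value


def lost_remote_after(heartbeat):
    """Seconds of silence before the peer is declared lost.

    Assumption: zerorpc gives up after two missed heartbeat intervals.
    """
    return 2 * heartbeat


def make_server(methods, heartbeat):
    """Build a zerorpc server with an explicit heartbeat (for rpc.py)."""
    import zerorpc  # imported lazily so the helpers above work without zerorpc
    return zerorpc.Server(methods, heartbeat=heartbeat)


def make_client(endpoint, heartbeat):
    """Build and connect a zerorpc client with the same heartbeat (for rpc_views.py)."""
    import zerorpc
    client = zerorpc.Client(heartbeat=heartbeat)
    client.connect(endpoint)
    return client
```

Under these assumptions, setting the variable to 15 would give the 30s window proposed earlier in the thread. Both ends should probably use the same value, since each side runs its own heartbeat check.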
