storm icon indicating copy to clipboard operation
storm copied to clipboard

[STORM-2773]If a drpcserver node in cluster is down,drpc cluster won't work if we don't modify the drpc.server configuration and restart the cluster

Open liu-zhaokun opened this issue 8 years ago • 14 comments

https://issues.apache.org/jira/browse/STORM-2773

There is a cluster which includes three nodes named storm1,storm2,storm3.And there is a drpcserver in every node,a worker which has been started on strom1.When strom1 was down with hardware failure,my drpc topology won't work,when I send request from drpcclient. As storm1 was down,so the worker will be restarted on another node,but it can't Initialize successfully because the call method of Adder will throw a RuntimeException,when drpcspout try to connect to storm1,so the worker will restart again. In conclusion,If a drpcserver node in cluster is down,drpc cluster won't work until we modify the drpc.server configuration and restart the cluster,but in production,it's difficult to restart whole cluster. So I think we should catch the RuntimeException and log it,and the drpc topology will work normally.

liu-zhaokun avatar Oct 10 '17 03:10 liu-zhaokun

You are also going to need to fix the checkstyle errors.

revans2 avatar Oct 10 '17 18:10 revans2

@revans2 I have fix the checkstyle errors and log the exception.Could you help me review this PR again?

liu-zhaokun avatar Oct 11 '17 10:10 liu-zhaokun

Actually I have dug deeper, and I think we want to do this differently, although it will be just as simple of a change.

Currently a ConnectException is thrown deep down that end up being wrapped and bubbles up to be a RuntimeException which you are catching and simply not adding that client to the list of clients. The issue I am concerned about is what happens if the node is temporarily down and will come back up. Before this would cause the worker to restart until it was back up (not ideal but would work). With this patch we will end up never having the spout connect to that drpc server.

I think the proper fix is actually here

https://github.com/liu-zhaokun/storm/blob/d92f1a9c8d7442d4959fec57813fc5de42b179a9/storm-client/src/jvm/org/apache/storm/drpc/DRPCInvocationsClient.java#L44

If instead of letting the Exception bubble up and be handled by the Spout we instead will catch/log it and set the AtomicReference to null. This will allow the spout to do its reconnect logic periodically to try and recover, but will still keep the spout up and running if a node is completely removed.

Longer term it might be interesting to try and replace the config with a location in zookeeper where the DRPC servers can register, so we can add/remove DRPC servers as needed and the system automatically adjusts.

revans2 avatar Oct 11 '17 14:10 revans2

In the long term, it's difficult for us to improve drpc. But would you like to help merge this PR?

liu-zhaokun avatar Oct 16 '17 01:10 liu-zhaokun

@liu-zhaokun I'm not sure you understood my comment from before. This current code has a bug in it. If a DRPC server is down when the spout comes up, even temporarily down, it will never try and connect again. This is not acceptable. We need it to keep trying periodically. I think we can fix it by making the change at

https://github.com/liu-zhaokun/storm/blob/d92f1a9c8d7442d4959fec57813fc5de42b179a9/storm-client/src/jvm/org/apache/storm/drpc/DRPCInvocationsClient.java#L44

instead.

revans2 avatar Oct 16 '17 13:10 revans2

@revans2 I try to fix this bug follow your advice,but I found the original bug occurred in line 41 https://github.com/liu-zhaokun/storm/blob/d92f1a9c8d7442d4959fec57813fc5de42b179a9/storm-client/src/jvm/org/apache/storm/drpc/DRPCInvocationsClient.java#L41 drpcinvocationclient wan't to establish connection with the thrift server.And I am not sure we should fix this bug in line 41.Could you give me some suggestion?Thanks.

liu-zhaokun avatar Oct 26 '17 11:10 liu-zhaokun

@liu-zhaokun You are correct I had forgotten the ThriftClient establishes the connection then initializes the _protocol member and then finally the Client itself is created which will read from the protocol.

The solution is that we need a way to not call reconnect from the constructor of ThriftClient. We have that in a few cases for testing/local mode

https://github.com/liu-zhaokun/storm/blob/d92f1a9c8d7442d4959fec57813fc5de42b179a9/storm-client/src/jvm/org/apache/storm/security/auth/ThriftClient.java#L73-L75

But I think we want another parameter to the constructor to say don't do this, so we can try and connect ourselves and if it fails we can ignore it and go on.

revans2 avatar Oct 26 '17 20:10 revans2

@revans2 Do you mean there no need to reconnect when I create a DRPCInvocationClient instance?

liu-zhaokun avatar Oct 27 '17 06:10 liu-zhaokun

Currently ThriftClient will connect to the server itself in the constructor. What we want to do is to optionally not do that, and instead let the DRPCInvocationsClient do it in a location where it can catch the exception and ignore it, because we know it is going to try again to reconnect later.

revans2 avatar Oct 30 '17 17:10 revans2

Hi,@revans2,I'm not sure whether the ci's failure related to me.And I have add a retry to DRPCInvaction,could you help me to review it?

liu-zhaokun avatar Nov 25 '17 06:11 liu-zhaokun

@HeartSaVioR Hi,are you available to help me review this PR?

liu-zhaokun avatar Nov 28 '17 00:11 liu-zhaokun

@HeartSaVioR @revans2 Hi,are you available to help me review this PR?

liu-zhaokun avatar Dec 04 '17 06:12 liu-zhaokun

Hi,@revans2,I'm not sure whether the ci's failure related to me.And I have add a retry to DRPCInvaction,could you help me to review it?

liu-zhaokun avatar Jan 02 '18 08:01 liu-zhaokun

@revans2 Hi,revans! I guess you forgot this PR after a long time.I just want to know if this PR is available because I used to work hard to make it. Even if you think it is not good enough, I hope you can reply to me rather than ignore it.

liu-zhaokun avatar Mar 20 '18 06:03 liu-zhaokun

We are currently cleaning up old issues with stale discussions. The last comment here was made 5 years ago and Storm as evolved. If this is still an issue or relevant for your work, feel free to re-open and to rebase / fix the related PR.

rzo1 avatar Dec 04 '23 11:12 rzo1