incubator-pegasus icon indicating copy to clipboard operation
incubator-pegasus copied to clipboard

`Negotiation failed` java client after pegasus servers with authentication enabled were restarted

Open empiredan opened this issue 4 months ago • 0 comments

After all servers (including both meta/replica servers) in a pegasus cluster with authentication enabled were restarted, the java client that had connected to them before would never succeed in connecting to the meta server due to Negotiation failed with ERR_TIMEOUT for port 33391 as follows:

I1012 17:22:33.122887 28573 MetaSession.java:197] query meta got error, rpc error(ERR_OK), meta error(ERR_FORWARD_TO_OTHERS), forward address(rpc_address(10.1.136.90:8170)), current leader
(rpc_address(10.1.136.41:8170)), remain retry count(1), need switch leader(true), need delay(false)
I1012 17:22:33.123011 28571 ReplicaSession.java:189] rpc_address(10.1.136.90:8170): the session is disconnected, needs to reconnect
I1012 17:22:33.123739 28522 ReplicaSession.java:204] rpc_address(10.1.136.90:8170): start to async connect to target, wait channel to active
I1012 17:22:33.123829 28522 ReplicaSession.java:424] Channel [id: 0xea9a05e5, L:/10.1.136.26:33391 - R:/10.1.136.90:8170] for session rpc_address(10.1.136.90:8170) is active

......

E1012 17:22:43.134124 28520 Negotiation.java:85] Negotiation failed
Java exception follows:
com.xiaominew.infra.pegasus.rpc.ReplicationException: ERR_TIMEOUT
        at com.xiaominew.infra.pegasus.security.Negotiation$RecvHandler.run(Negotiation.java:81)
        at com.xiaominew.infra.pegasus.rpc.async.ReplicaSession.tryNotifyFailureWithSeqID(ReplicaSession.java:321)
        at com.xiaominew.infra.pegasus.rpc.async.ReplicaSession$4.run(ReplicaSession.java:367)
        at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
I1012 17:22:43.134315 28520 ReplicaSession.java:138] channel to rpc_address(10.1.136.90:8170) closed
W1012 17:22:43.134387 28522 ReplicaSession.java:418] Channel [id: 0xea9a05e5, L:/10.1.136.26:33391 ! R:/10.1.136.90:8170] for session rpc_address(10.1.136.90:8170) is inactive
I1012 17:22:43.134445 28522 ReplicaSession.java:274] rpc_address(10.1.136.90:8170): mark the session to be disconnected from state=CONNECTED

corresponding logs of primary meta server is as follows:

D2024-10-12 17:22:33.123 (1728724953123732943 6bad)   meta.io-thrd.27565: server_negotiation.cpp:40:start(): SERVER_NEGOTIATION(CLIENT=10.1.136.26:33391): start negotiation
D2024-10-12 17:22:33.123 (1728724953123744766 6bad)   meta.io-thrd.27565: network.cpp:702:on_server_session_accepted(): server session accepted, remote_client = 10.1.136.26:33391, current_count = 6
D2024-10-12 17:22:33.123 (1728724953123748123 6bad)   meta.io-thrd.27565: network.cpp:707:on_server_session_accepted(): ip session increased, remote_client = 10.1.136.26:33391, current_count = 4
D2024-10-12 17:22:33.123 (1728724953123964597 6bad)   meta.io-thrd.27565: network.cpp:744:on_server_session_disconnected(): session 10.1.136.26:33391 disconnected, the total client sessions count remains 5
D2024-10-12 17:22:33.123 (1728724953123970218 6bad)   meta.io-thrd.27565: network.cpp:754:on_server_session_disconnected(): client ip 10.1.136.26:33391 has still 3 of sessions to this server
W2024-10-12 17:22:33.123 (1728724953123972893 6bad)   meta.io-thrd.27565: network.cpp:397:on_failure(): disconnect to remote {}, the socket will be lazily closed when the session destroyed
D2024-10-12 17:22:33.123 (1728724953123997959 6bbb)   meta.default7.01006bad00011326: negotiation_manager.cpp:376:get_negotiation(): negotiation was removed for msg: RPC_NEGOTIATION, 10.1.136.26:33391

Or Negotiation failed with ERR_SESSION_RESET for port 33375 as follows:

I1012 17:22:23.114938 28573 MetaSession.java:197] query meta got error, rpc error(ERR_OK), meta error(ERR_FORWARD_TO_OTHERS), forward address(rpc_address(10.1.136.90:8170)), current leader
(rpc_address(10.1.136.41:8170)), remain retry count(4), need switch leader(true), need delay(false)
I1012 17:22:23.115047 28571 ReplicaSession.java:189] rpc_address(10.1.136.90:8170): the session is disconnected, needs to reconnect
I1012 17:22:23.115684 28572 ReplicaSession.java:204] rpc_address(10.1.136.90:8170): start to async connect to target, wait channel to active
I1012 17:22:23.115765 28572 ReplicaSession.java:424] Channel [id: 0xae9c99dd, L:/10.1.136.26:33375 - R:/10.1.136.90:8170] for session rpc_address(10.1.136.90:8170) is active

......

W1012 17:22:28.117709 28570 ReplicaSession.java:309] rpc_address(10.1.136.90:8170): actively close the session because it's not responding for 30 seconds
I1012 17:22:28.117861 28570 ReplicaSession.java:138] channel to rpc_address(10.1.136.90:8170) closed
W1012 17:22:28.117931 28572 ReplicaSession.java:418] Channel [id: 0xae9c99dd, L:/10.1.136.26:33375 ! R:/10.1.136.90:8170] for session rpc_address(10.1.136.90:8170) is inactive
I1012 17:22:28.118019 28570 MetaSession.java:197] query meta got error, rpc error(ERR_SESSION_RESET), meta error(ERR_UNKNOWN), forward address(null), current leader(rpc_address(10.1.136.90:8170)), remain retry count(3), need switch leader(true), need delay(false)
E1012 17:22:28.118207 28572 Negotiation.java:85] Negotiation failed
Java exception follows:
com.xiaominew.infra.pegasus.rpc.ReplicationException: ERR_SESSION_RESET
        at com.xiaominew.infra.pegasus.security.Negotiation$RecvHandler.run(Negotiation.java:81)
        at com.xiaominew.infra.pegasus.rpc.async.ReplicaSession.tryNotifyFailureWithSeqID(ReplicaSession.java:321)
        at com.xiaominew.infra.pegasus.rpc.async.ReplicaSession.markSessionDisconnect(ReplicaSession.java:266)
        at com.xiaominew.infra.pegasus.rpc.async.ReplicaSession$DefaultHandler.channelInactive(ReplicaSession.java:419)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:240)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:226)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
        at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:379)
        at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:344)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:240)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:226)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1299)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:240)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:226)
        at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:903)
        at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:768)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
I1012 17:22:28.118362 28572 ReplicaSession.java:138] channel to rpc_address(10.1.136.90:8170) closed
I1012 17:22:28.118568 28572 ReplicaSession.java:274] rpc_address(10.1.136.90:8170): mark the session to be disconnected from state=CONNECTED

corresponding logs of primary meta server is as follows:

D2024-10-12 17:22:23.115 (1728724943115678962 6bad)   meta.io-thrd.27565: server_negotiation.cpp:40:start(): SERVER_NEGOTIATION(CLIENT=10.1.136.26:33375): start negotiation
D2024-10-12 17:22:23.115 (1728724943115698138 6bad)   meta.io-thrd.27565: network.cpp:702:on_server_session_accepted(): server session accepted, remote_client = 10.1.136.26:33375, current_count = 6
D2024-10-12 17:22:23.115 (1728724943115702576 6bad)   meta.io-thrd.27565: network.cpp:707:on_server_session_accepted(): ip session increased, remote_client = 10.1.136.26:33375, current_count = 4
D2024-10-12 17:22:23.115 (1728724943115923801 6bad)   meta.io-thrd.27565: network.cpp:744:on_server_session_disconnected(): session 10.1.136.26:33375 disconnected, the total client sessions count remains 5
D2024-10-12 17:22:23.115 (1728724943115929441 6bad)   meta.io-thrd.27565: network.cpp:754:on_server_session_disconnected(): client ip 10.1.136.26:33375 has still 3 of sessions to this server
W2024-10-12 17:22:23.115 (1728724943115959959 6bad)   meta.io-thrd.27565: network.cpp:397:on_failure(): disconnect to remote {}, the socket will be lazily closed when the session destroyed
D2024-10-12 17:22:23.115 (1728724943115988613 6bb7)   meta.default3.01006bad00011320: negotiation_manager.cpp:376:get_negotiation(): negotiation was removed for msg: RPC_NEGOTIATION, 10.1.136.26:33375

empiredan avatar Oct 17 '24 11:10 empiredan