Invoke timeout issue
Environment:

- Consumer: Dubbo 3.2.1, ordinary synchronous calling mode; the project references only a single interface class of the provider.
- Provider: Dubbo 3.0.10, interface-level + application-level dual registration, exposing dual protocols: dubbo on port 20880 and triple on port 10800.
- Registry: Nacos 2.2.0.
Problem scenario:
The downstream provider recycled nodes (6 nodes, taken offline gracefully; nothing goes wrong if no nodes are recycled). Then every node of the upstream application started hitting RPC timeouts simultaneously and continuously, and the consumer had to be restarted to recover.
Checking the upstream logs, we found this error message being reported continuously, and all of it comes from the same downstream node (node IP .139, hereafter referred to as 139).
Based on this log, I speculate that the RPC timeouts occur because reconnection is initiated repeatedly, and each reconnection waits synchronously for a 3000 ms timeout. As a result, the nettyWorker thread is occupied for 3 seconds at a time and cannot serve the Netty network requests of the consumer (sending) application, which causes the RPC timeouts.
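To illustrate the suspicion, here is a minimal, self-contained Netty sketch (my own illustration, not Dubbo's actual reconnect code; the class name and the 3000 ms stand-in are made up for the example). It shows why a synchronous 3-second wait on a single event-loop ("nettyWorker") thread starves every channel served by that thread, so in-flight RPC responses cannot be read and the calls time out:

```java
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;

// Illustration only, NOT Dubbo source: a blocking wait on an event-loop
// thread starves all channels served by that thread.
public class BlockingReconnectSketch {
    public static void main(String[] args) {
        // Single-threaded worker group, standing in for one nettyWorker thread.
        EventLoopGroup worker = new NioEventLoopGroup(1);

        // Stand-in for a reconnect attempt that waits synchronously for up
        // to 3000 ms (e.g. connecting to the dead .139 node until timeout).
        worker.next().execute(() -> {
            try {
                Thread.sleep(3000); // while this runs, the loop handles NO I/O
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Any read/write for channels registered on this loop is now queued
        // behind the 3-second block; an RPC with a shorter timeout expires
        // before its response can even be read off the wire.
        worker.next().execute(() ->
                System.out.println("ran only after the 3000 ms block"));

        worker.shutdownGracefully();
    }
}
```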
There is another strange thing about this log: why does the NettyConnectionClient class connect to port 20880? From reading the source code and debugging locally, only TripleInvoker should use NettyConnectionClient, while DubboInvoker uses the NettyClient class.
The fact that NettyConnectionClient connects to port 20880 made me suspect a problem with protocol exposure and service reference, so I looked for differences on this node and did find something suspicious. As shown in the figure below: among instances 1, 2 and 3, the endpoints of instances 1 and 2 carry both protocols, but the endpoints of instance 3 only contain dubbo, with no tri.
In theory this should not happen, because their configurations are exactly the same; instance 3 is one of the scaled-out nodes (we actually scaled out by 6 nodes). Leaving aside why it only has dubbo and no tri, I tried to reproduce this scenario locally with some instances exposing only the dubbo protocol and not the tri protocol, but it ran without any problem.
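For context, the provider's dual-protocol, dual-registration setup described above is roughly equivalent to the following programmatic sketch (my reconstruction from the description, not the actual project code; the application name and registry address are placeholders). With identical configuration, every instance should publish both protocols in its registered endpoints metadata, which is why instance 3 showing only dubbo is suspicious:

```java
import java.util.Arrays;

import org.apache.dubbo.config.ApplicationConfig;
import org.apache.dubbo.config.ProtocolConfig;
import org.apache.dubbo.config.RegistryConfig;
import org.apache.dubbo.config.bootstrap.DubboBootstrap;

public class ProviderConfigSketch {
    public static void main(String[] args) {
        ApplicationConfig application = new ApplicationConfig("demo-provider"); // placeholder name
        application.setRegisterMode("all"); // interface-level + application-level dual registration

        // Dual protocols, matching the ports in the report.
        ProtocolConfig dubboProtocol = new ProtocolConfig("dubbo", 20880);
        ProtocolConfig tripleProtocol = new ProtocolConfig("tri", 10800);

        DubboBootstrap.getInstance()
                .application(application)
                .registry(new RegistryConfig("nacos://127.0.0.1:8848")) // placeholder address
                .protocols(Arrays.asList(dubboProtocol, tripleProtocol))
                // .service(...) actual exported services omitted
                .start();
    }
}
```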
At this point the investigation basically hit a dead end and could not continue.
Note:
- We tried to reproduce this scenario online, and it is 100% reproducible: as long as enough nodes are recycled at one time, the scenario is triggered (our first attempt recycled only 1 node and did not reproduce it; the second attempt recycled 6 nodes and did). The problematic nodes are always the ones whose endpoints only expose dubbo.
- In the first picture, Ref=4 in the red box. From the source code, this should be the reference count for a shared connection under the tri protocol (see the reference-counting sketch after these notes). That makes it very strange: after the downstream provider goes offline, destroyInvoker is called, so in theory this Ref should drop to 0 and the connection should finally be closed. But it never reaches 0, so the connection is never closed and reconnection is initiated over and over. The log below is from one of the scaled-out nodes that reported no errors: during the final recycle it called destroyInvoker 4 times, twice for dubbo and twice for tri, which is as expected. The other nodes behave the same way.
  However, the problematic downstream node .139 called destroyInvoker only twice, as shown in the figure below. In theory that is also fine, because its endpoints only contain dubbo.
  So I don't understand why Ref=4 rather than 2 in the red box in Figure 1, or why it cannot drop to 0.
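On the Ref count, the sketch below shows the usual retain/release contract such a counter implies (my own illustration, not Dubbo's actual implementation; the class and method names are made up). If retain() is called more often than release(), as Ref=4 with only two destroyInvoker calls would suggest, close() is never reached and the client keeps reconnecting to the dead address:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustration only, not Dubbo's code: the typical reference-counting
// contract behind a shared connection. Each invoker referencing the
// connection retains it once; destroying the invoker must release it once.
final class RefCountedConnection {
    private final AtomicLong refCount = new AtomicLong();

    void retain() {
        refCount.incrementAndGet();
    }

    void release() {
        long remaining = refCount.decrementAndGet();
        // The underlying channel is closed only when the count hits zero.
        // If retains outnumber releases (Ref=4, releases=2), this branch is
        // never taken: the connection stays open and keeps reconnecting.
        if (remaining == 0) {
            close();
        }
    }

    private void close() {
        System.out.println("underlying channel closed");
    }
}
```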
By default, there should be only one thread doing the reconnect.
Can you please provide a reproducible demo?