Invoke timeout issue
Consumer: Dubbo 3.2.1, ordinary synchronous calls; the project uses only one interface class from the provider.
Provider: Dubbo 3.0.10, interface-level + application-level dual registration, exposing two protocols: dubbo on port 20880 and triple on port 10800.
Nacos registry: 2.2.0
Describe the problem scenario:
The downstream provider recycles nodes (6 nodes, taken offline gracefully). All nodes of the upstream application then hit RPC timeouts simultaneously and continuously, and the consumer must be restarted to recover. Without the recycling, everything works fine.
Checking the upstream logs, we found that this error message is reported continuously, and always for the same downstream node (node IP .139, hereafter referred to as 139).
Based on this log, I suspect the RPC timeouts occur because reconnection is attempted repeatedly and each attempt waits synchronously for the 3000 ms timeout. As a result, the nettyWorker thread is occupied for 3 seconds at a time and cannot serve the Netty network requests of the consumer application and the sender application, which causes the RPC timeouts.
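To make the suspected mechanism concrete, here is a minimal sketch using only the JDK. It is not Dubbo or Netty code: a single-thread executor stands in for one worker thread, `Thread.sleep` stands in for a reconnect attempt that blocks until its timeout, and the class and method names are hypothetical. It only shows that any task queued behind a blocking task on the same thread is delayed by roughly the blocking time.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class WorkerStallSketch {

    /**
     * Queues a blocking "reconnect" task and then a "response" task on one
     * worker thread. Returns how long the response task waited before it
     * started running, in milliseconds.
     */
    static long responseDelayMillis(long connectTimeoutMillis) {
        // One thread stands in for a single worker (event loop) thread.
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            long submittedAt = System.nanoTime();

            // Hypothetical reconnect attempt that blocks until the timeout expires.
            Runnable blockingReconnect = () -> {
                try {
                    Thread.sleep(connectTimeoutMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            worker.submit(blockingReconnect);

            // Hypothetical RPC response handling, queued behind the reconnect.
            Callable<Long> recordStart = System::nanoTime;
            Future<Long> startedAt = worker.submit(recordStart);

            return TimeUnit.NANOSECONDS.toMillis(startedAt.get() - submittedAt);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("simulation failed", e);
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // 300 ms stands in for the 3000 ms timeout to keep the demo short.
        System.out.println("response delayed by about "
                + responseDelayMillis(300) + " ms");
    }
}
```

If the real reconnect does run on the worker thread and blocks in this way, every request sharing that thread would wait behind it, which would match the simultaneous timeouts described above. Whether that is what actually happens in Dubbo 3.2.1 is exactly what needs to be confirmed.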
There is another strange thing about this log: why does the nettyConnectionClient class connect to port 20880? From reading the source code and debugging locally, only tripleInvoker should use nettyConnectionClient, while dubboInvoker uses the nettyClient class.
Since nettyConnectionClient is nevertheless connecting to port 20880, I wondered whether something is wrong with protocol exposure or service reference, so I looked for what is different about this node and did find something suspicious. As shown in the figure below, of instances 1, 2 and 3, the endpoints of instances 1 and 2 expose both protocols, whereas the endpoints of instance 3 expose only dubbo and no tri.
In theory this should not happen, because their configurations are exactly the same; instance 3 was added by scaling out (we actually scaled out by 6 nodes). Setting aside why it exposes only dubbo and no tri, I tried to reproduce the scenario locally with some instances exposing only the dubbo protocol and not the tri protocol, but everything ran without problems.
Investigating further led to a dead end, and I could not continue.
We then tried to reproduce the scenario online and found it 100% reproducible: as long as several nodes are recycled at once, the problem is triggered. The problematic nodes are always the ones whose endpoints expose only dubbo.
In the first picture, the red box shows Ref=4. From the source code, I believe this is the reference count for a shared connection under the tri protocol. This is very strange. After the downstream provider goes offline, destroyInvoker is called; in theory Ref should then drop to 0 and the connection should eventually be closed. Instead it never reaches 0, so the connection is never closed and reconnection is attempted repeatedly. The log below is from a node added during the scale-out that reports no errors. During the final recycling, destroyInvoker ran 4 times for it, twice each for dubbo and tri, so in theory there should be no problem. The same is true for the other healthy nodes.
However, for the problematic downstream node .139, destroyInvoker ran only twice, as shown in the figure below. In theory that is also fine, because its endpoints expose only dubbo.
So I don't understand why the red box in Figure 1 shows Ref=4 instead of 2, and why it cannot be reduced to 0.
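The mismatch can be illustrated with a minimal reference-counting sketch. It is not Dubbo's actual implementation: the class and method names are hypothetical, and it assumes that each invoker sharing the connection increments the count once and each destroyInvoker call decrements it once.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SharedConnectionSketch {
    private final AtomicInteger refCount = new AtomicInteger();
    private volatile boolean closed;

    /** Called once for every invoker that starts sharing this connection. */
    void retain() {
        refCount.incrementAndGet();
    }

    /** Called once per destroyed invoker; closes on the last release. */
    void release() {
        if (refCount.decrementAndGet() == 0) {
            closed = true; // stands in for really closing the channel
        }
    }

    int refCount() {
        return refCount.get();
    }

    boolean isClosed() {
        return closed;
    }

    /** Retains and then releases the given number of times. */
    static SharedConnectionSketch simulate(int retains, int releases) {
        SharedConnectionSketch connection = new SharedConnectionSketch();
        for (int i = 0; i < retains; i++) {
            connection.retain();
        }
        for (int i = 0; i < releases; i++) {
            connection.release();
        }
        return connection;
    }

    public static void main(String[] args) {
        // Healthy node: 4 references, destroyInvoker called 4 times.
        SharedConnectionSketch healthy = simulate(4, 4);
        System.out.println("ref=" + healthy.refCount()
                + " closed=" + healthy.isClosed()); // ref=0 closed=true

        // Node 139 as observed: Ref=4, but destroyInvoker called only twice.
        SharedConnectionSketch node139 = simulate(4, 2);
        System.out.println("ref=" + node139.refCount()
                + " closed=" + node139.isClosed()); // ref=2 closed=false
    }
}
```

Under these assumptions, a connection that was retained 4 times but released only twice stays open with a count of 2, so anything that keeps reconnecting an open but unreachable connection would never stop. This does not explain where the two extra references on node 139 come from.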
By default, there should be only one thread doing the reconnect.
Can you please provide a reproducible demo?