SingleProtocolConnectionManager concurrency issue
- [ ] I have searched the issues of this repository and believe that this is not a duplicate.
Environment
- Dubbo version: 3.2.6
- Operating System version: centos 7.6
- Java version: 21
- Registry center: nacos
- ConnectionManager uses SingleProtocolConnectionManager
Steps to reproduce this issue
A--->B
- Delete all .dubbo files.
- Start system A first. At this point A subscribes to service B while B has no instances, but the subscription relationship with B already exists.
- When system B comes up, system A is notified through org.apache.dubbo.registry.client.ServiceDiscoveryRegistry#doSubscribe.
- Eventually org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener#onEvent is invoked.
- "get MetadataInfo with revision": because nothing is cached, a remote MetadataService proxy for system B is created to fetch the corresponding metadataInfo. The call chain is:
- org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener#doOnEvent
- serviceDiscovery.getRemoteMetadata
- MetadataUtils.getRemoteMetadata
- MetadataUtils.referProxy(instance): a connectionClient is created through SingleProtocolConnectionManager and placed in the cache, and a close listener is registered to remove it from the cache: connectionClient.addCloseListener(() -> connections.remove(address, connectionClient)). This removal is asynchronous.
- MetadataUtils.destroyProxy(proxyHolder): this closes the corresponding protocol invoker and, in the end, the connectionClient. The cached connectionClient is removed only later, when the close listener fires.
- PS. In SingleProtocolConnectionManager, removing the connectionClient has therefore become an asynchronous operation.
- The flow then continues with org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener#notifyAddressChanged.
- notifyListener.notify(urls): the individual service interfaces also refresh their subscriptions.
- The first few services that subscribe to application B also obtain their connectionClient through SingleProtocolConnectionManager. At this point the connectionClient may not have been removed from the cache yet: it is still closing but is already marked unavailable. As a result, service discovery fails for some services.
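The window described in the steps above can be reproduced in isolation. The following is a minimal sketch, not Dubbo code: `FakeClient`, the `eventLoop` queue, and the address are hypothetical stand-ins, and the queue is drained by hand so that the gap between "marked unavailable" and "removed from the cache" is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

class StaleConnectionDemo {
    /** Hypothetical stand-in for a connection client; not Dubbo's real class. */
    static final class FakeClient {
        private volatile boolean available = true;
        private Runnable closeListener = () -> {};

        void addCloseListener(Runnable listener) { this.closeListener = listener; }
        boolean isAvailable() { return available; }

        /** Marks the client unusable immediately but defers the listener, as an event loop would. */
        void close(Queue<Runnable> eventLoop) {
            available = false;
            eventLoop.add(closeListener);
        }
    }

    static final Map<String, FakeClient> connections = new ConcurrentHashMap<>();

    /** Returns the cached client, or creates one whose close listener evicts it later. */
    static FakeClient connect(String address) {
        return connections.computeIfAbsent(address, key -> {
            FakeClient client = new FakeClient();
            client.addCloseListener(() -> connections.remove(key, client));
            return client;
        });
    }

    /** Returns {available during the window, available after the listener ran}. */
    static boolean[] run() {
        connections.clear();
        Queue<Runnable> eventLoop = new ArrayDeque<>();
        String address = "10.0.0.2:50051"; // hypothetical address of system B

        FakeClient metadataClient = connect(address); // the referProxy step
        metadataClient.close(eventLoop);              // the destroyProxy step

        // The notify step: the listener has not run yet, so the cache still
        // hands out the client that is already marked unavailable.
        boolean duringWindow = connect(address).isAvailable();

        eventLoop.forEach(Runnable::run);             // the close listener finally fires
        boolean afterListener = connect(address).isAvailable();
        return new boolean[] {duringWindow, afterListener};
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("during window: " + result[0] + ", after listener: " + result[1]);
    }
}
```

In this model the lookup made inside the window returns the closed client, while the same lookup after the listener has run creates a fresh one.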
Expected Behavior
- In SingleProtocolConnectionManager, remove the connectionClient without relying on the asynchronous close listener. That would most likely avoid the problem.
- Strengthen the org.apache.dubbo.remoting.api.connection.AbstractConnectionClient#retain method.
- retain already checks counter<=0, which means "This instance has been destroyed", yet for some reason the instance has not been removed from the cache. Currently the removal listener fires only when the channel is closed.
- Add a destroyTime and a failbackConnectionClient to AbstractConnectionClient.
- When retain finds that the client can no longer be used, create a failbackConnectionClient.
- In release, when destroy is called, update destroyTime. A scheduled task scans all connections in the cache and directly removes any connection whose destroyTime has passed. If that connection has a failbackConnectionClient, the failbackConnectionClient is promoted to take its place.
- Change connectionClient.addCloseListener(() -> connections.remove(address, connectionClient)) to connectionClient.remove(connections)... The newly added remove and the listener's remove are mutually exclusive.
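To make the destroyTime proposal concrete, here is a rough sketch of the sweep. It is an illustration under assumptions, not a patch against Dubbo: `Conn`, `sweep`, and the grace period are hypothetical, only the field names `destroyTime` and `failbackConnectionClient` come from the proposal, and the sweep takes the current time as a parameter so it can be driven from a scheduled task or called directly.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class DestroyTimeSweepSketch {
    /** Hypothetical cached connection; field names follow the proposal above. */
    static final class Conn {
        final String label;
        volatile long destroyTime = Long.MAX_VALUE; // MAX_VALUE means "not destroyed"
        volatile Conn failbackConnectionClient;     // optional replacement, may be null

        Conn(String label) { this.label = label; }

        /** Records the deadline after which the sweep may evict this connection. */
        void destroy(long nowMillis, long graceMillis) {
            destroyTime = nowMillis + graceMillis;
        }
    }

    final Map<String, Conn> connections = new ConcurrentHashMap<>();

    /** Intended to run periodically; evicts or replaces connections past their destroyTime. */
    void sweep(long nowMillis) {
        connections.forEach((address, conn) -> {
            if (nowMillis < conn.destroyTime) {
                return; // still alive, or still inside its grace period
            }
            Conn failback = conn.failbackConnectionClient;
            if (failback != null) {
                connections.replace(address, conn, failback); // promote the replacement
            } else {
                connections.remove(address, conn);            // plain forced removal
            }
        });
    }

    public static void main(String[] args) {
        DestroyTimeSweepSketch cache = new DestroyTimeSweepSketch();
        Conn stale = new Conn("stale");
        stale.failbackConnectionClient = new Conn("failback");
        cache.connections.put("b:50051", stale);

        stale.destroy(1_000L, 500L);
        cache.sweep(1_200L); // before the deadline: entry is kept
        System.out.println(cache.connections.get("b:50051").label);
        cache.sweep(1_500L); // deadline reached: failback takes over
        System.out.println(cache.connections.get("b:50051").label);
    }
}
```

Both `replace(key, old, new)` and `remove(key, value)` are conditional, so a sweep that races with the close listener cannot evict an entry that has already been swapped for a different client.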
Actual Behavior
If there is an exception, please attach the exception trace:
Just put your stack trace here!
Are you using the Triple protocol? This looks like a bug in connection management for the Triple protocol.
IMO, to solve this we can:
Remove the connection from org.apache.dubbo.remoting.api.connection.SingleProtocolConnectionManager#connections before closing it. We should make sure that removing and fetching a connection are synchronized, which ConcurrentHashMap can achieve.
Adding a FailbackConnection might not be a good idea because it may cause a memory leak. ReferenceCountExchangeClient in the Dubbo protocol leaks memory quite easily because there are many ways to trigger it, and fixing it has taken more than 10 versions.
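As a rough illustration of the "remove before close" ordering, assuming a simplified cache (the `Client` class, `connect`, and `close` below are hypothetical and do not mirror Dubbo's signatures): evicting the entry with `ConcurrentHashMap.remove(key, value)` before closing means a later lookup can never be handed a client that is in the middle of closing.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class RemoveBeforeCloseSketch {
    /** Hypothetical client; only tracks an id and whether it was closed. */
    static final class Client {
        final int id;
        volatile boolean closed;

        Client(int id) { this.id = id; }
        void close() { closed = true; }
    }

    private final ConcurrentHashMap<String, Client> connections = new ConcurrentHashMap<>();
    private final AtomicInteger ids = new AtomicInteger();

    /** Fetch-or-create; atomic per key thanks to computeIfAbsent. */
    Client connect(String address) {
        return connections.computeIfAbsent(address, key -> new Client(ids.incrementAndGet()));
    }

    /** Evicts first, then closes, instead of evicting from a close listener. */
    void close(String address, Client client) {
        connections.remove(address, client); // no-op if the entry was already replaced
        client.close();
    }

    public static void main(String[] args) {
        RemoveBeforeCloseSketch manager = new RemoveBeforeCloseSketch();
        Client first = manager.connect("b:50051");
        manager.close("b:50051", first);
        Client second = manager.connect("b:50051");
        System.out.println("new client created: " + (second != first) + ", closed: " + second.closed);
    }
}
```

This ordering only protects lookups that happen after the eviction. A caller that fetched the client just before `close` still holds a reference to it, which is why a retain check is also needed.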
We can override the close() method in NettyConnectionClient to make sure the reference in org.apache.dubbo.remoting.api.connection.SingleProtocolConnectionManager#connections has been removed. Once the removal finishes, check whether the current client is still in use, that is, whether it has been retained through org.apache.dubbo.remoting.api.connection.AbstractConnectionClient#retain. If so, add the connection back to org.apache.dubbo.remoting.api.connection.SingleProtocolConnectionManager#connections and skip closing the connection.
Note that the remove and the retain check should be synchronized with org.apache.dubbo.remoting.api.connection.SingleProtocolConnectionManager#connect.
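One way to read this suggestion is sketched below. It is an assumption-laden model rather than the real NettyConnectionClient: `Client`, `refCount`, `release`, and the single `lock` are hypothetical, and the lock stands in for whatever mechanism would synchronize close() with connect.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class GuardedCloseSketch {
    /** Hypothetical client with a reference count, loosely modelled on retain/release. */
    static final class Client {
        final AtomicInteger refCount = new AtomicInteger();
        volatile boolean closed;
    }

    private final Map<String, Client> connections = new HashMap<>();
    private final Object lock = new Object();

    /** Fetch-or-create and retain, under the same lock that close() uses. */
    Client connect(String address) {
        synchronized (lock) {
            Client client = connections.computeIfAbsent(address, key -> new Client());
            client.refCount.incrementAndGet(); // retain
            return client;
        }
    }

    void release(Client client) {
        client.refCount.decrementAndGet();
    }

    /** Returns true if the client was really closed, false if the close was skipped. */
    boolean close(String address, Client client) {
        synchronized (lock) {
            connections.remove(address, client);          // remove the reference first
            if (client.refCount.get() > 0) {
                connections.putIfAbsent(address, client); // still in use: put it back
                return false;                             // and skip the close
            }
            client.closed = true;
            return true;
        }
    }

    public static void main(String[] args) {
        GuardedCloseSketch manager = new GuardedCloseSketch();
        Client metadata = manager.connect("b:50051"); // used by the metadata proxy
        Client service = manager.connect("b:50051");  // same client, retained by a service
        manager.release(metadata);
        System.out.println("closed while retained: " + manager.close("b:50051", metadata));
        manager.release(service);
        System.out.println("closed when unused: " + manager.close("b:50051", metadata));
    }
}
```

Because connect and close share one lock, a client is either retained before the check and kept alive, or evicted before any new caller can fetch it.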
@EarthChen @icodening PTAL
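Personally, I don't think adding a FailbackConnection is a big problem, as long as every remove scenario is covered. The count in ReferenceCountExchangeClient is already an atomic operation. I have seen datasource implementations use this approach to detect the last operation, and it is quite mature there. 1. The normal path is to remove the connection through addCloseListener; 2. Add a fallback, a removeTimeout operation: if the connection has not been removed within the allotted time, it is force-removed. I also mentioned destroyTime in my issue description, which serves as this final forced operation.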