[BUG] Concurrent RPC operations might lead to TCP connection leaks.
Search before asking
- [X] I had searched in the issues and found no similar issues.
What happened
There is detailed information in #15954: there are too many 127.0.0.1:50052 connections, and they lead to a TCP connection leak. I think 3.1.x has the same problem, but I didn't test it.
What you expected to happen
no
How to reproduce
no
Anything else
No response
Version
3.1.x
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
How many connections are built up on the alert-server? Is it equal to the number of tasks running after restarting the service? How many worker-servers do you have?
1. There are three worker-servers.
2. The number of connections varies; it primarily depends on how long the server has been running.
3. The connection count doesn't always equal the number of tasks.

I got scenarios 2 and 3; I think that's because connection leaks don't occur consistently. I customized the log printing: only when channels.put(host, channel) returns a non-null result has a leak occurred (see the sketch below).
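A minimal sketch of that diagnostic, assuming a ConcurrentHashMap-backed channels map like the one in NettyRemotingClient (the class name and log message here are my own, not project code):

```java
import io.netty.channel.Channel;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical diagnostic (not the actual DolphinScheduler code):
// a non-null return from channels.put() means an old channel for the
// same host was silently dropped from the map. That channel can no
// longer be looked up, so it can never be closed: a leak.
public class LeakDetectingChannelMap {
    private final ConcurrentHashMap<String, Channel> channels = new ConcurrentHashMap<>();

    public void register(String host, Channel channel) {
        Channel previous = channels.put(host, channel);
        if (previous != null && previous.isActive()) {
            System.err.printf("Leak: active channel %s for host %s was replaced%n",
                    previous, host);
        }
    }
}
```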
Maybe we need to check the problem further: print the connection port when closing the channel, and check whether the ports present in the system include ports that were supposedly closed. That would show whether the closing method is called but the channel is not actually closed.
Look at this: the connection has been running here for many days (I observed many connections just like it). Because channels (the map that stores all the channels) loses its reference to the channel forever, the channel can never be closed; we can't get it back once the reference was overwritten by channels.put(host, channel).
When tasks are concurrent (tasks running on the same worker-server), say tasks A and B, the concurrency happens while creating channels: NettyRemotingClient.createChannel is called, A creates channelA and B creates channelB. A sends messages over channelA and B over channelB, but when A closes its channel it looks the channel up by host, so it gets channelB, and channelB is closed. channelA is left dangling.
If convenient, you could verify this: run the tasks serially and check whether there are still unclosed connections. If the problem no longer reproduces, this is probably the cause. The fix would need to change the connection-setup code to close the channel directly, rather than looking it up again in the closing function. Or simply reuse the channel and don't close it at all (the risk would need to be evaluated).
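A compressed sketch of that interleaving (hypothetical class and helper names stand in for the real createChannel/close path):

```java
import io.netty.channel.Channel;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration of the race described above (not the real
// NettyRemotingClient source). Tasks A and B on the same host can both
// miss the cache and both create a channel; the second put() orphans
// the first channel, and close-by-host can only ever find one of them.
public class RacyChannelCache {
    private final ConcurrentHashMap<String, Channel> channels = new ConcurrentHashMap<>();

    public Channel createChannel(String host) {
        Channel cached = channels.get(host);     // A and B both see null here
        if (cached != null && cached.isActive()) {
            return cached;
        }
        Channel channel = connect(host);         // A creates channelA, B creates channelB
        channels.put(host, channel);             // the second put() overwrites the first
        return channel;
    }

    public void closeChannel(String host) {
        Channel channel = channels.remove(host); // both A and B get channelB
        if (channel != null) {
            channel.close();                     // channelB closed twice; channelA leaked
        }
    }

    private Channel connect(String host) {
        // stand-in for the real Netty Bootstrap#connect call
        throw new UnsupportedOperationException("sketch only");
    }
}
```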
I have reproduced this issue. In version 3.2.x, it appears that the channel is reused and simply not closed. Storing the connection in a map seems intended for reusing it, but closing it after each use defeats the purpose of reuse and only adds complexity. We could add logic to a handler that closes the connection if no data has been transferred for a while, with a generous threshold, for example 24 hours. However, on a preliminary review of the code, versions 3.0 and 3.1 also seem to have this issue.
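One way to implement that idle-close logic is Netty's built-in IdleStateHandler; a minimal sketch (the initializer class is mine, and the 24-hour threshold is the value suggested above, not anything from the project):

```java
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;
import java.util.concurrent.TimeUnit;

// Sketch of the "reuse the channel, close only after long inactivity"
// idea using Netty's stock IdleStateHandler.
public class IdleClosingInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        // fire an ALL_IDLE IdleStateEvent when there has been neither a
        // read nor a write for 24 hours
        ch.pipeline().addLast(new IdleStateHandler(0, 0, 24, TimeUnit.HOURS));
        ch.pipeline().addLast(new ChannelDuplexHandler() {
            @Override
            public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
                if (evt instanceof IdleStateEvent) {
                    ctx.close(); // idle too long: release the connection
                } else {
                    super.userEventTriggered(ctx, evt);
                }
            }
        });
    }
}
```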
I will test and observe this on version 3.1.9. If the problem exists there too, I'll discuss it here and then submit a PR to fix it.
> I will test and observe this on version 3.1.9. If the problem exists there too, I'll discuss it here and then submit a PR to fix it.

+1
Good catch. This is due to a concurrency problem in NettyRemotingClient: if multiple operations belonging to one host arrive at the same time, multiple channels might be created. I submitted https://github.com/apache/dolphinscheduler/pull/16021 to fix this.
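For reference, the usual way to close this kind of race is to make the lookup-or-create step atomic per host, e.g. with computeIfAbsent; a sketch of the idea only (the PR may implement it differently):

```java
import io.netty.channel.Channel;
import java.util.concurrent.ConcurrentHashMap;

// Illustration of atomic per-host channel creation, so that concurrent
// callers share a single channel. See PR #16021 for the actual fix.
public class AtomicChannelCache {
    private final ConcurrentHashMap<String, Channel> channels = new ConcurrentHashMap<>();

    public Channel getChannel(String host) {
        // computeIfAbsent invokes the factory at most once per missing
        // key, so two tasks hitting the same host get the same channel
        return channels.computeIfAbsent(host, this::connect);
    }

    private Channel connect(String host) {
        // stand-in for the real Netty Bootstrap#connect call; note that
        // inactive channels would still need to be evicted from the map
        throw new UnsupportedOperationException("sketch only");
    }
}
```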
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.
This issue has been closed because it has not received a response for too long. You can reopen it if you encounter similar problems in the future.