TCP connection abnormally interrupted, TCP connection reset
Is there an existing issue for this?
- [X] I have searched the existing issues
What happened?
- In this environment: the VIP is 172.19.206.95, located on host 172.19.204.8; that host's cilium_host IP is 245.0.5.212. The VIP is exported by MetalLB. The MySQL client is 172.18.191.88, and the MySQL server's pod IP is 245.0.9.149, on host 172.19.204.10.
- I ran tcpdump on 172.19.204.10's node eth0 and on the pod's eth0, and found a problem: host_204_10.pcap is decoded as VXLAN on port 8472, and both captures are filtered with tcp.port==32789.
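For anyone trying to reproduce this capture, here is a sketch of the commands involved (the interface name, the VXLAN UDP port 8472, and the service port 32789 are taken from this report; the `tshark` decode step is my assumption, not something the reporter stated):

```shell
# Assumed values from this report; adjust for your environment.
VXLAN_PORT=8472
SVC_PORT=32789

# Capture the VXLAN-encapsulated traffic on the node (requires root):
#   tcpdump -i eth0 -w host_204_10.pcap "udp port ${VXLAN_PORT}"
# Decode the inner TCP stream offline, e.g. with tshark's decode-as:
#   tshark -r host_204_10.pcap -d "udp.port==${VXLAN_PORT},vxlan" \
#          -Y "tcp.port == ${SVC_PORT}"

# Display filter applied to both captures in the report:
FILTER="tcp.port == ${SVC_PORT}"
echo "$FILTER"
```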
Here is the pcap file: _root_host_204_10.pcap(1).gz
I initially thought the packets were too big, but after checking historical captures I found that large packets are normal here.
So I suspect that Cilium has an abnormality in its packet processing.
Cilium Version
1.12.7
Kernel Version
4.19
Kubernetes Version
1.18.17
Sysdump
Uploading later
Relevant log output
No response
Anything else?
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Please update to the latest point release and try reproducing it there.
How does the RST manifest as a problem? What doesn't work?
cilium sysdump: https://drive.google.com/file/d/1YFvrzZ998S6UMUxrEuEWdhiySWPijWSC/view?usp=drive_link
@lmb During normal TCP communication, an RST packet suddenly appeared, causing the connection to be disconnected. This RST packet was not initiated by the client, and the server's response packets were not returned normally.
Have you tried newer versions?
This is a production environment and cannot be upgraded at will.
@lmb What other information do you need? I have already uploaded the relevant information.
@llhhbc, since the version you are running, there have been 12 releases, each with more than a hundred fixes. There is a reasonable chance that whatever is causing your issue has already been fixed. Without you upgrading to the latest patch release and checking that the problem still exists, it's not possible to do much troubleshooting.
@youngnick At present, this has affected the customer's production system. We need to give the customer a clear conclusion report, including how the problem occurred and whether it has been fundamentally resolved. I'm not sure if it's caused by too many connections, because we don't know the cause now, and we can't reproduce this problem.
I'm also not sure why this map size is so big. Does Cilium automatically expand it based on the environment?
I'm sorry @llhhbc, but we can't help you troubleshoot an issue with your environment.
We can help if you can confirm that there is a bug, and that it persists in the most recent release of Cilium 1.12, 1.12.7, with a reproduction.
@youngnick Because we don't understand the underlying logic, we only know the symptoms right now. Moreover, this problem cannot be reproduced in the test environment, so we cannot confirm it or narrow down its scope. Can you help locate the problem? We also want to confirm whether the new version has resolved the issue. Can we communicate on Slack?
my slack account is longhui.li
I'm sorry @llhhbc, but I cannot interactively troubleshoot this issue with you on Slack either. As I said, if you can find a reproduction, then I can assign this issue out to someone. But I am unable to spend time troubleshooting with you.
May be worth a try if you still have no clue:
```shell
sudo pwru --output-tuple --output-meta --filter-track-skb 'host $mysql_pod_ip and port 3306 and tcp[tcpflags]&tcp-rst!=0'
```
@jschwinger233 pwru requires kernel > 5.3; this kernel is 4.19.
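On kernels too old for pwru, a plain tcpdump capture can at least show which hop first emits the RST. A hedged sketch, assuming the MySQL port from the pwru suggestion above and generic interface names (`eth0`, `lxc<pod>` are placeholders, not from the original thread):

```shell
# Assumed value from this thread; adjust as needed.
MYSQL_PORT=3306

# BPF filter matching TCP packets with the RST flag set on the MySQL port.
# Note: on the node's eth0 the traffic is still VXLAN-encapsulated, so this
# filter only matches on interfaces that see the decapsulated packets
# (e.g. the pod-side veth).
RST_FILTER="port ${MYSQL_PORT} and tcp[tcpflags] & tcp-rst != 0"
echo "$RST_FILTER"

# With root on the node, capture on both sides and compare where the
# RST first appears:
#   tcpdump -i lxc<pod> -nn -w pod_rst.pcap "$RST_FILTER"
#   tcpdump -i eth0 -nn -w node_vxlan.pcap 'udp port 8472'
```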
It has been solved. It was other components that were modifying the packets handled by Cilium.
@llhhbc

> It has been solved. It was other components that modified the cilium data package.
I think I'm seeing something similar. Can you elaborate on "other components that modified the cilium data package"?