TCP connection abnormally interrupted by a TCP reset (RST)

Open llhhbc opened this issue 2 years ago • 12 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

What happened?

  1. In this environment, the VIP is 172.19.206.95, exported by MetalLB and located on host 172.19.204.8; that host's cilium_host IP is 245.0.5.212. The MySQL client is 172.18.191.88. The MySQL server's pod IP is 245.0.9.149, on host 172.19.204.10.
  2. I ran tcpdump on the cluster eth0 of host 172.19.204.10 and on the pod's eth0, and found a problem. host_204_10.pcap is decoded as VXLAN on port 8472, and both captures are filtered with tcp.port==32789.
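For anyone who wants to repeat this check on the raw captures without a GUI, here is a minimal sketch in Python. It assumes untagged Ethernet frames carrying IPv4, with the host capture wrapping the pod's traffic in VXLAN over UDP port 8472 as described above. The function names (`l4_of`, `inner_frame`, `tcp_summary`) are made up for illustration and are not part of Cilium or tcpdump.

```python
import struct

VXLAN_PORT = 8472   # UDP port used for VXLAN in the capture above
TCP_RST = 0x04      # RST bit in the TCP flags byte


def l4_of(frame):
    """Return (ip_protocol, l4_bytes) for an untagged Ethernet + IPv4 frame, else None."""
    if len(frame) < 34 or frame[12:14] != b"\x08\x00":   # EtherType 0x0800 = IPv4
        return None
    ihl = (frame[14] & 0x0F) * 4           # IPv4 header length in bytes
    return frame[23], frame[14 + ihl:]     # protocol field is byte 9 of the IP header


def inner_frame(frame):
    """If frame is VXLAN over UDP/8472, return the encapsulated Ethernet frame."""
    l4 = l4_of(frame)
    if l4 is None or l4[0] != 17 or len(l4[1]) < 16:     # 17 = UDP
        return None
    dst_port = struct.unpack("!H", l4[1][2:4])[0]
    if dst_port != VXLAN_PORT:
        return None
    return l4[1][16:]                      # skip 8-byte UDP + 8-byte VXLAN headers


def tcp_summary(frame):
    """Return (src_port, dst_port, is_rst) for a TCP frame, unwrapping VXLAN once."""
    frame = inner_frame(frame) or frame
    l4 = l4_of(frame)
    if l4 is None or l4[0] != 6 or len(l4[1]) < 14:      # 6 = TCP
        return None
    src_port, dst_port = struct.unpack("!HH", l4[1][:4])
    return src_port, dst_port, bool(l4[1][13] & TCP_RST)
```

Feeding every frame from both pcap files through `tcp_summary` and keeping the results where one port is 32789 and `is_rst` is true should show whether the RST is visible on both sides of the VXLAN tunnel or only on one of them.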

image

Here is the pcap file: _root_host_204_10.pcap(1).gz

_root_mysql_204_10.pcap.gz

I initially thought the packets were too big, but after checking earlier packet captures, I found that large packets are normal here.

image

So I think Cilium may be mishandling these packets.

Cilium Version

1.12.7

Kernel Version

4.19

Kubernetes Version

1.18.17

Sysdump

Uploading later

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

llhhbc avatar Feb 08 '24 06:02 llhhbc

Please update to the latest point release and try reproducing it there.

How does the RST manifest as a problem? What doesn't work?

lmb avatar Feb 08 '24 09:02 lmb

cilium sysdump: https://drive.google.com/file/d/1YFvrzZ998S6UMUxrEuEWdhiySWPijWSC/view?usp=drive_link

llhhbc avatar Feb 08 '24 11:02 llhhbc

@lmb During normal TCP communication, an RST packet suddenly appeared and caused the connection to be dropped. This RST packet was not sent by the client, and the data packet sent back by the server was not delivered normally.

llhhbc avatar Feb 08 '24 11:02 llhhbc

Have you tried newer versions?

lmb avatar Feb 13 '24 09:02 lmb

This is a production environment and cannot be upgraded at will.

llhhbc avatar Feb 18 '24 01:02 llhhbc

@lmb What other information do you need? I have already uploaded the relevant information.

llhhbc avatar Feb 18 '24 01:02 llhhbc

@llhhbc, since the version you are running, there have been 12 releases, each with more than a hundred fixes. There is a reasonable chance that whatever is causing your issue has already been fixed. Without you upgrading to the latest patch release and checking that the problem still exists, it's not possible to do much troubleshooting.

youngnick avatar Feb 18 '24 22:02 youngnick

@youngnick At present, this has affected the customer's production system. We need to give the customer a clear conclusion report, including how the problem occurred and whether it has been fundamentally resolved. I'm not sure if it's caused by too many connections, because we don't know the cause now, and we can't reproduce this problem.

WeCom screenshot_17082626672058

I'm also not sure why this map size is so big. Does Cilium expand it automatically based on the environment?

llhhbc avatar Feb 19 '24 01:02 llhhbc

I'm sorry @llhhbc, but we can't help you troubleshoot an issue with your environment.

We can help if you can confirm that there is a bug, and that it persists in the most recent release of Cilium 1.12, 1.12.7, with a reproduction.

youngnick avatar Feb 19 '24 02:02 youngnick

@youngnick Because we don't understand the underlying logic, we can only observe the symptoms for now. Moreover, this problem cannot be reproduced in the test environment, so we cannot confirm anything or narrow down the scope of the problem. Can you help locate the problem? We also want to confirm whether the new version has resolved the issue. Can we communicate on Slack?
image

My Slack account is longhui.li

llhhbc avatar Feb 19 '24 02:02 llhhbc

I'm sorry @llhhbc, but I cannot interactively troubleshoot this issue with you on Slack either. As I said, if you can find a reproduction, then I can assign this issue out to someone. But I am unable to spend time troubleshooting with you.

youngnick avatar Feb 19 '24 02:02 youngnick

image

llhhbc avatar Feb 19 '24 03:02 llhhbc

May be worth a try if you still have no clue:

sudo pwru --output-tuple --output-meta --filter-track-skb 'host $mysql_pod_ip and port 3306 and tcp[tcpflags]&tcp-rst!=0'
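As a rough illustration of what the `tcp[tcpflags]&tcp-rst!=0` part of that filter tests: in pcap-filter syntax `tcpflags` is byte offset 13 of the TCP header and `tcp-rst` is the bit 0x04. The Python sketch below mirrors that test and also names the other flags set on a segment, which helps tell a bare RST from an RST+ACK. The helper names are hypothetical and are not part of pwru or libpcap.

```python
TCPFLAGS_OFFSET = 13   # value of `tcpflags` in pcap-filter expressions
TCP_RST = 0x04         # value of `tcp-rst` in pcap-filter expressions

# Bit values of the flags byte, lowest bit first.
FLAG_BITS = [
    ("FIN", 0x01), ("SYN", 0x02), ("RST", 0x04), ("PSH", 0x08),
    ("ACK", 0x10), ("URG", 0x20), ("ECE", 0x40), ("CWR", 0x80),
]


def filter_matches(tcp_header: bytes) -> bool:
    """Same test as the filter: is the RST bit set in the flags byte?"""
    return (tcp_header[TCPFLAGS_OFFSET] & TCP_RST) != 0


def flag_names(tcp_header: bytes) -> list:
    """List the names of all flags set in the TCP header's flags byte."""
    flags = tcp_header[TCPFLAGS_OFFSET]
    return [name for name, bit in FLAG_BITS if flags & bit]
```

A header whose flags byte is 0x14 matches the filter and decodes to RST and ACK, while an ordinary data segment with 0x18 (PSH and ACK) does not match.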

jschwinger233 avatar Feb 19 '24 06:02 jschwinger233

@jschwinger233 pwru requires kernel > 5.3, but this kernel is 4.19.

llhhbc avatar Feb 21 '24 01:02 llhhbc

It has been solved. It was other components that modified the cilium data package.

llhhbc avatar Feb 26 '24 06:02 llhhbc

@llhhbc

It has been solved. It was other components that modified the cilium data package.

I think I'm seeing something similar. Can you elaborate on "other components that modified the cilium data package"?

jasonaliyetti avatar Apr 26 '24 19:04 jasonaliyetti