
Add support for Path MTU discovery

sathuish opened this issue 4 years ago • 7 comments

We have an AWS multi-region setup. We are trying to install the application across the regions, and the deployment is failing due to MTU auto-detection being enabled in the CNI.

Expected Behavior

Data transfer and receiving should work properly.

Current Behavior

We are using Calico version 3.18. We set the MTU value to 0 so that Calico auto-detects the MTU. In our AWS on-prem multi-region deployment, we see issues with pod-to-pod communication. When we reduced the MTU value to 1350, communication worked properly without any issues.
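For reference, the usual recipe for picking a manual MTU is the smallest MTU on any path between nodes minus the overhead of the encapsulation in use. Below is a minimal sketch of that arithmetic in Go, using the per-encap overhead values from the Calico MTU documentation; the report doesn't say which encapsulation or path MTU is actually involved, so the 1370-byte path is only an assumed example, and `podMTU` is a name made up for illustration:

```go
package main

import "fmt"

// Per-encap overheads in bytes, as listed in the Calico MTU docs.
const (
	overheadIPIP      = 20 // IP-in-IP adds one extra IPv4 header
	overheadVXLAN     = 50 // VXLAN over IPv4
	overheadWireGuard = 60 // WireGuard over IPv4
)

// podMTU returns the MTU to configure for pod interfaces, given the
// smallest MTU on any path between nodes and the encap overhead in use.
func podMTU(pathMTU, overhead int) int {
	return pathMTU - overhead
}

func main() {
	// Assumed example: an inter-region path limited to 1370 bytes with
	// IP-in-IP encapsulation would yield the 1350 that worked above.
	fmt.Println(podMTU(1370, overheadIPIP)) // 1350
}
```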

Possible Solution

Add Path MTU discovery to Calico

Steps to Reproduce (for bugs)

  1. Enable MTU auto-detection in Calico.
  2. Deploy across multiple AWS regions.
  3. Transfer/receive larger packets.

Context

Add Path MTU discovery to Calico

Your Environment

  • Calico version: 3.18.3
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes v1.20.7
  • Operating System and version: CentOS 7.9.2009


sathuish avatar Oct 21 '21 17:10 sathuish

Yep, as you discovered, Calico can only detect the MTU based on the local node's configuration. This was by design, and of course it has some limitations. However, like you said, manual MTU configuration does exist for such situations.

Path MTU discovery might solve this, but it's an undertaking we've so far tried to avoid due to the extra complexity it involves. For now I'm leaving this open as an enhancement, but I suggest continuing to use manually configured MTU values.
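To illustrate the limitation: node-local detection can only enumerate the MTUs of the node's own interfaces, so it can never see a bottleneck sitting on a router between regions. A simplified sketch of that kind of local scan using the Go standard library (this is an illustration only, not Calico's actual detection code, and `smallestLocalMTU` is an invented name):

```go
package main

import (
	"fmt"
	"net"
)

// smallestLocalMTU scans the node's own interfaces and returns the
// smallest MTU among the ones that are up and not loopback. This is
// all a node-local approach can observe; it knows nothing about the
// MTU of hops further along the path (e.g. an inter-region VPN).
func smallestLocalMTU() (int, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return 0, err
	}
	min := 0
	for _, ifi := range ifaces {
		if ifi.Flags&net.FlagUp == 0 || ifi.Flags&net.FlagLoopback != 0 {
			continue
		}
		if min == 0 || ifi.MTU < min {
			min = ifi.MTU
		}
	}
	if min == 0 {
		return 0, fmt.Errorf("no usable interfaces found")
	}
	return min, nil
}

func main() {
	mtu, err := smallestLocalMTU()
	if err != nil {
		fmt.Println("detect failed:", err)
		return
	}
	fmt.Println("smallest local MTU:", mtu)
}
```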

We are trying to install the application across the regions

I'd also strongly recommend against running a single cluster across multiple regions, and instead use availability zones. A single Kubernetes cluster / Calico cluster across multiple regions is bound to cause you some pain, due to added latency and instability caused by running the control plane across the public internet.

If you need redundancy, I'd recommend a separate cluster-per-region, with nodes spread across AZs within the region.

caseydavenport avatar Oct 29 '21 21:10 caseydavenport

@caseydavenport I wanted to follow up on this existing issue and raise awareness of problems that happen when advertising services via BGP/ECMP.

In our environment we stumbled upon this and had to implement a fix. Cluster nodes are contained within a single region. Communication within a region always uses jumbo MTU (inter-node and customer-to-externalIP). Communication between regions happens through upstream routers which have different connectivity (mainly MPLS, but also a backup VPN).

Even having MPLS in the path has caused issues, due to the 4 bytes it requires for headers. The backup VPN may go through the internet, where the MTU could be as low as 1300 bytes.

The problem is described at https://blog.cloudflare.com/path-mtu-discovery-in-practice/, and Cloudflare's implementation is available at https://github.com/cloudflare/pmtud.

In our implementation (the readme may not be up to date) we:

  1. push ICMP type 3 code 4 (fragmentation-needed) packets into a specific NFLOG group
  2. take the payload of the frag-needed packet and re-send it to all nodes within the same cluster (for now, over a separate L2 connection); see the parsing sketch after this list
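For context, the core of step 2 is recognizing a fragmentation-needed message and extracting the next-hop MTU plus the embedded original packet so it can be replayed to the other nodes. Here is a minimal sketch of that parsing in Go, per RFC 792 / RFC 1191; the raw bytes would come from the NFLOG group in step 1 (the NFLOG plumbing and the re-send fan-out are omitted), the names are invented for illustration, and the example message is made up:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// fragNeeded holds the fields of an ICMP "fragmentation needed"
// (type 3, code 4) message per RFC 792 / RFC 1191.
type fragNeeded struct {
	NextHopMTU uint16 // MTU of the constricting hop
	Original   []byte // embedded IP header + first 8 bytes of the dropped packet
}

// parseFragNeeded parses a raw ICMPv4 message and returns the next-hop
// MTU and the embedded original packet, which is what would be re-sent
// to the other nodes. The checksum is not validated here.
func parseFragNeeded(icmp []byte) (*fragNeeded, error) {
	if len(icmp) < 8 {
		return nil, errors.New("short ICMP message")
	}
	// Type 3 = destination unreachable, code 4 = fragmentation needed
	// and DF set.
	if icmp[0] != 3 || icmp[1] != 4 {
		return nil, errors.New("not a fragmentation-needed message")
	}
	// Bytes 4-5 are unused; bytes 6-7 carry the next-hop MTU (RFC 1191).
	return &fragNeeded{
		NextHopMTU: binary.BigEndian.Uint16(icmp[6:8]),
		Original:   icmp[8:],
	}, nil
}

func main() {
	// Assumed example message: type 3, code 4, zero checksum, next-hop
	// MTU 1300 (0x0514), followed by the start of the embedded packet.
	msg := []byte{3, 4, 0, 0, 0, 0, 0x05, 0x14, 0x45}
	fn, err := parseFragNeeded(msg)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("next-hop MTU %d, %d embedded bytes\n", fn.NextHopMTU, len(fn.Original))
}
```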

I am wondering whether that's something you would still consider to be in scope for Calico?

defo89 avatar Jan 11 '24 16:01 defo89

@matthewdupre might be the right one to comment on this.

My first inclination is that this would be best handled as a separate solution, with Calico exposing the necessary surfaces to enable implementing PMTU without actually writing the code into Calico itself. However, I am happy to be convinced otherwise - I am not an expert on PMTU.

caseydavenport avatar Jan 18 '24 00:01 caseydavenport

I am wondering whether that's something you would still consider to be in scope for Calico?

This is a very interesting discussion, as we are facing the same issue when using Calico in eBPF mode and advertising our service IPs. The MTU changes across the different networks a packet traverses before reaching our service IP, and since we can't respond to ICMP, packets get dropped.

ehsan310 avatar May 22 '24 13:05 ehsan310

since we can't respond to ICMP, packets get dropped

@ehsan310 why can't you respond to ICMP?

tomastigera avatar May 22 '24 17:05 tomastigera

Any update on this one?

tomastigera avatar Sep 06 '24 16:09 tomastigera

I have upgraded the cluster but did not get a chance to change the MTU back to the default 1500 to see if the issue is fixed. We have a production workload, so I have to see what I can do to test it. @tomastigera

ehsan310 avatar Sep 06 '24 18:09 ehsan310

This issue is stale because it is kind/enhancement or kind/bug and has been open for 180 days with no activity.

github-actions[bot] avatar Jul 10 '25 06:07 github-actions[bot]

This issue was closed because it has been inactive for 30 days since being marked as stale.

github-actions[bot] avatar Aug 09 '25 12:08 github-actions[bot]