fabric-tna
MPLS TTL behaviour might be wrong
Currently, we set the MPLS TTL to a default value (64); however, we should copy the TTL from the IP header. We also need to copy the TTL back to the IP header when we pop the MPLS label.
The original RFC does not give much detail on how to handle the TTL of IP packets: https://tools.ietf.org/html/rfc3031#section-3.23
But there are some rules in RFC 3032 (MPLS Label Stack Encoding): https://tools.ietf.org/html/rfc3032#section-2.4.3
2.4.3. IP-dependent rules
We define the "IP TTL" field to be the value of the IPv4 TTL field,
or the value of the IPv6 Hop Limit field, whichever is applicable.
When an IP packet is first labeled, the TTL field of the label stack
entry MUST BE set to the value of the IP TTL field. (If the IP TTL
field needs to be decremented, as part of the IP processing, it is
assumed that this has already been done.)
When a label is popped, and the resulting label stack is empty, then
the value of the IP TTL field SHOULD BE replaced with the outgoing
TTL value, as defined above. In IPv4 this also requires modification
of the IP header checksum.
It is recognized that there may be situations where a network
administration prefers to decrement the IPv4 TTL by one as it
traverses an MPLS domain, instead of decrementing the IPv4 TTL by the
number of LSP hops within the domain.
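The push and pop rules quoted above can be sketched in a few lines of Python. This is only an illustration, not fabric-tna code: the `Packet` class and the function names are hypothetical, and "outgoing TTL" follows the RFC 3032 section 2.4.1 definition (the larger of incoming TTL minus one, and zero).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Packet:
    # Hypothetical model: an IP TTL plus a stack of MPLS label TTLs
    # (the top of the stack is the last list element).
    ip_ttl: int
    label_ttls: List[int] = field(default_factory=list)


def push_first_label(pkt: Packet) -> None:
    # "When an IP packet is first labeled, the TTL field of the label
    # stack entry MUST BE set to the value of the IP TTL field."
    # Any IP-level decrement is assumed to have happened already.
    pkt.label_ttls.append(pkt.ip_ttl)


def pop_label(pkt: Packet) -> int:
    # Outgoing TTL: the larger of (incoming TTL - 1) and zero.
    incoming = pkt.label_ttls.pop()
    outgoing = max(incoming - 1, 0)
    if not pkt.label_ttls:
        # Stack is now empty: copy the outgoing TTL into the IP header.
        # (For IPv4 a real device must also update the header checksum.)
        pkt.ip_ttl = outgoing
    return outgoing


pkt = Packet(ip_ttl=64)
push_first_label(pkt)
pop_label(pkt)
```

A packet whose outgoing TTL is zero must not be forwarded further; the sketch leaves that check to the caller.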
Also, there are some explanations on these websites: https://www.ciscopress.com/articles/article.asp?p=680824&seqNum=4 http://wiki.kemot-net.com/mpls-ttl-behavior
They say we need to set the TTL value to TTL-1 from the previous header (push/swap/pop).
Following the RFC means having the final IP TTL decremented by the number of hops inside the fabric (e.g., ttl = ttl - 2 for packets going from one leaf to the other in a 2x2).
But considering that we use MPLS only inside the fabric (i.e., we don't peer with external MPLS routers) and that Trellis abstracts the whole fabric as one big IP router, do we need to follow the RFC? Or should we just make sure that the IP TTL is decremented by one independently of the number of hops inside the fabric? cc @charlesmcchan @pierventre
Shouldn't it be -3 instead of -2? Our case should be the same as figure 3-4 in this page
I believe SR does both COPY_OUT and COPY_IN at the first and last hop respectively.
Yes, it should be -3
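The -3 figure can be checked with a short calculation. This is a minimal sketch under assumptions, not a description of the actual pipeline: it assumes a leaf-spine-leaf path where the ingress leaf routes and pushes (copying the IP TTL out), the spine pops (copying the TTL back in), and the egress leaf routes again. The function name is hypothetical.

```python
def ttl_after_fabric_uniform(ip_ttl: int) -> int:
    # Ingress leaf: IP routing decrements the TTL, then the MPLS label
    # is pushed with its TTL copied from the IP header (copy out).
    ip_ttl -= 1
    mpls_ttl = ip_ttl

    # Spine: penultimate hop popping. The MPLS TTL is decremented and
    # copied back into the IP header (copy in).
    mpls_ttl -= 1
    ip_ttl = mpls_ttl

    # Egress leaf: plain IP routing decrements the TTL once more.
    ip_ttl -= 1
    return ip_ttl


result = ttl_after_fabric_uniform(64)
```

Under these assumptions a packet entering with TTL 64 leaves with TTL 61, i.e. one decrement per switch on the path.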
...
I just realized that since we do penultimate hop popping (i.e., spine pops MPLS), without copying the TTL between MPLS and IP, we cannot prevent loops inside the fabric...
I still think that decrementing the IP TTL by the number of hops inside the fabric is wrong. IMO the fabric should behave like one big router between the access devices and the Internet. The fact that we use MPLS tunnels internally is an implementation detail, and the IP TTL should not be affected by the number of switches inside the fabric. Instead, the IP TTL should be frozen when inside the tunnel.
However, using penultimate hop popping doesn't leave us any other choice if we want protection against loops. We should change segmentrouting to support ultimate hop popping (i.e., dest leaf pops MPLS) to be able to detect tunnels inside the fabric.
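For comparison, the "one big router" behaviour argued for above could look roughly like the sketch below. It is hypothetical and is not what segmentrouting does today: it assumes ultimate hop popping, an MPLS TTL that starts at a fixed default and is never copied to or from the IP header, and a single IP TTL decrement for the whole fabric. `DEFAULT_MPLS_TTL`, the function name, and the `labeled_hops` parameter are illustrative names.

```python
from typing import Optional

DEFAULT_MPLS_TTL = 64  # assumed default, matching the value used today


def ttl_after_fabric_pipe(ip_ttl: int, labeled_hops: int) -> Optional[int]:
    # The fabric counts as a single IP hop: decrement once at ingress.
    ip_ttl -= 1
    if ip_ttl <= 0:
        return None  # expired before entering the tunnel

    # The MPLS TTL is independent of the IP TTL and lives only inside
    # the tunnel. Every label-switching hop decrements it, so a packet
    # looping inside the fabric is eventually dropped.
    mpls_ttl = DEFAULT_MPLS_TTL
    for _ in range(labeled_hops):
        mpls_ttl -= 1
        if mpls_ttl <= 0:
            return None  # dropped inside the tunnel (loop protection)

    # Destination leaf pops the label; the IP TTL stayed frozen.
    return ip_ttl


normal = ttl_after_fabric_pipe(64, labeled_hops=2)
looping = ttl_after_fabric_pipe(64, labeled_hops=1000)
```

The IP TTL drops by exactly one regardless of the path length, while the label TTL still bounds how long a looping packet can survive.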
What SR does today is completely legit as described in RFC3443 section 3.1.
I find it hard to justify the benefit of making such changes, given the amount of work that needs to be done. We need a stronger reason to prioritize this.
I agree that we don't have strong reasons to do this change. I just wanted to voice my concern.
I will make the change to fix the TTL behavior such that we comply with RFC3443 section 3.1, but most importantly with segmentrouting flow objectives.