Healing doesn't work properly in the case of multiple NSCs
Description
Related to: https://github.com/networkservicemesh/sdk/issues/1357
Currently, healing doesn't work properly if we have several NSCs, because when the NSE is restarted, point2pointipam
loses information about the issued IP addresses.
After the restart, the order of servicing and issuing addresses during the healing process may not coincide with the initial request (before death).
For example:
- We have 2 successfully connected NSCs; for example, NSC-1 got src 172.16.1.1 and dst 172.16.1.0.
- NSE was restarted. NSCs started healing, but NSC-2 sent its Request before NSC-1. The restarted NSE issues new addresses in the order of the requests. Since we're storing the previous addresses in the Request, NSC-1 ends up with both the old and the newly issued src/dst addresses.
- Datapath healing is broken, because NSC-1 cannot ping 172.16.1.0 anymore.
Possible solution:
We can try to recover IP addresses from the Request inside point2pointipam. In this case, clients will only have old addresses.
@edwarnicke @denis-tingaikin Any thoughts?
@glazychev-art Is this an issue of both IP addresses being in the response to the Request?
I can see this happening in one of two ways, either:
- NSC1 sends the Request with Request.Connection.ConnectionContext.IPContext.SrcIP containing 172.16.1.1, and then point2pointipam adds 172.16.1.3 to the list and both IPs are returned to the NSC, or
- NSC1 sends the Request with Request.Connection.ConnectionContext.IPContext.SrcIP containing 172.16.1.1, and then point2pointipam replaces it with 172.16.1.3 and one IP address is sent back to NSC1, but the Forwarder fails to clean up the previous IP address.
Which is it? Or is it something else?
@edwarnicke Yes, the first way is correct. Also, Request.Connection.ConnectionContext.IPContext.DstIP contains two addresses as well - 172.16.1.0 (before healing) and 172.16.1.2 (after healing). But 172.16.1.0 is already wrong in this situation, so datapath healing does not work.
I think it would be great to use old addresses after healing.
> I think it would be great to use old addresses after healing.
@glazychev-art That was what I was thinking. There's a tricky bit in there though because right now point2pointipam leaves space for other point2pointipam instances to add IPs. So it can't really tell the difference between "This is a new IP" and "Someone before me in the chain set this IP as part of IPAM".
@edwarnicke You are right. Maybe we should at least try to recover the addresses? The logic could be like this:
1. NSE has a couple of IPAMs:
   - ipv4IPAM - point2pointipam(ipv4CIDR)
   - ipv6IPAM - point2pointipam(ipv6CIDR)
2A. If the NSE receives the first request from the client, ipv4IPAM adds ipv4 addresses to Request.Connection.ConnectionContext.IPContext.DstIP/SrcIP. The next ipv6IPAM sees that the IPContext already contains something and tries to restore it using its ip-pools. If it fails (in this case it will fail), it adds its ipv6 address.
2B. If the NSE receives the request after a restart (healing), it sequentially checks all IPAMs to see whether any of them contains addresses from the IPContext. If yes, we restore and use the old addresses; if not, we allocate new ones.
2C. If it is a refresh, the IPAM already contains the information in its map.
Briefly: if we can, we restore the address; if not, we allocate a new one.
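A minimal sketch of that restore-or-allocate idea, assuming a hypothetical ipamServer with its own prefixes and a simple used-address map (illustrative only, not the actual point2pointipam code):

```go
package main

import (
	"fmt"
	"net/netip"
)

// ipamServer is a hypothetical stand-in for one IPAM in the chain.
type ipamServer struct {
	prefixes []netip.Prefix          // CIDRs this IPAM issues from
	used     map[netip.Addr]struct{} // addresses already handed out
}

// restoreOrAllocate returns an address from the incoming IPContext if it falls
// inside one of this IPAM's prefixes (healing after an NSE restart); otherwise
// it allocates a fresh one.
func (s *ipamServer) restoreOrAllocate(requested []netip.Addr) (netip.Addr, error) {
	for _, addr := range requested {
		for _, p := range s.prefixes {
			if p.Contains(addr) {
				s.used[addr] = struct{}{} // restore: mark as issued again
				return addr, nil
			}
		}
	}
	// Nothing to restore: allocate the first free address.
	for _, p := range s.prefixes {
		for addr := p.Addr(); p.Contains(addr); addr = addr.Next() {
			if _, taken := s.used[addr]; !taken {
				s.used[addr] = struct{}{}
				return addr, nil
			}
		}
	}
	return netip.Addr{}, fmt.Errorf("no addresses left")
}

func main() {
	ipam := &ipamServer{
		prefixes: []netip.Prefix{netip.MustParsePrefix("172.16.1.0/24")},
		used:     map[netip.Addr]struct{}{},
	}
	// Healing: the NSC still carries 172.16.1.1 in its Request, so it is restored.
	addr, _ := ipam.restoreOrAllocate([]netip.Addr{netip.MustParseAddr("172.16.1.1")})
	fmt.Println(addr) // 172.16.1.1
}
```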
@edwarnicke
Additional question:
Do we understand correctly that the IP addresses for the client and endpoint should not change? What if, for example, healing happens with reselect?
@glazychev-art Whether the IP addresses change on heal will be up to the NSE. I expect many (most?) NSEs will want to provide IP address stability, but maybe not all.
From NSM's point of view:
- If we heal and the new NSE permits the same IP address, we should support that correctly.
- If we heal, and the new NSE changes the IP address, we should support that correctly as well.
I suspect the best answer for our current point2pointipam would be to simply compare any existing IP in the chain to the prefixes from which it is issuing IPs. If the incoming Request has an IP that is in the prefixes a particular p2pipam is issuing from, it could accept that and not issue a new one.
Thoughts?
@edwarnicke
> I suspect the best answer for our current point2pointipam would be to simply compare any existing IP in the chain to the prefixes from which it is issuing IPs. If the incoming Request has an IP that is in the prefixes a particular p2pipam is issuing from, it could accept that and not issue a new one.
Yes, I agree. I think that's what I described here - https://github.com/networkservicemesh/sdk/issues/1358#issuecomment-1259074610
@glazychev-art Yes, but the prefix check is important.
For example:
- IPAM1 - point2pointipam(10.0.0.0/24)
- IPAM2 - point2pointipam(10.0.1.0/24)
We don't want IPAM2 deciding it doesn't have to do its work because 10.0.0.1/32 has been assigned already by IPAM1. That's why the prefix checking is important.
@edwarnicke Yes, so:
- IPAM1 adds 10.0.0.1/32
- IPAM2 checks that 10.0.0.1/32 doesn't belong to 10.0.1.0/24 and adds 10.0.1.1/32
Right?
Exactly :)
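As a short illustrative check of the two-IPAM example above (hypothetical helper name, not the actual chain element code), each point2pointipam instance only adopts an incoming address that falls inside its own pool:

```go
package main

import (
	"fmt"
	"net/netip"
)

// belongsToPool reports whether addr falls inside any of the prefixes a
// particular point2pointipam instance issues from.
func belongsToPool(addr netip.Addr, pool []netip.Prefix) bool {
	for _, p := range pool {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	ipam1 := []netip.Prefix{netip.MustParsePrefix("10.0.0.0/24")}
	ipam2 := []netip.Prefix{netip.MustParsePrefix("10.0.1.0/24")}
	existing := netip.MustParseAddr("10.0.0.1") // already added by IPAM1

	fmt.Println(belongsToPool(existing, ipam1)) // true  -> IPAM1 accepts it, no new allocation
	fmt.Println(belongsToPool(existing, ipam2)) // false -> IPAM2 still adds 10.0.1.1/32
}
```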
Hey,
Maybe we can add the IPAM as a CRD, just like NetworkService / NetworkServiceEndpoint? Or maybe reuse the IPAM replica set used in VL3, in conjunction with a CRD.
I would like to think the IPAM works like DHCP.
Scenario A:
NSC -> NSE, connection is new, ask registry / IPAM CR for new address, store the details like NSC id / other identifier, add expiration
Scenario B:
NSC -> NSE, refresh, connection exists but is not expired, ask registry / IPAM CR to update expiry based on the NSC identifier
Scenario C:
NSC -> NSE, refresh, connection exists but it is expired, return CONNECTIONEXPIREDERROR so that the NSC can do .Close.
I am not very familiar with the entire healing / monitor mechanism yet, but I think trying to lease a new IP when the old one expires can lead to issues, and calling .Close to create a new data path seems safer here.
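A rough sketch of the DHCP-like lease idea in scenarios A-C, assuming a hypothetical leaseStore keyed by an NSC identifier; the names (Request, ErrConnectionExpired, leaseStore) are illustrative only, not an existing NSM API:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrConnectionExpired is a hypothetical error the IPAM would return so the
// NSC knows it should Close and re-Request (scenario C).
var ErrConnectionExpired = errors.New("connection expired, NSC should Close and re-Request")

type lease struct {
	addr      string
	expiresAt time.Time
}

// leaseStore is a hypothetical in-memory lease table keyed by NSC identifier.
type leaseStore struct {
	ttl    time.Duration
	leases map[string]lease
}

// Request covers scenarios A and B: allocate a lease for an unknown NSC, or
// extend the expiry of an existing, non-expired one. An expired lease is
// reported back (scenario C) instead of silently handing out a new address.
func (s *leaseStore) Request(nscID, newAddr string, now time.Time) (string, error) {
	if l, ok := s.leases[nscID]; ok {
		if now.After(l.expiresAt) {
			return "", ErrConnectionExpired // scenario C
		}
		l.expiresAt = now.Add(s.ttl) // scenario B: refresh the expiry
		s.leases[nscID] = l
		return l.addr, nil
	}
	s.leases[nscID] = lease{addr: newAddr, expiresAt: now.Add(s.ttl)} // scenario A
	return newAddr, nil
}

func main() {
	store := &leaseStore{ttl: 10 * time.Minute, leases: map[string]lease{}}
	now := time.Now()

	addr, _ := store.Request("nsc-1", "172.16.1.1", now) // scenario A: new connection
	fmt.Println(addr)

	addr, _ = store.Request("nsc-1", "", now.Add(5*time.Minute)) // scenario B: refresh
	fmt.Println(addr)

	_, err := store.Request("nsc-1", "", now.Add(20*time.Minute)) // scenario C: expired
	fmt.Println(err)
}
```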
The same applies in VL3 if I'm not mistaken. The IPAM replica set has a CIDR, let's say x.x.x.x/16, and NSEs ask for an x.x.x.x/24. We still encounter the same issue here when using multiple NSCs.
Thanks
I think it is possible to make IPAM a CRD, and it would allow us to remove https://github.com/networkservicemesh/cmd-ipam-vl3
All concurrency problems are resolvable with k8s.
Only one question here: how will we support IPAM for non-k8s systems? :)
@yuraxdrumz It would totally be possible to add a crd based ipam chain element. We do need to be cautious to maintain generalizability though because:
- NSM runs in non-K8s spaces
- NSEs will have wildly varying opinions about how they want to manage IPAM.
So for example having a crdipam chain element would be a fantastic tool for folks to have (and easy to write in sdk-k8s), making it the 'one true ipam' is probably suboptimal :)
@denis-tingaikin @edwarnicke
I agree we need to keep it general. I thought about reusing the same design as with the registry.
By default, we use registry-k8s, but we also have registry-memory, and people that use NSM in a non-k8s environment can implement the interface to interact with the registry. The same can be achieved with the IPAM.
We can have IPAM-k8s and IPAM-memory, which is basically the code we have today that the NSEs use, and if someone wants something custom, we have a simple interface to implement.
I am guessing that it will require another proxy for the inter-domain scenario though.
What are your thoughts?
> I think it is possible to make IPAM a CRD, and it would allow us to remove https://github.com/networkservicemesh/cmd-ipam-vl3
> All concurrency problems are resolvable with k8s.
> Only one question here: how will we support IPAM for non-k8s systems? :)
I think we can do what I suggested with the registry: keep the IPAM and make it run as ipam-k8s for all the k8s needs, and add an ipam-memory to support non-k8s systems.
The NSEs, both regular and VL3, will ask the IPAM server for a prefix, almost the same as today.
When a client connects to the NSE, the NSE will call the IPAM server again for .RegisterClient.
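For illustration, the shared contract could be as small as the interface below; the method names and signatures here are hypothetical, not an existing NSM API, with ipam-k8s and ipam-memory as two possible implementations:

```go
package ipam

import "context"

// Server is a hypothetical contract an NSE (regular or vL3) would talk to,
// regardless of whether the backend is ipam-k8s or ipam-memory.
type Server interface {
	// RequestPrefix asks for a CIDR for the endpoint, e.g. a /24 out of a /16.
	RequestPrefix(ctx context.Context, endpointName string) (prefix string, err error)
	// RegisterClient records the address handed to a client so the same client
	// can get the same address back across NSE restarts.
	RegisterClient(ctx context.Context, endpointName, clientID, addr string) error
	// Release frees everything associated with an endpoint or a client.
	Release(ctx context.Context, id string) error
}
```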
Hey,
I talked about the IPAM because the end result I am looking for is to have client and endpoint addresses unique and preserved across restarts, regardless of using 1 NSC, multiple NSCs, or VL3.
I did a few steps to check some things out:
- Fork api and add a new IPAM server V2 with new methods, like CreateEndpoint, which act exactly the same as ManagePrefixes currently for the in-memory implementation
- Fork sdk and write the code for ipam/chains/memory according to the above IPAM server V2
- Fork cmd-ipam-vl3 and implement a new cmd-ipam-memory
- Run the ipam-memory the same as the cmd-ipam-vl3 in my k8s cluster
- Fork cmd-nse-icmp-responder, add the code for communicating with cmd-ipam-memory

Now, I see the endpoints work exactly the same as with vl3:
Oct 3 14:56:30.568 [INFO] [cmd:[/bin/app]] Configuration: &{[{tcp :5006 false }] TRACE otel-collector.observability.svc.cluster.local:4317 169.254.0.0/16 24}
Oct 3 14:56:37.107 [INFO] [cmd:[/bin/app]] SVID: "spiffe://example.org/ns/nsm-system/pod/ipam-memory-84845d75f5-fr7rk/012ee2e8-147f-4702-a279-7aebd11524b4"
2022/10/06 09:25:28 [DEBUG] starting to allocate new prefix for endpoint name = nse-inet-57d959775d-vvjgf
2022/10/06 09:25:28 [DEBUG] didn't find prefix, allocating new prefix = 169.254.0.0/24
2022/10/06 09:25:28 [DEBUG] adding request.prefix = 169.254.0.0/24 to excludedPrefixes
2022/10/06 09:25:28 [DEBUG] adding response.prefix = 169.254.0.0/24 to clientsPrefixes
2022/10/06 09:25:28 [DEBUG] excluding prefix from pool, prefix = 169.254.0.0/24
2022/10/06 09:26:52 [DEBUG] starting to allocate new prefix for endpoint name = nse-inet-57d959775d-97bh2
2022/10/06 09:26:52 [DEBUG] didn't find prefix, allocating new prefix = 169.254.1.0/24
2022/10/06 09:26:52 [DEBUG] adding request.prefix = 169.254.1.0/24 to excludedPrefixes
2022/10/06 09:26:52 [DEBUG] adding response.prefix = 169.254.1.0/24 to clientsPrefixes
2022/10/06 09:26:52 [DEBUG] excluding prefix from pool, prefix = 169.254.1.0/24
2022/10/06 09:52:18 [DEBUG] starting to allocate new prefix for endpoint name = nse-inet-57d959775d-rpssj
2022/10/06 09:52:18 [DEBUG] didn't find prefix, allocating new prefix = 169.254.2.0/24
2022/10/06 09:52:18 [DEBUG] adding request.prefix = 169.254.2.0/24 to excludedPrefixes
2022/10/06 09:52:18 [DEBUG] adding response.prefix = 169.254.2.0/24 to clientsPrefixes
2022/10/06 09:52:18 [DEBUG] excluding prefix from pool, prefix = 169.254.2.0/24
My question is, how can we preserve cidrs/addresses across restarts?
I see the IPAM flow as something like the image below
For example, when an NSE restarts, I have no idea how to identify it in the IPAM.
I looked at the implementation of the VL3 IPAM and to my understanding, if I restart a VL3 NSE, it will be allocated a new CIDR from the IPAM and all clients that were connected to it remain with old addresses.
Do you think the direction I'm going for is a good one? That is, having a single IPAM that controls all addresses (both for NSEs, like today in VL3, but for all types of endpoints, and for NSCs), with an option to fall back to the original in-memory NSE IPAM in case the IPAM is not available.
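A rough sketch of the fallback part of that question, assuming hypothetical names (Allocator, fallbackIPAM): try the external IPAM first and fall back to the NSE's local in-memory allocation when it is unavailable.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Allocator is the single operation both IPAMs share in this sketch.
type Allocator interface {
	Allocate(ctx context.Context, clientID string) (string, error)
}

// fallbackIPAM prefers the external IPAM and falls back to the NSE-local one.
type fallbackIPAM struct {
	external Allocator // e.g. a central ipam-k8s / ipam-memory server
	local    Allocator // e.g. today's in-memory point2pointipam behaviour
}

func (f *fallbackIPAM) Allocate(ctx context.Context, clientID string) (string, error) {
	if addr, err := f.external.Allocate(ctx, clientID); err == nil {
		return addr, nil
	}
	// External IPAM unreachable: keep working from the local pool.
	return f.local.Allocate(ctx, clientID)
}

// staticAllocator is a stub used only to make the example runnable.
type staticAllocator struct {
	addr string
	err  error
}

func (s staticAllocator) Allocate(context.Context, string) (string, error) { return s.addr, s.err }

func main() {
	ipam := &fallbackIPAM{
		external: staticAllocator{err: errors.New("ipam server unavailable")},
		local:    staticAllocator{addr: "169.254.2.1"},
	}
	addr, _ := ipam.Allocate(context.Background(), "nsc-1")
	fmt.Println(addr) // 169.254.2.1, allocated by the local fallback
}
```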
EDIT:
I just restarted my NSE:
- Before, the cidr block it received from IPAM was 169.254.2.0/24
- After restart, the cidr block is 169.254.3.0/24
- Client did .Request, which resulted in the endpoint having 2 addresses, 169.254.2.0 and 169.254.3.0
- Client has 2 addresses as well, 169.254.2.1 and 169.254.3.1, both work.
- Client pinger still tries to connect to ipContext.DstAddreesses[0], which is 169.254.2.0
Oct 6 12:58:03.706 [INFO] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (2.14) expiration after 9m59.713992246s at 2022-10-06 13:08:03.420491532 +0000 UTC
Oct 6 12:58:08.706 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.1) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:58:08.706 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.2) Creating new pinger with address=169.254.2.0
Oct 6 12:58:09.284 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.25) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:58:23.807 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.3) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:58:23.807 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.4) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:58:23.807 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.5) Creating new pinger with address=169.254.2.0
Oct 6 12:58:38.909 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.6) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:58:38.909 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.7) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:58:38.909 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.8) Creating new pinger with address=169.254.2.0
Oct 6 12:58:54.011 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.9) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:58:54.011 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.10) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:58:54.011 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.11) Creating new pinger with address=169.254.2.0
Oct 6 12:59:09.113 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.12) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:59:09.113 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.13) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:59:09.113 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.14) Creating new pinger with address=169.254.2.0
Oct 6 12:59:24.215 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.15) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:59:24.216 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.16) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:59:24.216 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.17) Creating new pinger with address=169.254.2.0
- Policy Based Routing stays the same and next hop is not updated
This got me thinking,
We don't really care about client addresses, only about the NSE cidr block.
We do, however, care about the same client receiving the same IP it had, to avoid the double-IP scenarios.
If an NSE restarts, we want 2 things to happen:
- NSE cidr block stays the same with static cidr via ENV_VAR
- NSCs' client addresses are preserved from ipContext and healing works as expected
I saw commits from the past week regarding keeping the same client IPs
OR
- NSE cidr block stays the same with IPAM dynamic cidr generation + some kind of restore mechanism to identify each NSE
- NSC's client addresses are preserved from ipContext and healing works as expected
OR
- NSE gets new cidr from IPAM
- Doing a healing with replacing the IPs / calling .Close for new vWires to be created
I have another thing that needs checking.
I have 2 NSCs and 2 NSEs, all on different k8s nodes. The NSEs offer the same NetworkService.
Both NSEs have a static cidr block:
env:
  - name: NSM_CIDR_PREFIX
    value: 172.16.1.100/28
The NSCs connect to the NSEs in a round-robin manner.
I got a scenario, like the following:
NSC1 src: 172.16.1.97, NSE1 dst: 172.16.1.96
NSC2 src: 172.16.1.97, NSE2 dst: 172.16.1.96
Now, NSE1 goes offline / rescheduled.
NSC1 tries to connect to NSE2 with the same address as NSC2 and it causes forwarder errors until it retries enough to connect back to NSE1.
Now, if I understand correctly, even if we use dynamic CIDR generation, or static but non-overlapping CIDRs between NSEs, for example:
NSC1 src: 172.16.1.97, NSE1 dst: 172.16.1.96
NSC2 src: 172.16.2.97, NSE2 dst: 172.16.2.96
We still hit a scenario during healing, where the NSC will have 172.16.1.97, but the NSE will have CIDR 172.16.2.96.
What happens / should happen in this scenario?
Do we need to have a single CIDR block with an external store (like the IPAM we talked about) for several NSEs, or can we do healing / .Cancel on the NSC to reconnect with a new address?