Healing doesn't work properly in the case of multiple NSCs
Description
Related to: https://github.com/networkservicemesh/sdk/issues/1357
Currently, healing doesn't work properly if we have several NSCs, because when the NSE is restarted, point2pointipam
loses information about the issued IP addresses.
After the restart, the order of servicing and issuing addresses during the healing process may not coincide with the initial request (before death).
For example:
- We have 2 successfully connected NSCs; for example, NSC-1 got src 172.16.1.1 and dst 172.16.1.0.
- NSE was restarted. NSCs started healing, but NSC-2 sent its Request before NSC-1. The restarted NSE issues new addresses in the order of the requests. Since we're storing the previous addresses in the Request, NSC-1 ends up with both the old and the newly issued src/dst addresses.
- Datapath healing is broken, because NSC-1 cannot ping 172.16.1.0 anymore.
Possible solution:
We can try to recover IP addresses from the Request inside point2pointipam. In this case, clients will only have old addresses.
@edwarnicke @denis-tingaikin Any thoughts?
@glazychev-art Is this an issue of both IP addresses being in the response to the Request?
I can see this happening in one of two ways, either:
- NSC1 sends the Request with Request.Connection.ConnectionContext.IPContext.SrcIP containing 172.16.1.1, and then point2pointipam adds 172.16.1.3 to the list and both IPs are returned to the NSC, or
- NSC1 sends the Request with Request.Connection.ConnectionContext.IPContext.SrcIP containing 172.16.1.1, and then point2pointipam replaces it with 172.16.1.3 and one IP address is sent back to NSC1, but the Forwarder fails to clean up the previous IP address.
Which is it? Or is it something else?
@edwarnicke Yes, the first way is correct. Also, Request.Connection.ConnectionContext.IPContext.DstIP contains two addresses as well - 172.16.1.0 (before healing) and 172.16.1.2 (after healing). But 172.16.1.0 is already wrong in this situation, so datapath healing does not work.
I think it would be great to use old addresses after healing.
> I think it would be great to use old addresses after healing.
@glazychev-art That was what I was thinking. There's a tricky bit in there though because right now point2pointipam leaves space for other point2pointipam instances to add IPs. So it can't really tell the difference between "This is a new IP" and "Someone before me in the chain set this IP as part of IPAM".
@edwarnicke You are right. Maybe we should at least try to recover the addresses? The logic could be like this:
1. NSE has a couple of IPAMs:
   - ipv4IPAM - point2pointipam(ipv4CIDR)
   - ipv6IPAM - point2pointipam(ipv6CIDR)
2A. If the NSE receives the first request from the client, ipv4IPAM adds ipv4 addresses to Request.Connection.ConnectionContext.IPContext.DstIP/SrcIP. The next ipv6IPAM sees that the IPContext already contains something and tries to restore it using its ip-pools. If it fails (in this case it will fail), it adds its ipv6 address.
2B. If the NSE receives the request after a restart (healing), it sequentially checks all IPAMs to see whether any of them contains addresses from the IPContext. If yes, we restore and use the old addresses; if not, we allocate new ones.
2C. If it is a refresh, the IPAM already contains the information in its map.
Briefly: if we can, we restore the address; if not, we allocate a new one.
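A minimal sketch of that restore-or-allocate idea, assuming a hypothetical ipamServer with its own prefixes and a simple used-address map (illustrative only, not the actual point2pointipam code):

```go
package main

import (
	"fmt"
	"net/netip"
)

// ipamServer is a hypothetical stand-in for one IPAM in the chain.
type ipamServer struct {
	prefixes []netip.Prefix          // CIDRs this IPAM issues from
	used     map[netip.Addr]struct{} // addresses already handed out
}

// restoreOrAllocate returns an address from the incoming IPContext if it falls
// inside one of this IPAM's prefixes (healing after an NSE restart); otherwise
// it allocates a fresh one.
func (s *ipamServer) restoreOrAllocate(requested []netip.Addr) (netip.Addr, error) {
	for _, addr := range requested {
		for _, p := range s.prefixes {
			if p.Contains(addr) {
				s.used[addr] = struct{}{} // restore: mark as issued again
				return addr, nil
			}
		}
	}
	// Nothing to restore: allocate the first free address.
	for _, p := range s.prefixes {
		for addr := p.Addr(); p.Contains(addr); addr = addr.Next() {
			if _, taken := s.used[addr]; !taken {
				s.used[addr] = struct{}{}
				return addr, nil
			}
		}
	}
	return netip.Addr{}, fmt.Errorf("no addresses left")
}

func main() {
	ipam := &ipamServer{
		prefixes: []netip.Prefix{netip.MustParsePrefix("172.16.1.0/24")},
		used:     map[netip.Addr]struct{}{},
	}
	// Healing: the NSC still carries 172.16.1.1 in its Request, so it is restored.
	addr, _ := ipam.restoreOrAllocate([]netip.Addr{netip.MustParseAddr("172.16.1.1")})
	fmt.Println(addr) // 172.16.1.1
}
```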
@edwarnicke
Additional question:
Do we understand correctly that the IP addresses for the client and endpoint should not change? What if, for example, healing happens with reselect?
@glazychev-art Whether the IP addresses change on heal will be up to the NSE. I expect many (most?) NSEs will want to provide IP address stability, but maybe not all.
From NSM's point of view:
- If we heal and the new NSE permits the same IP address, we should support that correctly.
- If we heal, and the new NSE changes the IP address, we should support that correctly as well.
I suspect the best answer for our current point2pointipam would be to simply compare any existing IP in the chain to the prefixes from which it is issuing IPs. If the incoming Request has an IP that is in the prefixes a particular p2pipam is issuing from, it could accept that and not issue a new one.
Thoughts?
@edwarnicke
> I suspect the best answer for our current point2pointipam would be to simply compare any existing IP in the chain to the prefixes from which it is issuing IPs. If the incoming Request has an IP that is in the prefixes a particular p2pipam is issuing from, it could accept that and not issue a new one.
Yes, I agree. I think that's what I described here - https://github.com/networkservicemesh/sdk/issues/1358#issuecomment-1259074610
@glazychev-art Yes, but the prefix check is important.
For example:
- IPAM1 - point2pointipam(10.0.0.0/24)
- IPAM2 - point2pointipam(10.0.1.0/24)
We don't want IPAM2 deciding it doesn't have to do its work because 10.0.0.1/32 has been assigned already by IPAM1. That's why the prefix checking is important.
@edwarnicke Yes, so:
- IPAM1 adds 10.0.0.1/32
- IPAM2 checks that 10.0.0.1/32 doesn't belong to 10.0.1.0/24 and adds 10.0.1.1/32
Right?
Exactly :)
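As a short illustrative check of the two-IPAM example above (hypothetical helper name, not the actual chain element code), each point2pointipam instance only adopts an incoming address that falls inside its own pool:

```go
package main

import (
	"fmt"
	"net/netip"
)

// belongsToPool reports whether addr falls inside any of the prefixes a
// particular point2pointipam instance issues from.
func belongsToPool(addr netip.Addr, pool []netip.Prefix) bool {
	for _, p := range pool {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	ipam1 := []netip.Prefix{netip.MustParsePrefix("10.0.0.0/24")}
	ipam2 := []netip.Prefix{netip.MustParsePrefix("10.0.1.0/24")}
	existing := netip.MustParseAddr("10.0.0.1") // already added by IPAM1

	fmt.Println(belongsToPool(existing, ipam1)) // true  -> IPAM1 accepts it, no new allocation
	fmt.Println(belongsToPool(existing, ipam2)) // false -> IPAM2 still adds 10.0.1.1/32
}
```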
Hey,
Maybe we can add the IPAM as a CRD, just like NetworkService / NetworkServiceEndpoint? Or maybe reuse the IPAM replica set used in VL3, in conjunction with a CRD.
I would like to think the IPAM works like DHCP.
Scenario A:
NSC -> NSE, connection is new, ask registry / IPAM CR for new address, store the details like NSC id / other identifier, add expiration
Scenario B:
NSC -> NSE, refresh, connection exists but is not expired, ask registry / IPAM CR to update expiry based on the NSC identifier
Scenario C:
NSC -> NSE, refresh, connection exists but it is expired, return CONNECTIONEXPIREDERROR so that the NSC can do .Close.
I am not very familiar with the entire healing / monitor mechanism yet, but I think trying to lease a new IP when the old one expires can lead to issues, and calling .Close to create a new data path seems safer here.
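A rough sketch of the DHCP-like lease idea in scenarios A-C, assuming a hypothetical leaseStore keyed by an NSC identifier; the names (Request, ErrConnectionExpired, leaseStore) are illustrative only, not an existing NSM API:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrConnectionExpired is a hypothetical error the IPAM would return so the
// NSC knows it should Close and re-Request (scenario C).
var ErrConnectionExpired = errors.New("connection expired, NSC should Close and re-Request")

type lease struct {
	addr      string
	expiresAt time.Time
}

// leaseStore is a hypothetical in-memory lease table keyed by NSC identifier.
type leaseStore struct {
	ttl    time.Duration
	leases map[string]lease
}

// Request covers scenarios A and B: allocate a lease for an unknown NSC, or
// extend the expiry of an existing, non-expired one. An expired lease is
// reported back (scenario C) instead of silently handing out a new address.
func (s *leaseStore) Request(nscID, newAddr string, now time.Time) (string, error) {
	if l, ok := s.leases[nscID]; ok {
		if now.After(l.expiresAt) {
			return "", ErrConnectionExpired // scenario C
		}
		l.expiresAt = now.Add(s.ttl) // scenario B: refresh the expiry
		s.leases[nscID] = l
		return l.addr, nil
	}
	s.leases[nscID] = lease{addr: newAddr, expiresAt: now.Add(s.ttl)} // scenario A
	return newAddr, nil
}

func main() {
	store := &leaseStore{ttl: 10 * time.Minute, leases: map[string]lease{}}
	now := time.Now()

	addr, _ := store.Request("nsc-1", "172.16.1.1", now) // scenario A: new connection
	fmt.Println(addr)

	addr, _ = store.Request("nsc-1", "", now.Add(5*time.Minute)) // scenario B: refresh
	fmt.Println(addr)

	_, err := store.Request("nsc-1", "", now.Add(20*time.Minute)) // scenario C: expired
	fmt.Println(err)
}
```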
The same applies in VL3 if I'm not mistaken. The IPAM replica set has a CIDR, let's say x.x.x.x/16, and NSEs ask for an x.x.x.x/24. We still encounter the same issue here when using multiple NSCs.
Thanks
I think it is possible to make IPAM a CRD, and it would allow us to remove https://github.com/networkservicemesh/cmd-ipam-vl3
All concurrency problems are resolvable with k8s.
Only one question here: how will we support IPAM for non-k8s systems? :)
@yuraxdrumz It would totally be possible to add a crd based ipam chain element. We do need to be cautious to maintain generalizability though because:
- NSM runs in non-K8s spaces
- NSEs will have wildly varying opinions about how they want to manage IPAM.
So for example having a crdipam chain element would be a fantastic tool for folks to have (and easy to write in sdk-k8s), making it the 'one true ipam' is probably suboptimal :)
@denis-tingaikin @edwarnicke
I agree we need to keep it general. I thought about reusing the same design as with the registry.
By default, we use registry-k8s, but we also have registry-memory, and people that use NSM in a non-k8s environment can implement the interface to interact with the registry. The same can be achieved with the IPAM.
We can have IPAM-k8s and IPAM-memory, which is basically the code we have today that the NSEs use, and if someone wants something custom, we have a simple interface to implement.
I am guessing that it will require another proxy for the inter-domain scenario though.
What are your thoughts?
> I think it is possible to make IPAM a CRD, and it would allow us to remove https://github.com/networkservicemesh/cmd-ipam-vl3
> All concurrency problems are resolvable with k8s.
> Only one question here: how will we support IPAM for non-k8s systems? :)
I think we can do what I suggested with the registry: keep the IPAM and make it run as ipam-k8s for all the k8s needs, and add an ipam-memory to support non-k8s systems.
The NSEs, both regular and VL3, will ask the IPAM server for a prefix, almost the same as today.
When a client connects to the NSE, the NSE will call the IPAM server again for .RegisterClient.
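For illustration, the shared contract could be as small as the interface below; the method names and signatures here are hypothetical, not an existing NSM API, with ipam-k8s and ipam-memory as two possible implementations:

```go
package ipam

import "context"

// Server is a hypothetical contract an NSE (regular or vL3) would talk to,
// regardless of whether the backend is ipam-k8s or ipam-memory.
type Server interface {
	// RequestPrefix asks for a CIDR for the endpoint, e.g. a /24 out of a /16.
	RequestPrefix(ctx context.Context, endpointName string) (prefix string, err error)
	// RegisterClient records the address handed to a client so the same client
	// can get the same address back across NSE restarts.
	RegisterClient(ctx context.Context, endpointName, clientID, addr string) error
	// Release frees everything associated with an endpoint or a client.
	Release(ctx context.Context, id string) error
}
```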
Hey,
I talked about the IPAM because the end result I am looking for is to have client and endpoint addresses unique and preserved across restarts, regardless of using 1 NSC, multiple NSCs, or VL3.
I did a few steps to check some things out:
- Fork api and add a new IPAM server V2 with new methods, like CreateEndpoint, which act exactly the same as ManagePrefixes currently for the in-memory implementation
- Fork sdk and write the code for ipam/chains/memory according to the above IPAM server V2
- Fork cmd-ipam-vl3 and implement a new cmd-ipam-memory
- Run the ipam-memory the same as the cmd-ipam-vl3 in my k8s cluster
- Fork cmd-nse-icmp-responder, add the code for communicating with cmd-ipam-memory

Now, I see the endpoints work exactly the same as with vl3:
Oct 3 14:56:30.568 [INFO] [cmd:[/bin/app]] Configuration: &{[{tcp :5006 false }] TRACE otel-collector.observability.svc.cluster.local:4317 169.254.0.0/16 24}
Oct 3 14:56:37.107 [INFO] [cmd:[/bin/app]] SVID: "spiffe://example.org/ns/nsm-system/pod/ipam-memory-84845d75f5-fr7rk/012ee2e8-147f-4702-a279-7aebd11524b4"
2022/10/06 09:25:28 [DEBUG] starting to allocate new prefix for endpoint name = nse-inet-57d959775d-vvjgf
2022/10/06 09:25:28 [DEBUG] didn't find prefix, allocating new prefix = 169.254.0.0/24
2022/10/06 09:25:28 [DEBUG] adding request.prefix = 169.254.0.0/24 to excludedPrefixes
2022/10/06 09:25:28 [DEBUG] adding response.prefix = 169.254.0.0/24 to clientsPrefixes
2022/10/06 09:25:28 [DEBUG] excluding prefix from pool, prefix = 169.254.0.0/24
2022/10/06 09:26:52 [DEBUG] starting to allocate new prefix for endpoint name = nse-inet-57d959775d-97bh2
2022/10/06 09:26:52 [DEBUG] didn't find prefix, allocating new prefix = 169.254.1.0/24
2022/10/06 09:26:52 [DEBUG] adding request.prefix = 169.254.1.0/24 to excludedPrefixes
2022/10/06 09:26:52 [DEBUG] adding response.prefix = 169.254.1.0/24 to clientsPrefixes
2022/10/06 09:26:52 [DEBUG] excluding prefix from pool, prefix = 169.254.1.0/24
2022/10/06 09:52:18 [DEBUG] starting to allocate new prefix for endpoint name = nse-inet-57d959775d-rpssj
2022/10/06 09:52:18 [DEBUG] didn't find prefix, allocating new prefix = 169.254.2.0/24
2022/10/06 09:52:18 [DEBUG] adding request.prefix = 169.254.2.0/24 to excludedPrefixes
2022/10/06 09:52:18 [DEBUG] adding response.prefix = 169.254.2.0/24 to clientsPrefixes
2022/10/06 09:52:18 [DEBUG] excluding prefix from pool, prefix = 169.254.2.0/24
My question is, how can we preserve cidrs/addresses across restarts?
I see the IPAM flow as something like the image below
For example, when an NSE restarts, I have no idea how to identify it in the IPAM.
I looked at the implementation of the VL3 IPAM and to my understanding, if I restart a VL3 NSE, it will be allocated a new CIDR from the IPAM and all clients that were connected to it remain with old addresses.
Do you think the direction I'm going for is a good one? That is, having a single IPAM that controls all addresses (both for NSEs, like today in VL3, but for all types of endpoints, and for NSCs), with an option to fall back to the original in-memory NSE IPAM in case the IPAM is not available.
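A rough sketch of the fallback part of that question, assuming hypothetical names (Allocator, fallbackIPAM): try the external IPAM first and fall back to the NSE's local in-memory allocation when it is unavailable.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Allocator is the single operation both IPAMs share in this sketch.
type Allocator interface {
	Allocate(ctx context.Context, clientID string) (string, error)
}

// fallbackIPAM prefers the external IPAM and falls back to the NSE-local one.
type fallbackIPAM struct {
	external Allocator // e.g. a central ipam-k8s / ipam-memory server
	local    Allocator // e.g. today's in-memory point2pointipam behaviour
}

func (f *fallbackIPAM) Allocate(ctx context.Context, clientID string) (string, error) {
	if addr, err := f.external.Allocate(ctx, clientID); err == nil {
		return addr, nil
	}
	// External IPAM unreachable: keep working from the local pool.
	return f.local.Allocate(ctx, clientID)
}

// staticAllocator is a stub used only to make the example runnable.
type staticAllocator struct {
	addr string
	err  error
}

func (s staticAllocator) Allocate(context.Context, string) (string, error) { return s.addr, s.err }

func main() {
	ipam := &fallbackIPAM{
		external: staticAllocator{err: errors.New("ipam server unavailable")},
		local:    staticAllocator{addr: "169.254.2.1"},
	}
	addr, _ := ipam.Allocate(context.Background(), "nsc-1")
	fmt.Println(addr) // 169.254.2.1, allocated by the local fallback
}
```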
EDIT:
I just restarted my NSE:
- Before, the cidr block it received from IPAM was 169.254.2.0/24
- After restart, the cidr block is 169.254.3.0/24
- Client did .Request, which resulted in the endpoint having 2 addresses, 169.254.2.0 and 169.254.3.0
- Client has 2 addresses as well, 169.254.2.1 and 169.254.3.1, both work.
- Client pinger still tries to connect to ipContext.DstAddreesses[0], which is 169.254.2.0
Oct 6 12:58:03.706 [INFO] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (2.14) expiration after 9m59.713992246s at 2022-10-06 13:08:03.420491532 +0000 UTC
Oct 6 12:58:08.706 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.1) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:58:08.706 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.2) Creating new pinger with address=169.254.2.0
Oct 6 12:58:09.284 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.25) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:58:23.807 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.3) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:58:23.807 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.4) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:58:23.807 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.5) Creating new pinger with address=169.254.2.0
Oct 6 12:58:38.909 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.6) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:58:38.909 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.7) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:58:38.909 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.8) Creating new pinger with address=169.254.2.0
Oct 6 12:58:54.011 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.9) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:58:54.011 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.10) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:58:54.011 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.11) Creating new pinger with address=169.254.2.0
Oct 6 12:59:09.113 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.12) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:59:09.113 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.13) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:59:09.113 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.14) Creating new pinger with address=169.254.2.0
Oct 6 12:59:24.215 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.15) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.3.0/32
Oct 6 12:59:24.216 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.16) Parsing CIDR from conn.GetContext().GetIpContext().GetDstIpAddrs(), cidr=169.254.2.0/32
Oct 6 12:59:24.216 [DEBU] [id:ovpn-openvpn-5db5d7b6b8-lgkpk-0] [type:networkService] (5.17) Creating new pinger with address=169.254.2.0
- Policy Based Routing stays the same and next hop is not updated
This got me thinking,
We don't really care about client addresses, only about the NSE cidr block.
We do, however, care about the same client receiving the same IP it had, to avoid the double-IP scenarios.
If an NSE restarts, we want 2 things to happen:
- NSE cidr block stays the same with static cidr via ENV_VAR
- NSCs' client addresses are preserved from ipContext and healing works as expected
I saw commits from the past week regarding keeping the same client IPs
OR
- NSE cidr block stays the same with IPAM dynamic cidr generation + some kind of restore mechanism to identify each NSE
- NSC's client addresses are preserved from ipContext and healing works as expected
OR
- NSE gets new cidr from IPAM
- Doing a healing with replacing the IPs / calling .Close for new vWires to be created
I have another thing that needs checking.
I have 2 NSCs and 2 NSEs, all on different k8s nodes. The NSEs offer the same NetworkService.
Both NSEs have a static cidr block:
env:
  - name: NSM_CIDR_PREFIX
    value: 172.16.1.100/28
The NSCs connect to the NSEs in a round-robin manner.
I got a scenario, like the following:
NSC1 src: 172.16.1.97, NSE1 dst: 172.16.1.96
NSC2 src: 172.16.1.97, NSE2 dst: 172.16.1.96
Now, NSE1 goes offline / rescheduled.
NSC1 tries to connect to NSE2 with the same address as NSC2 and it causes forwarder errors until it retries enough to connect back to NSE1.
Now, if I understand correctly, even if we use dynamic CIDR generation, or static but non-overlapping CIDRs between NSEs, for example:
NSC1 src: 172.16.1.97, NSE1 dst: 172.16.1.96
NSC2 src: 172.16.2.97, NSE2 dst: 172.16.2.96
We still hit a scenario during healing, where the NSC will have 172.16.1.97, but the NSE will have CIDR 172.16.2.96.
What happens / should happen in this scenario?
Do we need to have a single CIDR block with an external store (like the IPAM we talked about) for several NSEs, or can we do healing / .Cancel on the NSC to reconnect with a new address?