[R&D] Improve the recovery of NSM clients
We have done a lot of work to improve the recovery of NSM in the past. Recently we've found a few use-cases that conceptually do not work with the existing code. I also noticed that the `heal` chain element could be simplified.
Problem scenarios
Scenario 1:
- Create a cluster with two nodes.
- Deploy NSM
- Deploy NSC without DP healing on node 1.
- Deploy NSE on node 2.
- Deploy NetSvc
- Kill forwarders on both nodes during refresh. Actual: no connection is established after the forwarders are re-deployed.
Scenario 2:
- Create a cluster with two nodes.
- Deploy NSM
- Deploy NSC without DP healing on node 1.
- Deploy NSE on node 2.
- Deploy NetSvc
- Kill any forwarder and any NSMgr during the refresh. Actual: no connection is established after the forwarders are re-deployed.
Scenario 3:
- Create a cluster with two nodes.
- Deploy NSM
- Deploy NSC without DP healing on node 1.
- Deploy NSE on node 2.
- Deploy NetSvc
- Kill the NSE on the backward path of a refresh. Actual: no connection is established after re-deployment.
Scenario 4:
- Create a cluster with two nodes.
- Deploy NSM
- Deploy NSC without DP healing on node 1.
- Deploy NSE on node 2.
- Deploy NetSvc
- Kill the registry during the refresh. Actual: no connection is established after re-deployment.
Solution
Rework a few chain elements and reorganize the code.
Changes
- `heal` chain element (a rough sketch of this follows the list):
  1.1. simply starts the DP/CP monitoring goroutines.
  1.2. If something goes wrong with CP monitoring, a refresh request should be scheduled.
  1.3. If something goes wrong with DP monitoring, a refresh request with reselection should be scheduled.
- `retry` chain element:
  2.1. should not be a wrapper.
  2.2. should use `begin` for retries.
- `begin` should be able to cancel the current request if a reselect request is called.
- The RESELECT_REQUESTED state should be removed.
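Below is a minimal sketch of the proposed `heal` behaviour. It is not the real sdk API: `eventFactory`, `ScheduleRefresh` and `ScheduleRefreshWithReselect` are illustrative stand-ins for whatever the event factory would actually expose. The only point being shown is that the element owns two monitoring goroutines and always drives recovery through the event factory rather than calling Request itself.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type eventFactory interface {
	ScheduleRefresh()             // schedule a refresh Request over the same path
	ScheduleRefreshWithReselect() // schedule a refresh Request that asks for a new path
}

// startHeal launches the two monitoring goroutines the reworked heal chain
// element would own. The element never calls Request itself: it only tells
// the event factory what kind of refresh is needed.
func startHeal(ctx context.Context, ef eventFactory, cpFailed, dpFailed <-chan struct{}) {
	go func() { // control-plane monitoring: a lost CP only needs a plain refresh
		select {
		case <-ctx.Done():
		case <-cpFailed:
			ef.ScheduleRefresh()
		}
	}()
	go func() { // data-plane monitoring: a lost DP needs a refresh with reselection
		select {
		case <-ctx.Done():
		case <-dpFailed:
			ef.ScheduleRefreshWithReselect()
		}
	}()
}

type fakeFactory struct{}

func (fakeFactory) ScheduleRefresh()             { fmt.Println("refresh scheduled") }
func (fakeFactory) ScheduleRefreshWithReselect() { fmt.Println("refresh with reselect scheduled") }

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	dpFailed := make(chan struct{})
	startHeal(ctx, fakeFactory{}, make(chan struct{}), dpFailed)

	close(dpFailed) // simulate a data-plane failure
	time.Sleep(100 * time.Millisecond)
}
```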
Another problem scenario: Scenario 5:
- Create a cluster.
- Deploy NSM
- Deploy NSC and NSE
- Deploy NetSvc
- Kill the NSE after cmd-nsc-init has finished but before cmd-nsc has finished its first Request. Actual: the connection is not restored by cmd-nsc.
Description:
After a connection is established by the init-container, it remains unattended for some time until cmd-nsc sends a request.
During this short time, any event that disrupts the connection can occur.
Apparently, this is the responsibility of the `heal` chain element: to monitor and restore such connections.
We found that for `begin.Request()` and for `eventFactory.Request()` we generally use very similar logic.
Over time, the amount of code grows in both places, and we eventually end up supporting two versions of it.
We got the idea to call `eventFactory.Request()` from `begin.Request()`, rather than using `eventFactory.executor` directly.
Most likely, this can also help with modifying the retry functionality: we could drive it from the `eventFactory` (see Changes step 2 in the Description). A sketch of this delegation is shown below.
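Here is what that delegation could look like, using simplified stand-in types (`request`, `connection`, a channel-based executor) rather than the real sdk ones. The only point being illustrated is that `begin` calls `eventFactory.Request()` and never touches the executor field, so refresh, heal and retry would all funnel through the same code path.

```go
package main

import (
	"context"
	"fmt"
)

type request struct{ id string }
type connection struct{ id string }

// eventFactoryT serializes all events for one connection through a single
// executor goroutine; Request is the only entry point, so begin, refresh and
// heal all share exactly the same code path.
type eventFactoryT struct {
	executor chan func() // tasks run strictly one at a time, in order
}

func newEventFactory() *eventFactoryT {
	ef := &eventFactoryT{executor: make(chan func(), 16)}
	go func() {
		for task := range ef.executor {
			task()
		}
	}()
	return ef
}

func (ef *eventFactoryT) Request(ctx context.Context, req *request) (*connection, error) {
	done := make(chan struct{})
	var conn *connection
	ef.executor <- func() {
		defer close(done)
		conn = &connection{id: req.id} // the real code would call the next chain element here
	}
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case <-done:
		return conn, nil
	}
}

// beginRequest shows the shape begin.Request() would take: no direct access
// to the executor, just a call into the shared event-factory path.
func beginRequest(ctx context.Context, ef *eventFactoryT, req *request) (*connection, error) {
	return ef.Request(ctx, req)
}

func main() {
	ef := newEventFactory()
	conn, err := beginRequest(context.Background(), ef, &request{id: "nsc-1"})
	fmt.Println(conn.id, err)
}
```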
Question:
May we call `Close()` for the same connection multiple times?
Or do we consider `Close()` only as an addition to `Request()` (that is, the `begin` chain element skips `Close()` if the connection is not ready)? A minimal sketch of the second interpretation follows.
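For illustration only, assuming a hypothetical `beginState` that tracks whether a connection is established: under the second interpretation, `Close()` becomes an idempotent no-op when there is nothing to close.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// beginState is a stand-in for the per-connection state kept by begin.
type beginState struct {
	mu          sync.Mutex
	established bool
}

// Close is treated as an addition to Request: when there is no established
// connection it is a no-op, so calling it multiple times is always safe.
func (s *beginState) Close(ctx context.Context) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.established {
		return nil // nothing to close - skip instead of erroring
	}
	s.established = false
	fmt.Println("closing established connection")
	return nil
}

func main() {
	s := &beginState{established: true}
	_ = s.Close(context.Background()) // closes the connection
	_ = s.Close(context.Background()) // second call is skipped
}
```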
`retry` problems
I've considered adding `retry` after the `begin` chain element and found a few problems:
- `retry` calls `next.Request()` on error. Problem case: if a refresh fails, it keeps retrying until success or context timeout, so other `eventFactory` calls will be blocked by the executor. What if we need to heal the connection in the meantime? (Illustrated in the sketch after this list.)
- `retry` calls the `eventFactory` on error. Let's imagine that the first client Request failed. There are two problems:
  - we need to somehow pass the initial Request to the `eventFactory` (it doesn't have a `networkservice.NetworkServiceRequest` argument) - https://github.com/networkservicemesh/sdk/blob/53a8ec8928f596361d57430ee22f83ea70541e57/pkg/networkservice/common/begin/event_factory.go#L44
  - if we schedule the retry after an error, what should we return to the client?
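To make the first problem concrete, here is a hypothetical illustration of the blocking, using the same simplified one-goroutine executor model as above: the retry loop runs inside a single executor task, and a heal event queued behind it cannot start until the retries give up.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

func main() {
	// The same simplified one-goroutine executor as in the earlier sketches.
	executor := make(chan func(), 16)
	go func() {
		for task := range executor {
			task()
		}
	}()

	// Task 1: a refresh whose next.Request() keeps failing; retry sits in a
	// loop inside the task and holds the executor the whole time.
	executor <- func() {
		for attempt := 1; attempt <= 3; attempt++ {
			err := errors.New("next.Request failed")
			fmt.Println("retry attempt", attempt, "->", err)
			time.Sleep(200 * time.Millisecond) // back-off between attempts
		}
	}

	// Task 2: a heal event queued right behind; it cannot start until the
	// retry loop above has given up.
	start := time.Now()
	done := make(chan struct{})
	executor <- func() {
		fmt.Println("heal event started after", time.Since(start).Round(time.Millisecond))
		close(done)
	}
	<-done
}
```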
Possible solution
In my opinion, the problem here is that we are calling `eventFactory.executor` from two different places:
- from `begin` - https://github.com/networkservicemesh/sdk/blob/v1.12.0/pkg/networkservice/common/begin/client.go#L60-L87
- from `eventFactory` - https://github.com/networkservicemesh/sdk/blob/v1.12.0/pkg/networkservice/common/begin/event_factory.go#L93-L122

I would suggest:
- using the executor in only one place, inside the `eventFactory`; `begin` would then only use the `eventFactory` (without accessing its private fields) (see also https://github.com/networkservicemesh/sdk/issues/1565#issuecomment-1916838489)
- as a next step, adding `retry` as an option to the `eventFactory`. This way we will manage `retry` from one place (a sketch of this option is shown below).
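A hedged sketch of that suggestion: `WithRetry` is a hypothetical option (not an existing sdk option) that configures retries on the factory itself, so every Request driven through the factory gets the same retry behaviour without a separate wrapper chain element.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type requestFunc func(ctx context.Context) error

// eventFactory owns the only execution path; retry is just one of its options.
type eventFactory struct {
	retries int // extra attempts after the first one; 0 means no retry
	backoff time.Duration
}

// WithRetry is the hypothetical option: retry is configured once, on the
// factory, instead of living in its own wrapper chain element.
func WithRetry(attempts int, backoff time.Duration) func(*eventFactory) {
	return func(ef *eventFactory) {
		ef.retries = attempts
		ef.backoff = backoff
	}
}

func newEventFactory(opts ...func(*eventFactory)) *eventFactory {
	ef := &eventFactory{}
	for _, o := range opts {
		o(ef)
	}
	return ef
}

// Request drives the actual request; because retry is a factory option,
// every caller (begin, refresh, heal) gets the same retry behaviour for free.
func (ef *eventFactory) Request(ctx context.Context, do requestFunc) error {
	var err error
	for attempt := 0; attempt <= ef.retries; attempt++ {
		if err = do(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(ef.backoff):
		}
	}
	return err
}

func main() {
	ef := newEventFactory(WithRetry(2, 100*time.Millisecond))
	calls := 0
	err := ef.Request(context.Background(), func(ctx context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("transient failure")
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```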