
[R&D] Improve the recovery of NSM clients

denis-tingaikin opened this issue · 3 comments

We have done a lot of work to improve the recovery of NSM in the past. Recently we've found a few use-cases that conceptually do not work with existing code. I also noticed that the heal chain element could be simplified.

Problem scenarios

Scenario 1:

  1. Create a cluster with two nodes.
  2. Deploy NSM
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc
  6. Kill the forwarders on both nodes during the refresh. Actual: no connection is established after the forwarders are re-deployed.

Scenario 2:

  1. Create a cluster with two nodes.
  2. Deploy NSM
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc
  6. Kill any forwarder and any NSMgr during the refresh. Actual: no connection is established after the forwarders are re-deployed.

Scenario 3:

  1. Create a cluster with two nodes.
  2. Deploy NSM
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc
  6. Kill the NSE during the refresh (on the backward path). Actual: no connection is established after the forwarders are re-deployed.

Scenario 4:

  1. Create a cluster with two nodes.
  2. Deploy NSM
  3. Deploy NSC without DP healing on node 1.
  4. Deploy NSE on node 2.
  5. Deploy NetSvc
  6. Kill the registry during the refresh. Actual: no connection is established after the forwarders are re-deployed.

Solution

Rework a few chain elements and reorganize the code.

Changes

  1. heal chain element (see the sketch after this list):
     1.1. It simply starts the DP/CP monitoring goroutines.
     1.2. If something goes wrong with CP monitoring, a refresh request should be scheduled.
     1.3. If something goes wrong with DP monitoring, a refresh request with reselection should be scheduled.
  2. retry chain element:
     2.1. It should not be a wrapper.
     2.2. It should use begin for retries.
  3. begin should be able to cancel the current request if a reselect request is called.
  4. The RESELECT_REQUESTED state should be removed.
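
A minimal sketch of the heal behavior described in point 1, assuming a hypothetical eventFactory interface and hypothetical monitorCtrlPlane/monitorDataPlane helpers (the real event factory lives in the sdk's begin package):

```go
// Package heal: illustrative sketch only, not the sdk's actual code.
package heal

import "context"

// eventFactory is a stand-in for begin's event factory.
type eventFactory interface {
	// Request schedules a refresh; reselect forces path reselection.
	Request(reselect bool)
}

// startMonitoring launches the two monitoring goroutines from the Changes list:
// a control-plane failure triggers a plain refresh, a data-plane failure
// triggers a refresh with reselection.
func startMonitoring(ctx context.Context, ef eventFactory,
	monitorCtrlPlane, monitorDataPlane func(context.Context) error) {
	go func() {
		if err := monitorCtrlPlane(ctx); err != nil {
			ef.Request(false) // refresh, keep the current path
		}
	}()
	go func() {
		if err := monitorDataPlane(ctx); err != nil {
			ef.Request(true) // refresh with reselection
		}
	}()
}
```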

denis-tingaikin · Dec 11 '23 21:12

Another problem scenario (Scenario 5):

  1. Create a cluster.
  2. Deploy NSM
  3. Deploy NSC and NSE
  4. Deploy NetSvc
  5. Kill the NSE after cmd-nsc-init has finished but before cmd-nsc has finished. Actual: the connection is not restored by cmd-nsc.

Description: After the init-container establishes a connection, it remains unattended for some time until cmd-nsc sends a request. During this short window, any event that disrupts the connection can occur. Monitoring and restoring such connections is apparently the responsibility of the heal chain element.

glazychev-art · Jan 11 '24 08:01

We found that begin.Request() and eventFactory.Request() share very similar logic. Over time the code grows in two places and we end up maintaining two versions. The idea is to call eventFactory.Request() from begin.Request() instead of using eventFactory.executor directly.

Most likely, this can also help with modifying the retry functionality: we can use it from the eventFactory (see Changes step 2 in the Description).
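
A rough sketch of this idea, using simplified placeholder types rather than the sdk's actual ones; the point is only that begin stops touching the factory's private executor and calls the factory's own Request, which also gives the factory a place to keep the client's last request:

```go
// Package begin: illustrative sketch only, not the sdk's actual code.
package begin

import (
	"context"
	"sync"
)

// connection and request are placeholders for the networkservice API types.
type connection struct{}
type request struct{}

type eventFactory struct {
	mu   sync.Mutex // stand-in for the per-connection serialize executor
	next func(ctx context.Context, req *request) (*connection, error)
	req  *request // last client request, reused by refresh/heal events
}

// Request is the single code path that serializes work on the connection.
func (f *eventFactory) Request(ctx context.Context, req *request) (*connection, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if req != nil {
		f.req = req // remember the client's request for later events
	}
	return f.next(ctx, f.req)
}

// clientRequest shows begin delegating to the factory instead of scheduling
// its own closure on the executor.
func clientRequest(ctx context.Context, f *eventFactory, req *request) (*connection, error) {
	return f.Request(ctx, req)
}
```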

Question: May we call Close() for the same connection multiple times? Or do we consider Close() only as an addition to Request() (that is, the begin chain element skips Close() if the connection is not ready)?

glazychev-art · Jan 30 '24 13:01

retry problems

I've considered adding retry after the begin chain element and found a few problems:

  1. retry calls next.Request() on error (problem case: retry-problem1). If a refresh fails, retry keeps retrying until success or until the context times out, so other eventFactory calls are blocked by the executor. What if we need to heal the connection? (See the sketch after this list.)
  2. retry calls eventFactory on error (see retry-Page-3). Suppose the first client Request fails. There are two problems:
  • we need to pass the initial Request to the eventFactory somehow (it does not have a networkservice.NetworkServiceRequest argument) https://github.com/networkservicemesh/sdk/blob/53a8ec8928f596361d57430ee22f83ea70541e57/pkg/networkservice/common/begin/event_factory.go#L44
  • if we schedule the retry after the error, what should we return to the client?
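
A toy illustration of problem 1, with a hypothetical executor type standing in for begin's per-connection executor; it only shows why a retry loop inside next.Request() starves other serialized work, such as a heal event:

```go
// Illustrative toy program, not the sdk's code.
package main

import (
	"context"
	"fmt"
	"time"
)

// executor serializes work on one connection, like begin's per-connection executor.
type executor struct{ work chan func() }

func newExecutor() *executor {
	e := &executor{work: make(chan func(), 16)}
	go func() {
		for f := range e.work {
			f() // tasks run strictly one at a time
		}
	}()
	return e
}

func main() {
	e := newExecutor()
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Task 1: a refresh that fails and retries inside next.Request().
	e.work <- func() {
		for ctx.Err() == nil {
			fmt.Println("refresh failed, retrying...")
			time.Sleep(200 * time.Millisecond)
		}
	}

	// Task 2: a heal event; it cannot start until the retry loop above gives up.
	e.work <- func() { fmt.Println("heal finally runs") }

	time.Sleep(2 * time.Second)
}
```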

Possible solution

In my opinion, the problem here is that we are calling eventFactory.Executor from two different places:

  • from begin - https://github.com/networkservicemesh/sdk/blob/v1.12.0/pkg/networkservice/common/begin/client.go#L60-L87
  • from eventFactory - https://github.com/networkservicemesh/sdk/blob/v1.12.0/pkg/networkservice/common/begin/event_factory.go#L93-L122

I would suggest:

  • using the executor in only one place, inside the eventFactory; begin would only use the eventFactory without accessing its private fields (see also https://github.com/networkservicemesh/sdk/issues/1565#issuecomment-1916838489)
  • as a next step, adding retry as an option to the eventFactory, so that retries are managed from one place (a sketch follows below)
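
A minimal sketch of the second suggestion, with hypothetical names (factory, WithRetry); it only illustrates managing retries inside the eventFactory itself rather than in a separate wrapper element:

```go
// Package begin: illustrative sketch only, not the sdk's actual code.
package begin

import (
	"context"
	"time"
)

type conn struct{}

type factory struct {
	request    func(ctx context.Context) (*conn, error)
	retryDelay time.Duration // 0 disables retries
}

// Option configures the eventFactory.
type Option func(*factory)

// WithRetry is a hypothetical option that enables retries inside the factory.
func WithRetry(delay time.Duration) Option {
	return func(f *factory) { f.retryDelay = delay }
}

// Request retries the underlying request until it succeeds, retries are
// disabled, or the context is done.
func (f *factory) Request(ctx context.Context) (*conn, error) {
	for {
		c, err := f.request(ctx)
		if err == nil || f.retryDelay == 0 {
			return c, err
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(f.retryDelay):
			// try again on the factory's own schedule
		}
	}
}
```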

glazychev-art · Feb 20 '24 07:02