eve icon indicating copy to clipboard operation
eve copied to clipboard

Avoid endless loop in DPCManager.runVerify when we get RemoteTemporaryFailure

Open milan-zededa opened this issue 6 months ago • 8 comments

Description

This PR fixes a bug in DpcManager.runVerify that would cause an infinite loop when a RemoteTemporaryFailure was returned by the ConnectivityTester.

The issue was that DPCManager didn’t flag the DPC as tested when a RemoteTemporaryFailure was received. Such a DPC was then mistakenly evaluated as a new DPC received during testing, and the entire testing procedure was repeated - resulting in an endless cycle.

This commit fixes the issue by flagging the DPC as tested, and actually as successful, reporting the RemoteTemporaryFailure only as a warning, since connectivity itself is working.

There is no need to handle testing retries for RemoteTemporaryFailure explicitly, as they are already performed periodically every 5 minutes (timer.port.testinterval), even when the latest DPC is marked as successful.

How to test and validate this PR

For reproduction, I changed the controller handler /api/v2/edgedevice/ping to return 500. Another option is to install controller certificate which is not valid anymore (current date is > certificate.notAfter). Without the fix, diag would continuously show testing as ongoing:

cat /run/diag.out 
...
WARNING: The configuration below is under test hence might report failures
WARNING: state DPC_REMOTE_WAIT not SUCCESS
...

And running logread -f | grep "DPC verify" would reveal that DPC verification is in a constant loop (logs keep streaming).

With the fix, testing should end with DPC_REMOTE_WAIT:

cat /run/diag.out 
...
INFO: Using highest priority DevicePortConfig key zedagent
WARNING: state DPC_REMOTE_WAIT not SUCCESS
...

And logread -f | grep "DPC verify" should be quiet most of the time (only once per 5 minutes logs should show up for a periodic re-testing).

You can also check that the latest DPC is being used (CurrentIndex is zero) and it is flagged with only warning:

$ cat /persist/status/nim/DevicePortConfigList/global.json | jq
{
  "CurrentIndex": 0,
  "PortConfigList": [
    {
      "Version": 1,
      "Key": "zedagent",
      "TimePriority": "2025-06-27T07:08:40.141226832Z",
      "State": 7,
      "ShaFile": "",
      "ShaValue": null,
      "LastFailed": "0001-01-01T00:00:00Z",
      "LastSucceeded": "2025-06-27T07:08:41.312005178Z",
      "LastError": "",
      "LastWarning": "Remote temporary failure (endpoint: mydomain.adam:3333): All attempts to connect to mydomain.adam:3333/api/v2/edgedevice/ping failed: send via eth0: SendOnIntf to https://mydomain.adam:3333/api/v2/edgedevice/ping reqlen 0 statuscode 500 Internal Server Error body:\n; send via eth2: link not found for interface eth2",
...

Changelog notes

Avoid endless loop when we get 5XX error from the controller

PR Backports

This is quite serious bug, so it should go to every active LTS. We decided to backport this even to 11.0-stable.

  • 14.5-stable
  • 13.4-stable
  • 11.0-stable

Checklist

  • [x] I've provided a proper description
  • [ ] I've added the proper documentation
  • [x] I've tested my PR on amd64 device
  • [ ] I've tested my PR on arm64 device
  • [x] I've written the test verification instructions
  • [x] I've set the proper labels to this PR
  • [x] I've checked the boxes above, or I've provided a good reason why I didn't check them.

milan-zededa avatar Jun 27 '25 08:06 milan-zededa

For reproduction, I changed the controller handler

Isn't there a way to reproduce it without changing the code? When can this situation happen in reality? I think changing code is not an option for external manual verification... Or would you add an Eden test for that?

OhmSpectator avatar Jun 27 '25 08:06 OhmSpectator

For reproduction, I changed the controller handler

Isn't there a way to reproduce it without changing the code? When can this situation happen in reality? I think changing code is not an option for external manual verification... Or would you add an Eden test for that?

Another option is to install controller certificate which is not valid anymore (current date is > certificate.notAfter). For me it was easier to change the code and return 500 - a simulated internal server error, i.e. controller bug, which is also a real case. But tester can try with an expired certificate. I added this option to PR description.

milan-zededa avatar Jun 27 '25 08:06 milan-zededa

Just in case, I already see this:

This branch has conflicts that must be resolved

OhmSpectator avatar Jun 27 '25 11:06 OhmSpectator

Just in case, I already see this:

This branch has conflicts that must be resolved

Fixed

milan-zededa avatar Jun 27 '25 12:06 milan-zededa

LGTM

rene avatar Jun 27 '25 12:06 rene

LGTM

Oh nooo, the actions were cancelled... I guess the problem is that when we leave a comment from the Review section, and it's not an Approve comment, our Actions are triggered (because of the review event), but do not proceed with the tests, as it is not a review approval...

OhmSpectator avatar Jun 27 '25 12:06 OhmSpectator

LGTM

Oh nooo, the actions were cancelled... I guess the problem is that when we leave a comment from the Review section, and it's not an Approve comment, our Actions are triggered (because of the review event), but do not proceed with the tests, as it is not a review approval...

As far as I understand, with workflow-level concurrency that we have in the Eden worflow, any new run for the same pull request immediately cancelled in‑progress runs. A non‑approval review triggers the workflow, but all job steps are gated by github.event.review.state == 'approved' (and we get 'commented'), so the run ends quickly with no tests. Because concurrency is evaluated before these job conditions, that empty run still cancels the running tests.

OhmSpectator avatar Jun 27 '25 12:06 OhmSpectator

LGTM

Oh nooo, the actions were cancelled... I guess the problem is that when we leave a comment from the Review section, and it's not an Approve comment, our Actions are triggered (because of the review event), but do not proceed with the tests, as it is not a review approval...

but exactly this time I didn't approve because I thought it would cancel the triggers, how come?

rene avatar Jun 27 '25 13:06 rene

There is a difficult and unusual case to think about. A proxy configured for one management port and the proxy returns ECON REFUSED. But an older DPC is using a different management port (eg wwan0) without a proxy or the same management port and a different proxy config. In that case we do need to try the older lower priority DPC at some point in time. With these changes will we do that? Trying after 5 minutes might be ok.

What you're suggesting was never part of the implemented logic, but it could be a useful enhancement for the future. Currently (and also with this patch) we treat a DPC returning ECONNREFUSED as successful from connectivity standpoint, so we don’t fall back to older DPCs. Instead, we keep retrying the same DPC periodically until the remote error resolves. The purpose of this patch is to prevent immediate retries and NIM getting stuck in a loop, eventually causing watchdog to reboot the device. Instead, we now rely on timer-triggered retries.

milan-zededa avatar Jun 29 '25 07:06 milan-zededa

Milan, do you think we are good to merge? I saw all tests green.

OhmSpectator avatar Jun 29 '25 08:06 OhmSpectator

Milan, do you think we are good to merge? I saw all tests green.

Yes, PR is ready for merge.

milan-zededa avatar Jun 29 '25 21:06 milan-zededa

Yay! Waiting for the backports now

OhmSpectator avatar Jun 30 '25 08:06 OhmSpectator

Yay! Waiting for the backports now

I will backport this together with https://github.com/lf-edge/eve/pull/5026 (except for 11.0) since git is not able to merge their tests without conflicts.

milan-zededa avatar Jun 30 '25 11:06 milan-zededa