openyurt

[BUG] edgex-device-virtual pod error

Open zijin520 opened this issue 1 year ago • 6 comments

What happened: edgex-device-virtual failed to deploy.

kubectl logs edgex-device-virtual-xian-lfrts-779bf745f6-vh5x5

level=ERROR ts=2023-12-25T07:58:57.933476281Z app=device-virtual source=clients.go:74 msg="unable to Get service endpoint for 'core-metadata': no matching service endpoint found. Giving up"

What you expected to happen: edgex-device-virtual runs successfully

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • OpenYurt version: v1.4.0
  • Kubernetes version (use kubectl version): v1.26.11
  • OS (e.g: cat /etc/os-release): ubuntu18.04
  • Kernel (e.g. uname -a): Linux master 5.4.0-152-generic
  • Install tools:
  • Others:

others

/kind bug

zijin520 avatar Dec 25 '23 08:12 zijin520

@LavenderQAQ PTAL

rambohe-ch avatar Dec 26 '23 02:12 rambohe-ch

Hi, could you please send the relevant configuration files, including the deployed version, etc., to help me reproduce the issue?

Rui-Gan avatar Dec 26 '23 14:12 Rui-Gan

Thanks for your attention. I suspect it's a redis or core-metadata connection problem. I deleted and recreated the components many times, and now all the pods are normal.

But there is still another question I want to confirm with you. According to the guide "https://openyurt.io/zh/docs/next/user-manuals/iot/edgex-foundry", deviceservice, device, and deviceprofile resources will be created automatically after edgex-device-virtual is deployed, but I cannot see them in my cluster.

$ kubectl get device
No resources found in default namespace.
$ kubectl get deviceprofile
No resources found in default namespace.

zijin520 avatar Dec 27 '23 01:12 zijin520

@zijin520

  • For the device and deviceprofile issues: this is caused by the yurt-iot-dock version. Our API support for edgex v3 is not stable yet, and the latest version has some issues after an update (this will be resolved once PR #1850, "feat: support v3 rest api client for edgex v3 api", is merged). You are advised to use the stable v2 version.
  • For the device-virtual connection loss problem: we found that it was caused by edgex-core-metadata not being registered with consul, which may be due to startup order problems. We need some time to find the root cause. The easiest workaround for now is to just kill metadata or consul.
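
A minimal sketch of that workaround, for anyone who wants to script it. It only builds the `kubectl delete pod` commands and prints them (a dry run) instead of executing them. The pod-name prefixes `edgex-core-metadata` and `edgex-core-consul`, the example pod names, and the `default` namespace are assumptions for illustration; check `kubectl get pods` in your own cluster first.

```python
import shlex

# Assumed pod-name prefixes; adjust them to the names shown by `kubectl get pods`.
DEFAULT_PREFIXES = ("edgex-core-metadata", "edgex-core-consul")


def build_delete_commands(pod_names, prefixes=DEFAULT_PREFIXES, namespace="default"):
    """Return one `kubectl delete pod` command per pod whose name starts with a prefix."""
    commands = []
    for name in pod_names:
        if name.startswith(tuple(prefixes)):
            commands.append(["kubectl", "delete", "pod", name, "-n", namespace])
    return commands


if __name__ == "__main__":
    # Hypothetical pod names, used only to show the output format.
    pods = [
        "edgex-core-metadata-xian-aaaaa-1111111111-bbbbb",
        "edgex-device-virtual-xian-lfrts-779bf745f6-vh5x5",
    ]
    for cmd in build_delete_commands(pods):
        print(shlex.join(cmd))  # print only; run the command manually once it looks right
```

Deleting the pod lets its controller recreate it, so the restarted component registers again.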

LavenderQAQ avatar Dec 27 '23 14:12 LavenderQAQ

Thanks for your reply, very clear!

zijin520 avatar Dec 28 '23 01:12 zijin520

The root cause is that yurtappset might launch multiple identical components (for example, two consuls or two core-data pods) in the same nodepool, and eventually only one is guaranteed to remain (because the YAML file specifies one).

This leads to a lot of complications. In the serious case, core-data registers with consul A but consul B is the one that is retained. core-data does not actually notice this (it only thinks: my consul is disconnected, I should keep retrying until my consul restarts). There is a 50% chance of this happening to any of the edgex components, which is pretty bad because they are not aware of it and the pod state is still Running, so we have to rebuild them manually.

I might need to open another issue to explain and track this down. If starting two running pods at the same time is the behavior of Deployment, solving this might be difficult (it might require making sure consul is the component that gets started first). It would also require looking into the registration logic of the various edgex components, which I would elaborate on in that issue.

Currently, this problem only occurs at initialization time and can be solved by manually deleting and rebuilding the relevant components. /cc @rambohe-ch @zyjhtangtang
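
The 50% figure can be reproduced with a small enumeration. This is only a sketch under simplifying assumptions that are not confirmed by the code base: exactly two consul replicas start, a component registers with either one with equal probability, and the replica that survives is chosen independently and uniformly. The names `consul-A` and `consul-B` are placeholders.

```python
from fractions import Fraction
from itertools import product


def stale_registration_probability(replicas=("consul-A", "consul-B")):
    """Probability that a component registered with a replica that is later removed.

    Each outcome is a pair (registered_with, retained); all pairs are assumed
    equally likely.
    """
    outcomes = list(product(replicas, repeat=2))
    stale = [pair for pair in outcomes if pair[0] != pair[1]]
    return Fraction(len(stale), len(outcomes))


if __name__ == "__main__":
    print(stale_registration_probability())
```

With two replicas, two of the four equally likely outcomes leave the component registered with the consul that was removed, which matches the estimate above. Under the same assumptions, more duplicate replicas would make the mismatch more likely.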

LavenderQAQ avatar Feb 03 '24 12:02 LavenderQAQ

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 03 '24 18:05 stale[bot]

This issue has been completely solved in #2029.

LavenderQAQ avatar May 08 '24 09:05 LavenderQAQ