federated-learning-job run error: http://yolo-v5-aggregation.default:7363 connection failed

Open victorming666 opened this issue 1 year ago • 19 comments

I rebuilt the Docker images for the federated learning job. The pods start OK on both the cloud node and the edge nodes: image

The pod on the cloud node: image

But the pod on the edge node gives errors: image Can anybody help? Many thanks!

victorming666 avatar Dec 26 '24 06:12 victorming666

This is a DNS failure on a k8s+kubeedge+edgemesh+sedna cluster. Cluster info:

  1. kubernetes: 1.24.16
  2. kubeedge: 1.13.0
  3. edgemesh: 1.13.0
  4. sedna: 0.6.0

Two cloud nodes and two edge nodes: image

Why can't the edge nodes find the DNS server of the cloud node? I tested with edgemesh's tcp-echo example and it works!
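One way to narrow this down is to separate "the name does not resolve" from "the name resolves but the port is unreachable". The following is a minimal sketch using only the Python standard library, meant to be run from inside the failing edge pod. The function name `check_endpoint` is made up for illustration; the host and port in the usage note come from the error in the issue title.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Report whether `host` resolves and whether a TCP connection to `port` succeeds."""
    result = {
        "host": host,
        "port": port,
        "addresses": [],
        "dns_ok": False,
        "tcp_ok": False,
        "error": None,
    }
    try:
        # Step 1: name resolution only. A failure here points at DNS.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"DNS lookup failed: {exc}"
        return result
    result["dns_ok"] = True
    result["addresses"] = sorted({info[4][0] for info in infos})
    try:
        # Step 2: the name resolved, so a failure here points at routing or the service.
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"
    return result
```

Calling `check_endpoint("yolo-v5-aggregation.default", 7363)` in the edge pod and again in a cloud pod should show which of the two steps differs between the nodes.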

victorming666 avatar Dec 26 '24 07:12 victorming666

Is this project dead? Why no replies for all these issues?

victorming666 avatar Dec 26 '24 07:12 victorming666

BTW, the cluster is OK, as edgemesh's 'cloud-edge echo' test case passes.

Cloud calls edge: image

Edge calls cloud: image

victorming666 avatar Dec 26 '24 07:12 victorming666

I finally got this test to run OK. Here are the logs of the cloud node: image

And here is the log of one of the edge nodes: image

Lots of tough stuff...

victorming666 avatar Dec 27 '24 01:12 victorming666

Is this project dead? Why no replies for all these issues?

Congratulations on another successful bug fix. Deploying a complicated system like OpenStack, K8S, KubeEdge, or Sedna is usually done for real-world cloud services and tackled by professional experts in large enterprises; it is indeed not an easy task for newcomers.

We understand that one might be confronted with urgent issues, but when participating in the KubeEdge Community, one should try to be understanding and show respect to others, following the code of conduct. Experts usually have important duties in their companies, and it is infeasible to expect a 24-hour on-call reply from them, e.g., within two hours as in this case. Successful deployers are encouraged to submit blogs or documents, which are highly appreciated and help members use Sedna within this community.

MooreZheng avatar Dec 27 '24 10:12 MooreZheng

BTW, what would be the opinion from @tangming1996 and @SherlockShemol : could there be any chance that this issue is related to the recent merge of https://github.com/kubeedge/sedna/pull/446 ?

MooreZheng avatar Dec 27 '24 10:12 MooreZheng

In my opinion it's not related to the recent PR. When I initially deployed Sedna applications like joint inference and federated learning, I seem to have encountered the same DNS problem, which is caused by edgemesh. I solved it by referring to a Q&A manual on zhihu. Hope it helps.

SherlockShemol avatar Dec 28 '24 04:12 SherlockShemol

@MooreZheng @SherlockShemol @tangming1996 Thank you for all you have done on this project, as AI running on edge devices has a promising future. I hope this project keeps evolving. But I find it hard to use Sedna in my own project, as there is little documentation for app developers. A tutorial for app development would be very helpful.

victorming666 avatar Dec 28 '24 06:12 victorming666

@MooreZheng @SherlockShemol @tangming1996 Thank you for all you have done on this project, as AI running on edge devices has a promising future. I hope this project keeps evolving. But I find it hard to use Sedna in my own project, as there is little documentation for app developers. A tutorial for app development would be very helpful.

Can you provide more details about your project? Does it fit within the frameworks Sedna currently provides (joint inference, federated learning, etc.)? I am still a beginner in the Sedna project, so I may not be able to give you an answer immediately. However, during my learning process I will pay attention to the parts you mentioned, and maybe someday I will improve them.

SherlockShemol avatar Dec 28 '24 07:12 SherlockShemol

BTW, what would be the opinion from @tangming1996 and @SherlockShemol : could there be any chance that this issue is related to the recent merge of #446 ?

It shouldn't matter, as our new feature hasn't been released yet.

tangming1996 avatar Dec 31 '24 02:12 tangming1996

@SherlockShemol We're working on projects that deploy AI models on edge devices (rk3568), but the network is not stable, and we don't want to share some of the data with the cloud. So we turned to kubeedge and sedna. But these frameworks seem rather difficult to use, as the demos are just toy-like examples. That's why we want help from you all.

victorming666 avatar Dec 31 '24 09:12 victorming666

It seems that the network of one edge node has a problem. First confirm that the edgemesh-agent status is normal, then confirm that all test cases pass. If there is still a problem, compare the configuration of the two edge nodes for inconsistencies, since the network of the other edge node is normal.
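For the comparison step, a plain text diff of the two nodes' configuration is often enough to spot a mismatch. Here is a minimal sketch with Python's standard `difflib`; the function name `diff_configs` and the default node labels are placeholders, and which configuration files to compare is left to the reader.

```python
import difflib


def diff_configs(text_a, text_b, name_a="edge-node-1", name_b="edge-node-2"):
    """Return unified-diff lines between two config texts (empty list if identical)."""
    return list(
        difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            fromfile=name_a,  # label only; not read from disk
            tofile=name_b,
            lineterm="",
        )
    )
```

Read the two configuration files into strings (for example with `pathlib.Path(...).read_text()`), pass them in, and print the returned lines; an empty list means the texts are identical.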

tangming1996 avatar Dec 31 '24 09:12 tangming1996

@tangming1996 Yeah, only one edge node runs to completion and the other edge node hangs with an error. But there is no aggregated model output on the cloud node. I don't know what's wrong with this demo, as the logs seem OK. It has already taken us almost 2 weeks to test these demos. Now we are trying to integrate Sedna into our AIoT project, and we are facing the issues below:

  1. How to integrate Sedna into an AI RESTful web service framework like Flask or FastAPI?
  2. How to run this Sedna AI service so that it can be deployed on a K8s+kubeedge+edgemesh+sedna cluster?
  3. How to observe the running progress and show the output to a customer, who can then verify that federated learning, cloud-edge co-inference, or incremental learning works?

In short, are there any production-level use cases of Sedna?

victorming666 avatar Jan 01 '25 15:01 victorming666

@victorming666 Aggregation on the cloud is triggered only when the models of all edge nodes have been trained successfully. Because some nodes in your environment have problems and cannot upload their models to the cloud, the cloud is unable to complete the aggregation process.
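To illustrate why a single failing edge node blocks the output, here is a minimal sketch of synchronous, sample-weighted averaging. It is not Sedna's actual aggregation code; the class `SyncAggregator` and its methods are hypothetical, and model weights are simplified to flat lists of floats.

```python
class SyncAggregator:
    """Collect one update per expected client; aggregate only when all have arrived."""

    def __init__(self, expected_clients):
        self.expected = set(expected_clients)
        self.updates = {}  # client_id -> (weights, num_samples)

    def missing(self):
        """Clients that have not uploaded a model for the current round."""
        return sorted(self.expected - set(self.updates))

    def submit(self, client_id, weights, num_samples):
        """Store an update; return the averaged weights once every client has reported."""
        if client_id not in self.expected:
            raise ValueError(f"unexpected client: {client_id}")
        self.updates[client_id] = (list(weights), num_samples)
        if self.missing():
            return None  # still waiting, so no aggregated model yet
        total = sum(n for _, n in self.updates.values())
        size = len(weights)
        averaged = [
            sum(w[i] * n for w, n in self.updates.values()) / total
            for i in range(size)
        ]
        self.updates.clear()  # start the next round from scratch
        return averaged
```

With two expected clients, the first `submit` returns `None` and `missing()` names the node that has not reported; if that node never uploads, the round never completes, which matches the symptom described above.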

tangming1996 avatar Jan 02 '25 02:01 tangming1996

@victorming666
video: https://www.bilibili.com/video/BV1hg4y1b78L
article: https://github.com/jaypume/article/blob/main/sedna/%E8%BE%B9%E4%BA%91%E5%8D%8F%E5%90%8CAI%E6%A1%86%E6%9E%B6Sedna%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90/README.MD

This is a public lecture by Mr. jaypume. Through it, you can get an overview of Sedna. If you want to create applications with Sedna, you can directly go to the Sedna Lib Source Code Analysis section. Hope it helps.

SherlockShemol avatar Jan 09 '25 15:01 SherlockShemol

@SherlockShemol Thank you very much! We are digging into the Sedna code to figure out how to integrate it into an AIoT project. This helps a lot.

victorming666 avatar Jan 10 '25 01:01 victorming666

@SherlockShemol I'm reviewing the source code of the joint-inference helmet detection case. I have the following questions regarding app development based on the Sedna framework:

  1. I have a yolov5 object detection algorithm which can be deployed on amd64+gpu and aarch64 edge devices. This algorithm has been wrapped into a RESTful web service, which takes images as input and outputs a JSON result with the object class, bbox, and confidence score;
  2. How can we use Sedna to let our customized yolov5 algorithm do joint inference dynamically when the confidence score is below the 0.85 threshold on the edge device? I'm trying to make our own Estimator object and use it to create an inference instance on both the cloud node and the edge node. I suppose these Estimator objects should be the same, but I'm wondering why big_model/interface.py and little_model/interface.py are so different in your demo. Are there any tutorials or guides to follow for using Sedna with our customized algorithm?
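The routing behaviour described in item 2 can be sketched independently of Sedna. The code below is only an illustration under assumptions, not Sedna's API: `joint_infer`, `edge_predict`, and `cloud_predict` are hypothetical names, each detection is assumed to be a dict with a `score` key, and treating an empty edge result as low confidence is a design choice made here, not something stated in the thread.

```python
def joint_infer(image, edge_predict, cloud_predict, threshold=0.85):
    """Run the edge model first; fall back to the cloud model on low confidence."""
    detections = edge_predict(image)
    # Assumption: an empty result or any detection below the threshold
    # counts as "not confident", so the image is sent to the cloud model.
    confident = bool(detections) and min(d["score"] for d in detections) >= threshold
    if confident:
        return {"source": "edge", "detections": detections}
    return {"source": "cloud", "detections": cloud_predict(image)}
```

Keeping the two predictors as plain callables means the same routing function works whether the cloud side is a local model or an HTTP call to the existing web service.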

victorming666 avatar Jan 16 '25 03:01 victorming666

Sorry I focus on kubernetes and know little about jointinference, maybe we can turn to Mr. @tangming1996 for help.

SherlockShemol avatar Jan 16 '25 03:01 SherlockShemol

OK, I'll try to apply Sedna's joint-inference pattern to our AI inference web service and test whether it works. I suppose the key parts are JointInference on the edge node and BigModelService on the cloud node. The Estimator on both sides should be the same: it just loads the AI inference model (.pt on x86 and .bin on arm64) and does the prediction. Sedna would handle the rest, such as the threshold config and communication via EdgeMesh, to coordinate the joint inference.
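If the same Estimator code runs on both sides, the model file can be chosen from the machine architecture at load time. This is a small sketch of that idea only; `select_model_path`, the `MODEL_SUFFIX` table, and the `yolov5` file stem are hypothetical, and the suffix mapping simply follows the ".pt on x86 and .bin on arm64" split mentioned above.

```python
import platform
from pathlib import Path

# Hypothetical mapping from platform.machine() values to model file suffixes.
MODEL_SUFFIX = {
    "x86_64": ".pt",
    "amd64": ".pt",
    "aarch64": ".bin",
    "arm64": ".bin",
}


def select_model_path(model_dir, stem="yolov5", machine=None):
    """Pick the model file for the current (or given) CPU architecture."""
    machine = (machine or platform.machine()).lower()
    try:
        suffix = MODEL_SUFFIX[machine]
    except KeyError:
        raise ValueError(f"no model format configured for architecture {machine!r}") from None
    return Path(model_dir) / f"{stem}{suffix}"
```

An Estimator's load step could call this once and hand the resulting path to whichever runtime matches the suffix.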

victorming666 avatar Jan 16 '25 14:01 victorming666