
Deployment problem with Example Four, federated learning surface defect detection: edge worker keeps restarting

Open JasonNing96 opened this issue 3 years ago • 9 comments

  1. I have a question about deploying the dataset: is the command run on the cloud node? [image]
  2. My surface-defect-detection-train- pod keeps restarting and erroring on edge1 and edge2. [image] The pod logs show: [image] And the docker logs show: [image] The other pods are working, but the train worker is down. The server seems to be running: [image]

JasonNing96 avatar Oct 29 '21 06:10 JasonNing96

By the way, I changed the version from v0.3.0 because my Docker images were built as v0.4.0:

image

JasonNing96 avatar Oct 29 '21 07:10 JasonNing96

Here is the YAML I used.

kubectl create -f - <<EOF
apiVersion: sedna.io/v1alpha1
kind: FederatedLearningJob
metadata:
  name: surface-defect-detection
spec:
  aggregationWorker:
    model:
      name: "surface-defect-detection-model"
    template:
      spec:
        nodeName: $CLOUD_NODE
        containers:
          - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-aggregation:v0.4.0
            name: agg-worker
            imagePullPolicy: IfNotPresent
            env: # user defined environments
              - name: "exit_round"
                value: "3"
            resources: # user defined resources
              limits:
                memory: 2Gi
  trainingWorkers:
    - dataset:
        name: "edge1-surface-defect-detection-dataset"
      template:
        spec:
          nodeName: $EDGE1_NODE
          containers:
            - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0
              name: train-worker
              imagePullPolicy: IfNotPresent
              env: # user defined environments
                - name: "batch_size"
                  value: "32"
                - name: "learning_rate"
                  value: "0.001"
                - name: "epochs"
                  value: "2"
              resources: # user defined resources
                limits:
                  memory: 2Gi
    - dataset:
        name: "edge2-surface-defect-detection-dataset"
      template:
        spec:
          nodeName: $EDGE2_NODE
          containers:
            - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0
              name: train-worker
              imagePullPolicy: IfNotPresent
              env: # user defined environments
                - name: "batch_size"
                  value: "32"
                - name: "learning_rate"
                  value: "0.001"
                - name: "epochs"
                  value: "2"
              resources: # user defined resources
                limits:
                  memory: 2Gi
EOF

JasonNing96 avatar Oct 29 '21 07:10 JasonNing96

@JoeyHwong-gk

llhuii avatar Oct 29 '21 07:10 llhuii

@JasonNing96 try newer version: v0.4.2

llhuii avatar Oct 29 '21 07:10 llhuii

I followed the online install page, so it should be the latest version, right? Otherwise I will try the local install. [image]

JasonNing96 avatar Oct 29 '21 08:10 JasonNing96

I mean try the example version v0.4.2.

I just tried kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0 and it is OK, but the image kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.2@sha256:47fd842ce9947 reported the following error:

[INFO][08:27:05]: Client: simple
[INFO][08:27:05]: Trainer: basic
[INFO][08:27:05]: Algorithm: fedavg
Traceback (most recent call last):
  File "train.py", line 60, in <module>
    main()
  File "train.py", line 57, in main
    fl_model.run()
AttributeError: 'FederatedLearningV2' object has no attribute 'run'

llhuii avatar Oct 29 '21 08:10 llhuii

By the way, I changed the version from v0.3.0 because my Docker images were built as v0.4.0:

image

I think you don't need to build the example image yourself.

llhuii avatar Oct 29 '21 08:10 llhuii

I mean try the example version v0.4.2.

I just tried kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0 and it is OK, but the image kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.2@sha256:47fd842ce9947 reported the following error:

[INFO][08:27:05]: Client: simple
[INFO][08:27:05]: Trainer: basic
[INFO][08:27:05]: Algorithm: fedavg
Traceback (most recent call last):
  File "train.py", line 60, in <module>
    main()
  File "train.py", line 57, in main
    fl_model.run()
AttributeError: 'FederatedLearningV2' object has no attribute 'run'

@jaypume @XinYao1994 please take a look

llhuii avatar Oct 29 '21 08:10 llhuii

Maybe fl_model.train() should be used here instead of fl_model.run(); we will fix it ASAP.
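
Until a fixed image is available, the idea can be sketched roughly as below. This is only an illustration, not the actual patch: FederatedLearningV2Stub and start_training are hypothetical names standing in for the object in the traceback, and the real train() in Sedna may require arguments (for example the training data) that are omitted here.

```python
class FederatedLearningV2Stub:
    """Hypothetical stand-in for the object in the traceback.

    Like the object that raised the AttributeError, it has train() but no run().
    """

    def train(self):
        return "training started"


def start_training(fl_model):
    """Call run() when the object provides it, otherwise fall back to train()."""
    entry = getattr(fl_model, "run", None)
    if not callable(entry):
        entry = getattr(fl_model, "train", None)
    if not callable(entry):
        raise AttributeError(
            f"{type(fl_model).__name__} exposes neither run() nor train()"
        )
    return entry()


result = start_training(FederatedLearningV2Stub())
print(result)
```

In train.py itself the simpler change is probably to replace the fl_model.run() call on line 57 with fl_model.train(...), passing whatever arguments that method expects in the installed Sedna version.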

jaypume avatar Oct 29 '21 08:10 jaypume