LC throws a panic: runtime error: invalid memory address or nil pointer dereference
What happened:
In the incremental-learning example, LC panics and exits after the dog-croissants-classification training worker completes.
[root@board2 ~]# docker logs -f --tail 100 k8s_lc_lc-rbp8g_sedna_b1fe2038-9743-467d-add0-e0b4bc46714f_0
………………
I0111 02:17:01.570555 1 incrementallearningjob.go:208] job(default/incrementallearningjob/incremental-learning-dog-croissants-classification) completed the Training phase triggering task successfully
E0111 02:17:51.467939 1 incrementallearningjob.go:472] job(default/incrementallearningjob/incremental-learning-dog-croissants-classification) failed to complete the Eval task: failed to sync deploy model, and waiting it: not exists model(name=default/model/incremental-learning-deploy-model-dog-croissants-classification)
I0111 02:18:01.468495 1 panic.go:965] incremental learning job(default/incrementallearningjob/incremental-learning-dog-croissants-classification) was stopped
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x14884b9]

goroutine 46167 [running]:
github.com/kubeedge/sedna/pkg/localcontroller/managers/incrementallearning.(*Manager).triggerEvalTask(0xc0003e4440, 0xc00022c400, 0x0, 0x0, 0x4aedb3)
	/code/pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go:864 +0x79
github.com/kubeedge/sedna/pkg/localcontroller/managers/incrementallearning.(*Manager).evalTask(0xc0003e4440, 0xc00022c400, 0x0, 0x0)
	/code/pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go:250 +0x394
github.com/kubeedge/sedna/pkg/localcontroller/managers/incrementallearning.(*Manager).startJob(0xc0003e4440, 0xc000222540, 0x51)
	/code/pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go:459 +0x611
created by github.com/kubeedge/sedna/pkg/localcontroller/managers/incrementallearning.(*Manager).Insert
	/code/pkg/localcontroller/managers/incrementallearning/incrementallearningjob.go:500 +0x3ae
[root@board2 ~]#
Logs of the training worker:
[root@board2 ~]# docker logs -f k8s_train-worker_incremental-learning-dog-croissants-classification-train-wt7nl_default_55ae2bd9-802b-4c07-b588-1dcfa69d3de9_0
/home/lib/sedna/backend/init.py:50: UserWarning: MINDSPORE Not Support yet, use itself
  warnings.warn(f"{backend_type} Not Support yet, use itself")
[WARNING] ME(1:139923579950592,MainProcess):2023-01-11-02:17:06.685.1 [mindspore/train/serialization.py:714] For 'load_param_into_net', 2 parameters in the 'net' are not loaded, because they are not in the 'parameter_dict', please check whether the network structure is consistent when training and loading checkpoint.
[WARNING] ME(1:139923579950592,MainProcess):2023-01-11-02:17:06.696.8 [mindspore/train/serialization.py:716] head.classifier.weight is not loaded.
[WARNING] ME(1:139923579950592,MainProcess):2023-01-11-02:17:06.702.1 [mindspore/train/serialization.py:716] head.classifier.bias is not loaded.
[WARNING] ME(1:139923579950592,MainProcess):2023-01-11-02:17:06.128.592 [mindspore/train/model.py:1078] For ValAccMonitor callback, {'epoch_end', 'end'} methods may not be supported in later version, Use methods prefixed with 'on_train' or 'on_eval' instead when using customized callbacks.
num_parallel_workers=2
train_dataset_url /home/data/sedna/incremental_learning/dog_croissants_classification/dataset/data/dog_croissants/train
valid_dataset_urlL : /home/data/sedna/incremental_learning/dog_croissants_classification/dataset/data/dog_croissants/val
Delete parameter from checkpoint: head.classifier.weight
Delete parameter from checkpoint: head.classifier.bias
Delete parameter from checkpoint: moments.head.classifier.weight
Delete parameter from checkpoint: moments.head.classifier.bias
Epoch: [ 1 / 1], Train Loss: [0.699], Accuracy: 0.875
Train epoch time: 44355.729 ms, per step time: 343.843 ms
End of validation the best Accuracy is: 0.875, save the best ckpt file in /home/data/sedna/incremental_learning/dog_croissants_classification/output/train/1/best.ckpt
train_phase_done
What you expected to happen:
After the training worker completes, LC starts the Eval task instead of panicking.
How to reproduce it (as minimally and precisely as possible):
Use the main branch and edit incremental-learning-dog-croissants-classification.Dockerfile as follows:
FROM mindspore/mindspore-cpu:1.7.1 -> FROM mindspore/mindspore-cpu:1.9.0
Then build the image and run the incremental-learning example. The panic reproduces easily.
Anything else we need to know?:
[root@board1 ~]# kubectl get dataset incremental-learning-dataset-dog-croissants-classification -oyaml
apiVersion: sedna.io/v1alpha1
kind: Dataset
metadata:
  creationTimestamp: "2023-01-06T07:43:00Z"
  generation: 1
  name: incremental-learning-dataset-dog-croissants-classification
  namespace: default
  resourceVersion: "2324607"
  uid: 5d7d9e4d-a4ab-4f94-b7eb-a97951d9f62f
spec:
  format: txt
  nodeName: board2
  url: /sedna/incremental_learning/dog_croissants_classification/dataset/data/dog_croissants/train_data.txt
status:
  numberOfSamples: 470
  updateTime: "2023-01-11T02:28:09Z"
Environment:
Sedna Version
$ kubectl get -n sedna deploy gm -o jsonpath='{.spec.template.spec.containers[0].image}'
kubeedge/sedna-gm:v0.5.1
$ kubectl get -n sedna ds lc -o jsonpath='{.spec.template.spec.containers[0].image}'
kubeedge/sedna-lc:v0.5.1
Kubernetes Version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:57:25Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
KubeEdge Version
$ cloudcore --version
Version: v1.12.1
$ edgecore --version
Version: v1.12.1