mindyolo
Model loading fails when training YOLOv7 on the dataset provided in the tutorial.
1. Environment: ModelArts (image mindspore_1.10.0-cann_6.0.1-py_3.7-euler_2.8.3), EulerOS 2.0 (SP8), CANN-6.0.1, mindspore1.10, mindyolo r0.1.
2. Dataset preparation and training procedure: https://github.com/mindspore-lab/mindyolo/blob/master/examples/finetune_SHWD/README.md
3. The following error occurs during training: RuntimeError: For 'load_param_into_net', model.model.77.m.0.weight in the argument 'net' should have the same shape as model.model.77.m.0.weight in the argument 'parameter_dict'. But got its shape (21, 128, 1, 1) in the argument 'net' and shape (255, 128, 1, 1) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.
How can this be resolved? [Note: the log file is attached: outputlog.txt]
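1. Environment: ModelArts (image mindspore_1.9.0-cann_6.0.0-py_3.7-euler_2.8.3), EulerOS 2.0 (SP8), CANN-6.0.RC1, mindyolo r0.1.
2. I built my own dataset following the custom-dataset procedure on the master branch (https://github.com/mindspore-lab/mindyolo/tree/master/examples/finetune_SHWD) and trained the model with it; training used the mindyolo_r0.1 branch.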
3. Config file:
__BASE__: [
  '/home/ma-user/work/mindyolo-r0.1/configs/yolov8/yolov8n.yaml',
]
per_batch_size: 16 # 16 * 8 = 128
img_size: 640 # image sizes
weight: /home/ma-user/work/mindyolo-r0.1/pre-ckpt/yolov8-n_500e_mAP372-cc07f5bd.ckpt
strict_load: False
data:
  dataset_name: shwd
  train_set: /home/ma-user/work/mindyolo-r0.1/dataset-test/SHWD/train.txt
  val_set: /home/ma-user/work/mindyolo-r0.1/dataset-test/SHWD/val.txt
  test_set: /home/ma-user/work/mindyolo-r0.1/dataset-test/SHWD/val.txt
  nc: 3
  names: [ 'helmet', 'gloves', 'shawl' ]
optimizer:
  lr_init: 0.001 # initial learning rate
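3. Errors occur during training, at the model-loading step. Training from the yolov8n, yolov7-tiny, and yolov5n pretrained models all fails with a model-loading error: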
yolov8n:
[CRITICAL] ME(22510:281472828627520,MainProcess):2024-03-02-14:35:39.167.353 [mindspore/train/serialization.py:112] Failed to combine the net and the parameters for param model.model.22.cv3.0.0.conv.weight.
Traceback (most recent call last):
  File "train.py", line 290, in <module>
yolov7-tiny:
[CRITICAL] ME(44701:281473522788928,MainProcess):2024-03-02-14:51:05.733.431 [mindspore/train/serialization.py:112] Failed to combine the net and the parameters for param model.model.77.m.0.weight.
Traceback (most recent call last):
  File "train.py", line 290, in <module>
yolov5n:
[CRITICAL] ME(49608:281473261058624,MainProcess):2024-03-02-14:53:49.428.28 [mindspore/train/serialization.py:112] Failed to combine the net and the parameters for param model.model.24.m.0.weight.
Traceback (most recent call last):
  File "train.py", line 290, in <module>
4. With no pretrained model enabled, the training program runs:
2024-03-02 15:21:02,162 [INFO] Epoch 6/300, Step 39/39, step time: 1896.07 ms
2024-03-02 15:21:02,871 [INFO] Saving model to ./runs/2024.03.02-15:10:50/weights/yolov5n_shwd-6_39.ckpt
2024-03-02 15:21:02,872 [INFO] Epoch 6/300, epoch time: 1.24 min.
2024-03-02 15:22:16,137 [WARNING] overflow, still update, loss scale adjust to 1024.0
2024-03-02 15:22:16,147 [INFO] Epoch 7/300, Step 39/39, imgsize (640, 640), loss: 0.2346, lbox: 0.0723, lobj: 0.0548, lcls: 0.1075, cur_lr: 0.0009768999880179763
2024-03-02 15:22:16,149 [INFO] Epoch 7/300, Step 39/39, step time: 1878.87 ms
2024-03-02 15:22:16,761 [INFO] Saving model to ./runs/2024.03.02-15:10:50/weights/yolov5n_shwd-7_39.ckpt
2024-03-02 15:22:16,762 [INFO] Epoch 7/300, epoch time: 1.23 min.
2024-03-02 15:23:31,967 [WARNING] overflow, still update, loss scale adjust to 1024.0
2024-03-02 15:23:31,977 [INFO] Epoch 8/300, Step 39/39, imgsize (640, 640), loss: 0.2195, lbox: 0.0681, lobj: 0.0481, lcls: 0.1034, cur_lr: 0.0009736000210978091
2024-03-02 15:23:31,979 [INFO] Epoch 8/300, Step 39/39, step time: 1928.60 ms
2024-03-02 15:23:32,630 [INFO] Saving model to ./runs/2024.03.02-15:10:50/weights/yolov5n_shwd-8_39.ckpt
2024-03-02 15:23:32,631 [INFO] Epoch 8/300, epoch time: 1.26 min.
Could you advise how to resolve the pretrained model failing to load?
yolov7-tiny: RuntimeError: For 'load_param_into_net', model.model.77.m.0.weight in the argument 'net' should have the same shape as model.model.77.m.0.weight in the argument 'parameter_dict'. But got its shape (24, 128, 1, 1) in the argument 'net' and shape (255, 128, 1, 1) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.
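I am training with 3 classes (3*(3+5)=24), while the pretrained model has 80 classes (3*(80+5)=255). That is what makes the shapes inconsistent. However, I modified the config for training so that the last layer's weights (the mismatched shape) could be dropped.
The error still occurs, though. What should I change in this case?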
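The arithmetic quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it assumes an anchor-based head with 3 anchors per scale, as the yolov7-tiny shapes in the error message suggest. It would not describe an anchor-free head such as yolov8's.

```python
def detect_head_channels(num_classes: int, num_anchors: int = 3) -> int:
    """Output channels of an anchor-based detection conv: na * (nc + 5)."""
    # The 5 covers 4 box coordinates plus 1 objectness score per anchor.
    return num_anchors * (num_classes + 5)


def classes_from_channels(channels: int, num_anchors: int = 3) -> int:
    """Invert the formula: recover the class count from a weight's first dimension."""
    if channels % num_anchors != 0:
        raise ValueError(f"{channels} is not divisible by {num_anchors} anchors")
    return channels // num_anchors - 5


print(detect_head_channels(3))    # custom dataset with nc: 3
print(detect_head_channels(80))   # 80-class pretrained checkpoint
print(classes_from_channels(21))  # class count implied by a (21, 128, 1, 1) weight
```

Reading the first dimension of the two shapes in the error message this way shows which class count the network was built with and which one the checkpoint was trained with.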
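Judging from the error, the model structure and the weight shapes do not match. This was probably caused by changing the number of classes in the final layer.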
The weight-loading logic lives here. You could try debugging inside this function to see what happens: https://github.com/mindspore-lab/mindyolo/blob/master/mindyolo/utils/utils.py#L113
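One possible workaround to try while debugging, sketched here under assumptions and not taken from mindyolo's own code, is to drop every checkpoint entry whose shape differs from the network's before the parameters are loaded. The helper name `drop_shape_mismatches` is hypothetical, and the demo uses stand-in objects that only carry a `.shape` so it runs without MindSpore installed.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Tuple


def drop_shape_mismatches(
    net_params: Dict[str, Any], ckpt_params: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[str]]:
    """Return (entries safe to load, names skipped because shapes differ)."""
    kept: Dict[str, Any] = {}
    skipped: List[str] = []
    for name, value in ckpt_params.items():
        target = net_params.get(name)
        # Skip only when the network has this parameter with another shape.
        if target is not None and tuple(target.shape) != tuple(value.shape):
            skipped.append(name)
            continue
        kept[name] = value
    return kept, skipped


# Stand-ins mimicking the shapes from the yolov7-tiny error message.
net = {
    "model.model.0.conv.weight": SimpleNamespace(shape=(32, 3, 3, 3)),
    "model.model.77.m.0.weight": SimpleNamespace(shape=(24, 128, 1, 1)),
}
ckpt = {
    "model.model.0.conv.weight": SimpleNamespace(shape=(32, 3, 3, 3)),
    "model.model.77.m.0.weight": SimpleNamespace(shape=(255, 128, 1, 1)),
}
kept, skipped = drop_shape_mismatches(net, ckpt)
print(sorted(kept))
print(skipped)

# With MindSpore, the inputs might be built like this (assumption, untested here):
#   import mindspore as ms
#   ckpt = ms.load_checkpoint("path/to/pretrained.ckpt")
#   net_params = {p.name: p for p in network.get_parameters()}
#   kept, skipped = drop_shape_mismatches(net_params, ckpt)
#   ms.load_param_into_net(network, kept)
```

Layers listed in `skipped` would keep their freshly initialized weights, so only the detection head is trained from scratch while the backbone still starts from the pretrained checkpoint.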