mindyolo
Training yolov7 fails on Ascend910p
Environment
Hardware Environment(Ascend/GPU/CPU):
/device ascend
Software Environment:
- MindSpore version (source or binary):
- Python version (e.g., Python 3.7.5):
- OS platform and distribution (e.g., Linux Ubuntu 16.04):
- GCC/Compiler version (if compiled from source):
Describe the current behavior
Describe the expected behavior
Steps to reproduce the issue
Related log / screenshot
Special notes for this issue
python3 train.py -c configs/yolov7/yolov7.yaml
2023-07-11 12:30:32,961 [INFO] parse_args:
2023-07-11 12:30:32,961 [INFO] device_target Ascend
2023-07-11 12:30:32,961 [INFO] save_dir ./runs/2023.07.11-12.30.32
2023-07-11 12:30:32,961 [INFO] device_per_servers 8
2023-07-11 12:30:32,961 [INFO] log_level INFO
2023-07-11 12:30:32,961 [INFO] is_parallel False
2023-07-11 12:30:32,961 [INFO] ms_mode 0
2023-07-11 12:30:32,961 [INFO] ms_amp_level O0
2023-07-11 12:30:32,961 [INFO] keep_loss_fp32 True
2023-07-11 12:30:32,961 [INFO] ms_loss_scaler static
2023-07-11 12:30:32,961 [INFO] ms_loss_scaler_value 1024.0
2023-07-11 12:30:32,961 [INFO] ms_grad_sens 1024.0
2023-07-11 12:30:32,961 [INFO] ms_jit True
2023-07-11 12:30:32,961 [INFO] ms_enable_graph_kernel False
2023-07-11 12:30:32,961 [INFO] ms_datasink False
2023-07-11 12:30:32,961 [INFO] overflow_still_update True
2023-07-11 12:30:32,961 [INFO] ema True
2023-07-11 12:30:32,961 [INFO] weight
2023-07-11 12:30:32,961 [INFO] ema_weight
2023-07-11 12:30:32,961 [INFO] freeze []
2023-07-11 12:30:32,961 [INFO] epochs 300
2023-07-11 12:30:32,961 [INFO] per_batch_size 16
2023-07-11 12:30:32,961 [INFO] img_size 640
2023-07-11 12:30:32,961 [INFO] nbs 64
2023-07-11 12:30:32,961 [INFO] accumulate 1
2023-07-11 12:30:32,961 [INFO] auto_accumulate False
2023-07-11 12:30:32,961 [INFO] log_interval 100
2023-07-11 12:30:32,961 [INFO] single_cls False
2023-07-11 12:30:32,961 [INFO] sync_bn False
2023-07-11 12:30:32,961 [INFO] keep_checkpoint_max 100
2023-07-11 12:30:32,961 [INFO] run_eval False
2023-07-11 12:30:32,961 [INFO] conf_thres 0.001
2023-07-11 12:30:32,961 [INFO] iou_thres 0.65
2023-07-11 12:30:32,961 [INFO] conf_free False
2023-07-11 12:30:32,961 [INFO] rect False
2023-07-11 12:30:32,961 [INFO] nms_time_limit 20.0
2023-07-11 12:30:32,961 [INFO] recompute True
2023-07-11 12:30:32,961 [INFO] recompute_layers 5
2023-07-11 12:30:32,961 [INFO] seed 2
2023-07-11 12:30:32,961 [INFO] summary True
2023-07-11 12:30:32,961 [INFO] profiler False
2023-07-11 12:30:32,961 [INFO] profiler_step_num 1
2023-07-11 12:30:32,961 [INFO] opencv_threads_num 2
2023-07-11 12:30:32,961 [INFO] enable_modelarts False
2023-07-11 12:30:32,961 [INFO] data_url
2023-07-11 12:30:32,961 [INFO] ckpt_url
2023-07-11 12:30:32,961 [INFO] multi_data_url
2023-07-11 12:30:32,961 [INFO] pretrain_url
2023-07-11 12:30:32,961 [INFO] train_url
2023-07-11 12:30:32,961 [INFO] data_dir /cache/data/
2023-07-11 12:30:32,961 [INFO] ckpt_dir /cache/pretrain_ckpt/
2023-07-11 12:30:32,961 [INFO] data.path /home/ma-user/work/
2023-07-11 12:30:32,961 [INFO] data.train_set /home/ma-user/work/night_car/car_train.txt
2023-07-11 12:30:32,961 [INFO] data.val_set /home/ma-user/work/night_car/car_val.txt
2023-07-11 12:30:32,961 [INFO] data.test_set /home/ma-user/work/night_car/car_val.txt
2023-07-11 12:30:32,961 [INFO] data.nc 1
2023-07-11 12:30:32,961 [INFO] data.names ['car']
2023-07-11 12:30:32,961 [INFO] data.dataset_name coco
2023-07-11 12:30:32,961 [INFO] data.train_transforms [{'func_name': 'mosaic', 'prob': 1.0, 'mosaic9_prob': 0.2, 'translate': 0.2, 'scale': 0.9}, {'func_name': 'mixup', 'prob': 0.15, 'alpha': 8.0, 'beta': 8.0, 'needed_mosaic': True}, {'func_name': 'hsv_augment', 'prob': 1.0, 'hgain': 0.015, 'sgain': 0.7, 'vgain': 0.4}, {'func_name': 'pastein', 'prob': 0.15, 'num_sample': 30}, {'func_name': 'label_norm', 'xyxy2xywh_': True}, {'func_name': 'fliplr', 'prob': 0.5}, {'func_name': 'label_pad', 'padding_size': 160, 'padding_value': -1}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}]
2023-07-11 12:30:32,961 [INFO] data.test_transforms [{'func_name': 'letterbox', 'scaleup': False}, {'func_name': 'label_norm', 'xyxy2xywh_': True}, {'func_name': 'label_pad', 'padding_size': 160, 'padding_value': -1}, {'func_name': 'image_norm', 'scale': 255.0}, {'func_name': 'image_transpose', 'bgr2rgb': True, 'hwc2chw': True}]
2023-07-11 12:30:32,961 [INFO] data.num_parallel_workers 4
2023-07-11 12:30:32,961 [INFO] optimizer.optimizer momentum
2023-07-11 12:30:32,961 [INFO] optimizer.lr_init 0.01
2023-07-11 12:30:32,961 [INFO] optimizer.momentum 0.937
2023-07-11 12:30:32,961 [INFO] optimizer.nesterov True
2023-07-11 12:30:32,961 [INFO] optimizer.loss_scale 1.0
2023-07-11 12:30:32,961 [INFO] optimizer.warmup_epochs 3
2023-07-11 12:30:32,961 [INFO] optimizer.warmup_momentum 0.8
2023-07-11 12:30:32,961 [INFO] optimizer.warmup_bias_lr 0.1
2023-07-11 12:30:32,961 [INFO] optimizer.min_warmup_step 1000
2023-07-11 12:30:32,961 [INFO] optimizer.group_param yolov7
2023-07-11 12:30:32,961 [INFO] optimizer.gp_weight_decay 0.0005
2023-07-11 12:30:32,961 [INFO] optimizer.start_factor 1.0
2023-07-11 12:30:32,961 [INFO] optimizer.end_factor 0.1
2023-07-11 12:30:32,961 [INFO] optimizer.epochs 300
2023-07-11 12:30:32,961 [INFO] optimizer.nbs 64
2023-07-11 12:30:32,961 [INFO] optimizer.accumulate 1
2023-07-11 12:30:32,961 [INFO] optimizer.total_batch_size 16
2023-07-11 12:30:32,961 [INFO] loss.name YOLOv7Loss
2023-07-11 12:30:32,961 [INFO] loss.box 0.05
2023-07-11 12:30:32,961 [INFO] loss.cls 0.3
2023-07-11 12:30:32,961 [INFO] loss.cls_pw 1.0
2023-07-11 12:30:32,961 [INFO] loss.obj 0.7
2023-07-11 12:30:32,961 [INFO] loss.obj_pw 1.0
2023-07-11 12:30:32,961 [INFO] loss.fl_gamma 0.0
2023-07-11 12:30:32,961 [INFO] loss.anchor_t 4.0
2023-07-11 12:30:32,961 [INFO] loss.label_smoothing 0.0
2023-07-11 12:30:32,961 [INFO] network.model_name yolov7
2023-07-11 12:30:32,961 [INFO] network.depth_multiple 1.0
2023-07-11 12:30:32,961 [INFO] network.width_multiple 1.0
2023-07-11 12:30:32,961 [INFO] network.stride [8, 16, 32]
2023-07-11 12:30:32,961 [INFO] network.anchors [[12, 16, 19, 36, 40, 28], [36, 75, 76, 55, 72, 146], [142, 110, 192, 243, 459, 401]]
2023-07-11 12:30:32,961 [INFO] network.backbone [[-1, 1, 'ConvNormAct', [32, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 2]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 2]], [-1, 1, 'ConvNormAct', [64, 1, 1]], [-2, 1, 'ConvNormAct', [64, 1, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-3, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 2]], [[-1, -3], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-2, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [512, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-3, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 2]], [[-1, -3], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [1024, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [512, 1, 1]], [-3, 1, 'ConvNormAct', [512, 1, 1]], [-1, 1, 'ConvNormAct', [512, 3, 2]], [[-1, -3], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [[-1, -3, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [1024, 1, 1]]]
2023-07-11 12:30:32,961 [INFO] network.head [[-1, 1, 'SPPCSPC', [512]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [37, 1, 'ConvNormAct', [256, 1, 1]], [[-1, -2], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'Upsample', ['None', 2, 'nearest']], [24, 1, 'ConvNormAct', [128, 1, 1]], [[-1, -2], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-2, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [-1, 1, 'ConvNormAct', [64, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [128, 1, 1]], [-3, 1, 'ConvNormAct', [128, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 2]], [[-1, -3, 63], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-2, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [-1, 1, 'ConvNormAct', [128, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'MP', []], [-1, 1, 'ConvNormAct', [256, 1, 1]], [-3, 1, 'ConvNormAct', [256, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 2]], [[-1, -3, 51], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [512, 1, 1]], [-2, 1, 'ConvNormAct', [512, 1, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [-1, 1, 'ConvNormAct', [256, 3, 1]], [[-1, -2, -3, -4, -5, -6], 1, 'Concat', [1]], [-1, 1, 'ConvNormAct', [512, 1, 1]], [75, 1, 'RepConv', [256, 3, 1]], [88, 1, 'RepConv', [512, 3, 1]], [101, 1, 'RepConv', 
[1024, 3, 1]], [[102, 103, 104], 1, 'YOLOv7Head', ['nc', 'anchors', 'stride']]]
2023-07-11 12:30:32,961 [INFO] config configs/yolov7/yolov7.yaml
2023-07-11 12:30:32,961 [INFO] rank 0
2023-07-11 12:30:32,961 [INFO] rank_size 1
2023-07-11 12:30:32,961 [INFO] total_batch_size 16
2023-07-11 12:30:32,961 [INFO] callback []
2023-07-11 12:30:32,961 [INFO]
2023-07-11 12:30:32,963 [INFO] Please check the above information for the configurations
2023-07-11 12:30:33,910 [WARNING] Parse Model, args: nearest, keep str type
2023-07-11 12:30:34,007 [WARNING] Parse Model, args: nearest, keep str type
2023-07-11 12:30:34,410 [INFO] number of network params, total: 37.246339M, trainable: 37.196556M
2023-07-11 12:30:34,422 [INFO] Turn on recompute, and the results of the first 5 layers will be recomputed.
2023-07-11 12:30:54,044 [WARNING] Parse Model, args: nearest, keep str type
2023-07-11 12:30:54,141 [WARNING] Parse Model, args: nearest, keep str type
2023-07-11 12:30:54,554 [INFO] number of network params, total: 37.246339M, trainable: 37.196556M
2023-07-11 12:30:54,566 [INFO] Turn on recompute, and the results of the first 5 layers will be recomputed.
TotalTime = 12.5261, [16] [parse]: 0.0276879 [symbol_resolve]: 0.0445111, [1] [Cycle 1]: 0.0444429, [1] [resolve]: 0.0444209 [combine_like_graphs]: 1.16001e-06 [meta_unpack_prepare]: 0.000176782 [abstract_specialize]: 0.450212 [auto_monad]: 0.00803334 [inline]: 5.7661e-05 [pipeline_split]: 3.225e-05 [optimize]: 0.267299, [22] [py_interpret_to_execute]: 0.000856569 [simplify_data_structures]: 0.00175699 [opt_a]: 0.225315, [2] [Cycle 1]: 0.189621, [26] [expand_dump_flag]: 2.129e-05 [switch_simplify]: 0.00170734 [a_1]: 0.0976232 [recompute_prepare]: 0.000858418 [updatestate_depend_eliminate]: 0.00873714 [updatestate_assign_eliminate]: 0.00803381 [updatestate_loads_eliminate]: 0.000671716 [parameter_eliminate]: 4.80999e-06 [a_2]: 0.00837349 [accelerated_algorithm]: 0.000756657 [pynative_shard]: 2.82e-06 [auto_parallel]: 6.34999e-06 [parallel]: 2.007e-05 [allreduce_fusion]: 0.000251762 [virtual_dataset]: 0.000502855 [get_grad_eliminate_]: 0.000457934 [virtual_output]: 0.000458515 [meta_fg_expand]: 0.00116805 [after_resolve]: 0.00144858 [a_after_grad]: 0.000627087 [renormalize]: 0.044053 [real_op_eliminate]: 0.000615936 [auto_monad_grad]: 6.74999e-06 [auto_monad_eliminator]: 0.00226781 [cse]: 0.00587297 [a_3]: 0.0046758 [Cycle 2]: 0.0336986, [26] [expand_dump_flag]: 2.83e-06 [switch_simplify]: 0.000454435 [a_1]: 0.00910322 [recompute_prepare]: 0.000364614 [updatestate_depend_eliminate]: 0.000308563 [updatestate_assign_eliminate]: 0.000389534 [updatestate_loads_eliminate]: 0.000425564 [parameter_eliminate]: 3.24e-06 [a_2]: 0.00805789 [accelerated_algorithm]: 0.000751357 [pynative_shard]: 2.22001e-06 [auto_parallel]: 4.88e-06 [parallel]: 3.75e-06 [allreduce_fusion]: 0.000214552 [virtual_dataset]: 0.000485115 [get_grad_eliminate_]: 0.000441414 [virtual_output]: 0.000442174 [meta_fg_expand]: 0.000825678 [after_resolve]: 0.00129374 [a_after_grad]: 0.000579015 [renormalize]: 2.00002e-07 [real_op_eliminate]: 0.000440745 [auto_monad_grad]: 3.28e-06 [auto_monad_eliminator]: 
0.0019583 [cse]: 0.00241402 [a_3]: 0.00446869 [item_dict_eliminate_after_opt_a]: 0.00123497, [1] [Cycle 1]: 0.00122313, [2] [mutable_eliminate]: 0.000456055 [item_dict_eliminate]: 0.000746987 [clean_after_opta]: 0.000492505 [opt_b]: 0.0161969, [1] [Cycle 1]: 0.0161843, [7] [b_1]: 0.0124879 [b_2]: 0.000552775 [updatestate_depend_eliminate]: 0.000305804 [updatestate_assign_eliminate]: 0.000387884 [updatestate_loads_eliminate]: 0.000426934 [renormalize]: 6.50005e-07 [cse]: 0.0019323 [cconv]: 0.000306543 [opt_after_cconv]: 0.00515957, [1] [Cycle 1]: 0.00514809, [6] [c_1]: 0.00202726 [updatestate_depend_eliminate]: 0.000309404 [updatestate_assign_eliminate]: 0.000387333 [updatestate_loads_eliminate]: 0.000425044 [cse]: 0.00193491 [renormalize]: 6.90008e-07 [remove_dup_value]: 0.000110661 [tuple_transform]: 0.00356582, [1] [Cycle 1]: 0.00355534, [2] [d_1]: 0.00353368 [renormalize]: 5.50004e-07 [add_cache_embedding]: 0.00431373 [add_recomputation]: 0.00476212 [cse_after_recomputation]: 0.00210252, [1] [Cycle 1]: 0.00208754, [1] [cse]: 0.00202823 [environ_conv]: 0.000867508 [label_micro_interleaved_index]: 3.60001e-06 [slice_recompute_activation]: 3.10101e-06 [micro_interleaved_order_control]: 2.51e-06 [reorder_send_recv_between_fp_bp]: 2.26e-06 [comm_op_add_attrs]: 2.41e-05 [add_comm_op_reuse_tag]: 1.91999e-06 [overlap_opt_shard_in_pipeline]: 1.62999e-06 [handle_group_info]: 1.49e-06 [auto_monad_reorder]: 0.00286117 [eliminate_forward_cnode]: 7.30011e-07 [eliminate_special_op_node]: 0.0021148 [validate]: 0.00246263 [distribtued_split]: 1.77001e-06 [task_emit]: 11.7201 [execute]: 1.074e-05 Sums parse : 0.027688s : 0.22% symbol_resolve.resolve : 0.044421s : 0.35% combine_like_graphs : 0.000001s : 0.00% meta_unpack_prepare : 0.000177s : 0.00% abstract_specialize : 0.450212s : 3.60% auto_monad : 0.008033s : 0.06% inline : 0.000058s : 0.00% pipeline_split : 0.000032s : 0.00% optimize.py_interpret_to_execute : 0.000857s : 0.01% optimize.simplify_data_structures : 0.001757s : 
0.01% optimize.opt_a.expand_dump_flag : 0.000024s : 0.00% optimize.opt_a.switch_simplify : 0.002162s : 0.02% optimize.opt_a.a_1 : 0.106726s : 0.85% optimize.opt_a.recompute_prepare : 0.001223s : 0.01% optimize.opt_a.updatestate_depend_eliminate : 0.009046s : 0.07% optimize.opt_a.updatestate_assign_eliminate : 0.008423s : 0.07% optimize.opt_a.updatestate_loads_eliminate : 0.001097s : 0.01% optimize.opt_a.parameter_eliminate : 0.000008s : 0.00% optimize.opt_a.a_2 : 0.016431s : 0.13% optimize.opt_a.accelerated_algorithm : 0.001508s : 0.01% optimize.opt_a.pynative_shard : 0.000005s : 0.00% optimize.opt_a.auto_parallel : 0.000011s : 0.00% optimize.opt_a.parallel : 0.000024s : 0.00% optimize.opt_a.allreduce_fusion : 0.000466s : 0.00% optimize.opt_a.virtual_dataset : 0.000988s : 0.01% optimize.opt_a.get_grad_eliminate_ : 0.000899s : 0.01% optimize.opt_a.virtual_output : 0.000901s : 0.01% optimize.opt_a.meta_fg_expand : 0.001994s : 0.02% optimize.opt_a.after_resolve : 0.002742s : 0.02% optimize.opt_a.a_after_grad : 0.001206s : 0.01% optimize.opt_a.renormalize : 0.044053s : 0.35% optimize.opt_a.real_op_eliminate : 0.001057s : 0.01% optimize.opt_a.auto_monad_grad : 0.000010s : 0.00% optimize.opt_a.auto_monad_eliminator : 0.004226s : 0.03% optimize.opt_a.cse : 0.008287s : 0.07% optimize.opt_a.a_3 : 0.009144s : 0.07% optimize.item_dict_eliminate_after_opt_a.mutable_eliminate : 0.000456s : 0.00% optimize.item_dict_eliminate_after_opt_a.item_dict_eliminate : 0.000747s : 0.01% optimize.clean_after_opta : 0.000493s : 0.00% optimize.opt_b.b_1 : 0.012488s : 0.10% optimize.opt_b.b_2 : 0.000553s : 0.00% optimize.opt_b.updatestate_depend_eliminate : 0.000306s : 0.00% optimize.opt_b.updatestate_assign_eliminate : 0.000388s : 0.00% optimize.opt_b.updatestate_loads_eliminate : 0.000427s : 0.00% optimize.opt_b.renormalize : 0.000001s : 0.00% optimize.opt_b.cse : 0.001932s : 0.02% optimize.cconv : 0.000307s : 0.00% optimize.opt_after_cconv.c_1 : 0.002027s : 0.02% 
optimize.opt_after_cconv.updatestate_depend_eliminate : 0.000309s : 0.00% optimize.opt_after_cconv.updatestate_assign_eliminate : 0.000387s : 0.00% optimize.opt_after_cconv.updatestate_loads_eliminate : 0.000425s : 0.00% optimize.opt_after_cconv.cse : 0.001935s : 0.02% optimize.opt_after_cconv.renormalize : 0.000001s : 0.00% optimize.remove_dup_value : 0.000111s : 0.00% optimize.tuple_transform.d_1 : 0.003534s : 0.03% optimize.tuple_transform.renormalize : 0.000001s : 0.00% optimize.add_cache_embedding : 0.004314s : 0.03% optimize.add_recomputation : 0.004762s : 0.04% optimize.cse_after_recomputation.cse : 0.002028s : 0.02% optimize.environ_conv : 0.000868s : 0.01% optimize.label_micro_interleaved_index : 0.000004s : 0.00% optimize.slice_recompute_activation : 0.000003s : 0.00% optimize.micro_interleaved_order_control : 0.000003s : 0.00% optimize.reorder_send_recv_between_fp_bp : 0.000002s : 0.00% optimize.comm_op_add_attrs : 0.000024s : 0.00% optimize.add_comm_op_reuse_tag : 0.000002s : 0.00% optimize.overlap_opt_shard_in_pipeline : 0.000002s : 0.00% optimize.handle_group_info : 0.000001s : 0.00% auto_monad_reorder : 0.002861s : 0.02% eliminate_forward_cnode : 0.000001s : 0.00% eliminate_special_op_node : 0.002115s : 0.02% validate : 0.002463s : 0.02% distribtued_split : 0.000002s : 0.00% task_emit : 11.720087s : 93.59% execute : 0.000011s : 0.00%
Time group info: ------[substitution.] 0.069899 13241 0.05% : 0.000035s : 2: substitution.depend_value_elim 61.05% : 0.042670s : 10: substitution.getattr_resolve 0.89% : 0.000624s : 1751: substitution.graph_param_transform 26.80% : 0.018731s : 955: substitution.inline 0.11% : 0.000079s : 320: substitution.less_batch_normalization 0.05% : 0.000036s : 9: substitution.meta_unpack_prepare 0.73% : 0.000507s : 1906: substitution.replace_old_param 2.34% : 0.001638s : 952: substitution.tuple_list_get_item_eliminator 2.96% : 0.002069s : 3508: substitution.updatestate_pure_node_eliminater 5.02% : 0.003510s : 3828: substitution.updatestate_useless_node_eliminater ------[renormalize.] 0.043854 2 50.05% : 0.021948s : 1: renormalize.infer 49.95% : 0.021905s : 1: renormalize.specialize ------[replace.] 0.020341 1916 5.97% : 0.001215s : 9: replace.getattr_resolve 61.80% : 0.012570s : 955: replace.inline 32.23% : 0.006556s : 952: replace.tuple_list_get_item_eliminator ------[match.] 0.063034 1916 67.69% : 0.042665s : 9: match.getattr_resolve 29.72% : 0.018731s : 955: match.inline 2.60% : 0.001638s : 952: match.tuple_list_get_item_eliminator ------[func_graph_cloner_run.] 0.037680 1004 34.60% : 0.013037s : 47: func_graph_cloner_run.FuncGraphClonerGraph 22.20% : 0.008364s : 862: func_graph_cloner_run.FuncGraphClonerNode 43.20% : 0.016279s : 95: func_graph_cloner_run.FuncGraphSpecializer ------[meta_graph.] 0.000000 0 ------[manager.] 0.000000 0 ------[pynative] 0.000000 0 ------[others.] 
0.210734 104 12.09% : 0.025470s : 50: opt.transform.opt_a 5.90% : 0.012443s : 23: opt.transform.opt_b 21.07% : 0.044391s : 2: opt.transform.opt_resolve 0.57% : 0.001198s : 2: opt.transforms.item_dict_eliminate_after_opt_a 0.08% : 0.000160s : 1: opt.transforms.meta_unpack_prepare 56.65% : 0.119370s : 20: opt.transforms.opt_a 0.96% : 0.002024s : 1: opt.transforms.opt_after_cconv 0.26% : 0.000550s : 1: opt.transforms.opt_b 1.68% : 0.003531s : 1: opt.transforms.opt_trans_graph 0.76% : 0.001597s : 3: opt.transforms.special_op_eliminate
2023-07-11 12:31:07,575 [INFO] ema_weight not exist, default pretrain weight is currently used.
2023-07-11 12:31:07,722 [INFO] Dataset cache file hash/version check fail.
2023-07-11 12:31:07,722 [INFO] Datset caching now...
Scanning '/home/ma-user/work/night_car/car_train.cache' images and labels... 4726 found, 0 missing, 179 empty, 0 corrupted: 100%|█| 4726/4726 [00:03<
2023-07-11 12:31:11,640 [INFO] New cache created: /home/ma-user/work/night_car/car_train.cache.npy
2023-07-11 12:31:11,647 [INFO] Dataset caching success.
2023-07-11 12:31:11,725 [INFO] Dataloader num parallel workers: [4]
2023-07-11 12:31:14,025 [INFO] Registry(name=callback, total=4)
2023-07-11 12:31:14,025 [INFO] (0): YoloxSwitchTrain in mindyolo/utils/callback.py
2023-07-11 12:31:14,025 [INFO] (1): EvalWhileTrain in mindyolo/utils/callback.py
2023-07-11 12:31:14,025 [INFO] (2): SummaryCallback in mindyolo/utils/callback.py
2023-07-11 12:31:14,025 [INFO] (3): ProfilerCallback in mindyolo/utils/callback.py
2023-07-11 12:31:14,025 [INFO]
2023-07-11 12:31:14,427 [INFO] got 1 active callback as follows:
2023-07-11 12:31:14,428 [INFO] SummaryCallback()
2023-07-11 12:31:14,428 [WARNING] The first epoch will be compiled for the graph, which may take a long time; You can come back later :).
[ERROR] ANALYZER(28402,ffffbed26a70,python3):2023-07-11-12:58:09.445.089 [mindspore/ccsrc/pipeline/jit/static_analysis/async_eval_result.cc:66] HandleException] Exception happened, check the information as below.
The function call stack (See file '/home/ma-user/work/mindyolo/rank_0/om/analyze_fail.dat' for more details. Get instructions about analyze_fail.dat
at https://www.mindspore.cn/search?inputValue=analyze_fail.dat):
0 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:72
return train_step_func(*args)
^
1 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:57
if optimizer_update:
2 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:52
(loss, loss_items), grads = grad_fn(x, label)
^
3 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py:574
return grad_(fn, weights)(*args)
^
4 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:45
loss, loss_items = loss_fn(pred, label, x)
^
5 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:81
for pp in p:
6 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:69
bs, as_, gjs, gis, targets, anchors, tmasks = self.build_targets(p, targets, imgs) # bs: (nl, bs*5*na*gt_max)
^
7 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:150
for i in range(self.nl):
8 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:126
indices, anch, tmasks = self.find_3_positive(p, targets)
^
9 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:296
for i in range(self.nl):
10 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:298
gain[2:6] = get_tensor(shape, targets.dtype)[[3, 2, 3, 2]] # xyxy gain # [W, H, W, H]
^
11 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:918
if check_result:
12 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:921
if step == 1 and not const_utils.is_ascend():
13 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:931
if F.is_sequence_value_unknown(data_shape):
14 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:298
gain[2:6] = get_tensor(shape, targets.dtype)[[3, 2, 3, 2]] # xyxy gain # [W, H, W, H]
^
15 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:934
indices = const_utils.slice2indices(input_slice, data_shape)
^
Traceback (most recent call last):
File "train.py", line 309, in
- Ascend Error Message:
E89999: Inner Error, Please contact support engineer! E89999 op[Range], compile info not contain [_pattern][FUNC:AutoTilingHandlerParser][FILE:auto_tiling.cc][LINE:67] TraceBack (most recent call last): Failed to parse compile json[{"_sgt_cube_vector_core_type":"AiCore","device_id":"0"}] for op[Range, Range].[FUNC:TurnToOpParaCalculateV4][FILE:op_tiling.cc][LINE:442]
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
- The Traceback of Net Construct Code:
The function call stack (See file '/home/ma-user/work/mindyolo/rank_0/om/analyze_fail.dat' for more details. Get instructions about analyze_fail.dat
at https://www.mindspore.cn/search?inputValue=analyze_fail.dat):
0 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:72
return train_step_func(*args)
^
1 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:57
if optimizer_update:
2 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:52
(loss, loss_items), grads = grad_fn(x, label)
^
3 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py:574
return grad_(fn, weights)(*args)
^
4 In file /home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py:45
loss, loss_items = loss_fn(pred, label, x)
^
5 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:81
for pp in p:
6 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:69
bs, as_, gjs, gis, targets, anchors, tmasks = self.build_targets(p, targets, imgs) # bs: (nl, bs*5*na*gt_max)
^
7 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:150
for i in range(self.nl):
8 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:126
indices, anch, tmasks = self.find_3_positive(p, targets)
^
9 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:296
for i in range(self.nl):
10 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:298
gain[2:6] = get_tensor(shape, targets.dtype)[[3, 2, 3, 2]] # xyxy gain # [W, H, W, H]
^
11 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:918
if check_result:
12 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:921
if step == 1 and not const_utils.is_ascend():
13 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:931
if F.is_sequence_value_unknown(data_shape):
14 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py:298
gain[2:6] = get_tensor(shape, targets.dtype)[[3, 2, 3, 2]] # xyxy gain # [W, H, W, H]
^
15 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/multitype_ops/_compile_utils.py:934
indices = const_utils.slice2indices(input_slice, data_shape)
^
- C++ Call Stack: (For framework developers)
mindspore/ccsrc/plugin/device/ascend/kernel/tbe/dynamic_tbe_kernel_mod.cc:126 Resize
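For context, the failing line in yolov7_loss.py combines slice assignment with integer-array (fancy) indexing, which the trace above shows being routed through const_utils.slice2indices during graph compilation. A minimal NumPy sketch of the same operation (the shape values and variable names here are assumed for illustration, not the actual mindyolo tensors):

```python
import numpy as np

# Sketch of: gain[2:6] = get_tensor(shape, targets.dtype)[[3, 2, 3, 2]]
# "shape" is an assumed (bs, na, H, W) prediction-map shape.
gain = np.ones(7, dtype=np.float32)   # per-target scale vector
shape = np.array([16, 3, 80, 80])     # assumed feature-map shape
gain[2:6] = shape[[3, 2, 3, 2]]       # xyxy gain -> [W, H, W, H]
print(gain.tolist())
```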
Using user ma-user
EulerOS 2.0 (SP8), CANN-6.0.1
This seems to be related to the MindSpore version. You can try the master branch with MindSpore 2.0, and the r0.1 branch with MindSpore 1.8.1.
The current error occurs on the ModelArts MindSpore 2.0 image (provided by the support staff); the MindSpore 1.8.1 image on ModelArts, with the r0.1 branch, produces a different error.
Trying mindspore-1.8.1 gives the following error: [ERROR] ANALYZER(77504,ffffa12a0a40,python3):2023-07-11-18:21:39.720.409 [mindspore/ccsrc/pipeline/jit/static_analysis/async_eval_result.cc:66] HandleException] Exception happened, check the information as below.
The function call stack (See file '/home/ma-user/work/mindyolo/rank_0/om/analyze_fail.dat' for more details. Get instructions about analyze_fail.dat
at https://www.mindspore.cn/search?inputValue=analyze_fail.dat):
0 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py(81)
for pp in p:
1 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py(86)
for i in range(self.nl): # layer index
^
2 In file /home/ma-user/work/mindyolo/mindyolo/models/losses/yolov7_loss.py(123)
return _loss * bs, ops.stop_gradient(ops.stack((_loss, lbox, lobj, lcls)))
^
3 In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/function/array_func.py(1198)
return _stack(input_x)
^
Traceback (most recent call last):
File "train.py", line 290, in
This error looks like an operator type mismatch during graph compilation, which is very likely related to the CANN package and the MindSpore version.
You can run the following commands to check the MindSpore version and verify that it is installed correctly:
pip show mindspore
cat /path_to/mindspore/.commit_id
python
>>> import mindspore as ms
>>> ms.run_check()
Name: mindspore-ascend
Version: 1.8.1
Summary: MindSpore is a new open source deep learning training/inference framework that could be used for mobile, edge and cloud scenarios.
Home-page: https://www.mindspore.cn
Author: The MindSpore Authors
Author-email: [email protected]
License: Apache 2.0
Location: /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages
Requires: asttokens, astunparse, numpy, packaging, pillow, protobuf, psutil, scipy
Required-by: mindx-elastic
MindSpore version: 1.8.1
The result of multiplication calculation is correct, MindSpore has been installed successfully!
========
I first used the officially provided image:
Then I installed mindspore-ascend 1.8.1 inside the image and ran the training on ModelArts.
This approach may cause a version mismatch between MindSpore and CANN and trigger unknown errors; you can ask the official support staff for a standard image matching 1.8.1/1.9. For the supported version pairings, refer to the MindSpore website.
TypeError: For 'Stack', the 'x_type[3]' should be = base: Tensor[Float32], but got Float32. Do you know what these types mean? Right now I only use lbox as the loss_item; does that affect the results? It does run, although a bit slowly.
The type information can be inspected by adding a print at that location; if you only change the loss values used for printing, it will not affect the results.
It is recommended to use the specified MindSpore version; other versions may have compatibility issues. For MindSpore installation, note the pairing: the mindyolo r0.1 branch matches MindSpore 1.8.1 (and its corresponding CANN version); the mindyolo master branch matches MindSpore 2.0 (and its corresponding CANN version).
I cannot print this type at all; I am unable to switch out of static graph mode.
You can try setting these two options to run the code in dynamic-graph (PyNative) mode:
--ms_mode 1
--ms_jit False
t int64, reduce precision from int64 to int32.
Traceback (most recent call last):
File "train.py", line 291, in
- C++ Call Stack: (For framework developers)
mindspore/ccsrc/backend/common/session/kernel_build_client.h:110 Response
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 199 leaked semaphores to clean up at shutdown len(cache))
It looks like the problem occurred during gradient computation.
The problem appeared after running about 7 epochs, so it is most likely unrelated to the data. However, during training there were many WARNINGs: "don't support int64, reduce precision from int64 to int32".
t int64, reduce precision from int64 to int32.
Traceback (most recent call last):
  File "train.py", line 291, in <module>
    train(args)
  File "train.py", line 283, in train
    ms_jit=args.ms_jit
  File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 170, in train
    self.train_step(imgs, labels, cur_step=cur_step, cur_epoch=cur_epoch)
  File "/home/ma-user/work/mindyolo/mindyolo/utils/trainer_factory.py", line 218, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/home/ma-user/work/mindyolo/mindyolo/utils/train_step_factory.py", line 51, in train_step_func
    (loss, loss_items), grads = grad_fn(x, label)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/functional.py", line 455, in inner_aux_grad_fn
    return res, grad_weight(aux_fn, weights)(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 530, in after_grad
    return grad(fn, weights)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 98, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 518, in after_grad
    out = pynative_executor(fn, grad.sens_param, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1001, in __call__
    return self._executor(sens_param, obj, args)
RuntimeError: Response is empty
- C++ Call Stack: (For framework developers)
mindspore/ccsrc/backend/common/session/kernel_build_client.h:110 Response
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 199 leaked semaphores to clean up at shutdown len(cache))
This is probably a memory leak; memory usage rises after every epoch.
This warning generally does not affect normal training.
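As an aside on why the int64-to-int32 warning is usually harmless: index-like values fit comfortably within the int32 range, so the cast is lossless. A small NumPy sketch (values are illustrative):

```python
import numpy as np

# Index values downcast from int64 to int32 stay identical
# as long as they fit within the int32 range.
idx64 = np.array([3, 2, 3, 2], dtype=np.int64)
idx32 = idx64.astype(np.int32)
print(bool((idx64 == idx32).all()))
```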
But memory keeps climbing, and after a few epochs the job dies on its own.
A device-memory leak would normally report out of memory; this looks like a failure in execution or compilation under PyNative mode. You can try setting graph mode and running the full training:
--ms_mode 0 --ms_jit True
One more question: is this training on the COCO dataset with the default configuration? Also, are the versions of the code, MindSpore, and the CANN package in your environment matched?
1. Actually, on ModelArts I installed 1.8.1 on top of the official 1.8.0 image; according to the documentation they should be compatible.
- For training, I was originally using the default configuration, but at runtime there appears to be a type mismatch: TypeError: For 'Stack', the 'x_type[3]' should be = base: Tensor[Float32], but got Float32.
- The support engineers said it runs without problems on 2.0.
A mismatch between the MindSpore and CANN versions can cause strange problems. If a standard 2.0 environment is available, run directly on 2.0, using the master branch of the mindyolo code.