nni: Errors when trying to tune a YOLOv5 model


Open xiaoerqi opened this issue 2 years ago • 0 comments

Describe the issue: When I tune the hyperparameters of a YOLOv5 model, the search stops after the 33rd trial and no further trials are started. I do not know why.

Configuration:

search_space = {
    'lr0': {'_type': 'loguniform', '_value': [0.0001, 0.1]},
    'lrf': {'_type': 'loguniform', '_value': [0.0001, 0.1]},
    'momentum': {'_type': 'uniform', '_value': [0, 1]},
    'weight_decay': {'_type': 'loguniform', '_value': [0.00001, 0.1]},
    'warmup_epochs': {'_type': 'choice', '_value': [1.0, 2.0, 3.0, 4.0, 5.0]},
    'warmup_momentum': {'_type': 'uniform', '_value': [0, 1]},
    'warmup_bias_lr': {'_type': 'loguniform', '_value': [0.01, 0.5]},
    'box': {'_type': 'loguniform', '_value': [0.02, 0.2]},
    'cls': {'_type': 'uniform', '_value': [0.2, 4.0]},
    'cls_pw': {'_type': 'uniform', '_value': [0.2, 4.0]},
    'obj': {'_type': 'uniform', '_value': [0.2, 4.0]},
    'obj_pw': {'_type': 'uniform', '_value': [0.5, 2.0]},
    'iou_t': {'_type': 'uniform', '_value': [0.1, 0.7]},
    'anchor_t': {'_type': 'uniform', '_value': [2.0, 8.0]},
    'fl_gamma': {'_type': 'uniform', '_value': [0.0, 2.0]},
    'hsv_h': {'_type': 'uniform', '_value': [0.0, 0.1]},
    'hsv_s': {'_type': 'uniform', '_value': [0.0, 0.9]},
    'hsv_v': {'_type': 'uniform', '_value': [0.0, 0.9]},
    'degrees': {'_type': 'uniform', '_value': [0.0, 45.0]},
    'translate': {'_type': 'uniform', '_value': [0.0, 0.9]},
    'scale': {'_type': 'uniform', '_value': [0.0, 0.9]},
    'shear': {'_type': 'uniform', '_value': [0.0, 10.0]},
    'perspective': {'_type': 'uniform', '_value': [0.0, 0.001]},
    'flipud': {'_type': 'uniform', '_value': [0.0, 1.0]},
    'fliplr': {'_type': 'uniform', '_value': [0.0, 1.0]},
    'mosaic': {'_type': 'uniform', '_value': [0.0, 1.0]},
    'mixup': {'_type': 'uniform', '_value': [0.0, 1.0]},
    'copy_paste': {'_type': 'uniform', '_value': [0.0, 1.0]},
}

from nni.experiment import Experiment

experiment = Experiment('local')

experiment.config.trial_command = 'python train.py'
experiment.config.trial_code_directory = '.'

experiment.config.search_space = search_space

experiment.config.tuner.name = 'PBTTuner'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'

experiment.config.max_trial_number = 200
experiment.config.trial_concurrency = 2

Log message:

nnimanager.log:

[2022-06-28 21:55:17] INFO (NNIManager) Trial job d1iGX status changed from WAITING to RUNNING
[2022-06-28 21:55:17] INFO (NNIManager) Trial job DhFgI status changed from WAITING to RUNNING
[2022-06-28 22:15:30] INFO (NNIManager) Trial job DhFgI status changed from RUNNING to FAILED
[2022-06-28 22:15:30] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 33, "parameter_source": "algorithm", "parameters": {"lr0": 0.007847636081514123, "lrf": 0.00010000000000000009, "momentum": 0.9949611961553663, "weight_decay": 0.0002105884612143452, "warmup_epochs": 3, "warmup_momentum": 0.8239692329627215, "warmup_bias_lr": 0.010000000000000004, "box": 0.09659218313013274, "cls": 4, "cls_pw": 0.8660525584460326, "obj": 0.5513053722657748, "obj_pw": 1.7026707384392579, "iou_t": 0.3110236788877514, "anchor_t": 4.297665940821455, "fl_gamma": 1.3170669026460455, "hsv_h": 0.04555749854374458, "hsv_s": 0.9, "hsv_v": 0.1603120726670286, "degrees": 29.65120510167928, "translate": 0.40978721707386856, "scale": 0.4012725212596169, "shear": 0.0381511756762154, "perspective": 0.0006953241095787209, "flipud": 0.6489223391335592, "fliplr": 0.8843805999769242, "mosaic": 0.35338938654888724, "mixup": 0.33934572807513336, "copy_paste": 0.2963437529070458, "load_checkpoint_dir": "/meda_data/meda_home/ai0400/nni-experiments/12na0qtr/checkpoint/0/2", "save_checkpoint_dir": "/meda_data/meda_home/ai0400/nni-experiments/12na0qtr/checkpoint/0/3"}, "parameter_index": 0}
[2022-06-28 22:15:35] INFO (NNIManager) submitTrialJob: form: { sequenceId: 33, hyperParameters: { value: '{"parameter_id": 33, "parameter_source": "algorithm", "parameters": {"lr0": 0.007847636081514123, "lrf": 0.00010000000000000009, "momentum": 0.9949611961553663, "weight_decay": 0.0002105884612143452, "warmup_epochs": 3, "warmup_momentum": 0.8239692329627215, "warmup_bias_lr": 0.010000000000000004, "box": 0.09659218313013274, "cls": 4, "cls_pw": 0.8660525584460326, "obj": 0.5513053722657748, "obj_pw": 1.7026707384392579, "iou_t": 0.3110236788877514, "anchor_t": 4.297665940821455, "fl_gamma": 1.3170669026460455, "hsv_h": 0.04555749854374458, "hsv_s": 0.9, "hsv_v": 0.1603120726670286, "degrees": 29.65120510167928, "translate": 0.40978721707386856, "scale": 0.4012725212596169, "shear": 0.0381511756762154, "perspective": 0.0006953241095787209, "flipud": 0.6489223391335592, "fliplr": 0.8843805999769242, "mosaic": 0.35338938654888724, "mixup": 0.33934572807513336, "copy_paste": 0.2963437529070458, "load_checkpoint_dir": "/meda_data/meda_home/ai0400/nni-experiments/12na0qtr/checkpoint/0/2", "save_checkpoint_dir": "/meda_data/meda_home/ai0400/nni-experiments/12na0qtr/checkpoint/0/3"}, "parameter_index": 0}', index: 0 }, placementConstraint: { type: 'None', gpus: [] } }
[2022-06-28 22:15:37] WARNING (IpcInterface) Commands jammed in buffer!
[2022-06-28 22:15:41] INFO (NNIManager) Trial job d1iGX status changed from RUNNING to SUCCEEDED
[2022-06-28 22:15:41] WARNING (IpcInterface) Commands jammed in buffer!
[2022-06-28 22:15:41] INFO (NNIManager) Trial job q8xey status changed from WAITING to RUNNING
[2022-06-28 22:15:41] WARNING (IpcInterface) Commands jammed in buffer!
[2022-06-28 22:15:42] WARNING (IpcInterface) Commands jammed in buffer!
[2022-06-28 22:15:47] WARNING (IpcInterface) Commands jammed in buffer!
[2022-06-28 22:15:52] WARNING (IpcInterface) Commands jammed in buffer!
[2022-06-28 22:15:57] WARNING (IpcInterface) Commands jammed in buffer!
[2022-06-28 22:15:58] WARNING (IpcInterface) Commands jammed in buffer!

dispatcher.log:

[2022-06-28 21:55:12] INFO (pbt_tuner_AutoML/Thread-1) Generate parameter : {'lr0': 0.000100958520617837, 'lrf': 0.00012419880162974732, 'momentum': 0.4447399146365053, 'weight_decay': 0.008890009183227533, 'warmup_epochs': 2, 'warmup_momentum': 0.3324798996851569, 'warmup_bias_lr': 0.06482658512614052, 'box': 0.04015446182463376, 'cls': 3.430996203056549, 'cls_pw': 3.5531838970627616, 'obj': 3.2292296977345654, 'obj_pw': 1.1423220317753269, 'iou_t': 0.2381633966927072, 'anchor_t': 3.6624527922612113, 'fl_gamma': 1.2938095953206938, 'hsv_h': 0.08724946381814033, 'hsv_s': 0.48729518447206005, 'hsv_v': 0.6850549331511687, 'degrees': 8.35411877589698, 'translate': 0.2799735150805144, 'scale': 0.49437072935142645, 'shear': 1.0739438860392903, 'perspective': 0.00032456166047105254, 'flipud': 0.3377371881780348, 'fliplr': 0.3595283254066768, 'mosaic': 0.6995583860676831, 'mixup': 0.9461625221554246, 'copy_paste': 0.7553324420125425, 'load_checkpoint_dir': '/meda_data/meda_home/ai0400/nni-experiments/12na0qtr/checkpoint/5/2', 'save_checkpoint_dir': '/meda_data/meda_home/ai0400/nni-experiments/12na0qtr/checkpoint/5/3'}
[2022-06-28 22:15:27] INFO (pbt_tuner_AutoML/Thread-1) Get one trial result, id = 31, value = 0.7208189506901974
[2022-06-28 22:15:28] INFO (pbt_tuner_AutoML/Thread-1) Get one trial result, id = 32, value = 0.24413300195381346
[2022-06-28 22:15:30] ERROR (nni.runtime.msg_dispatcher_base/Thread-2) 'NoneType' object has no attribute 'score'
Traceback (most recent call last):
  File "/meda_home/ai0400/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 88, in command_queue_worker
    self.process_command(command, data)
  File "/meda_home/ai0400/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 147, in process_command
    command_handlers[command](data)
  File "/meda_home/ai0400/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 170, in handle_trial_end
    self.tuner.trial_end(load(data['hyper_params'])['parameter_id'], data['event'] == 'SUCCEEDED')
  File "/meda_home/ai0400/.local/lib/python3.8/site-packages/nni/algorithms/hpo/pbt_tuner.py", line 454, in trial_end
    trial_info.score = value
AttributeError: 'NoneType' object has no attribute 'score'
[2022-06-28 22:15:30] INFO (pbt_tuner_AutoML/Thread-1) Generate parameter : {'lr0': 0.007847636081514123, 'lrf': 0.00010000000000000009, 'momentum': 0.9949611961553663, 'weight_decay': 0.0002105884612143452, 'warmup_epochs': 3, 'warmup_momentum': 0.8239692329627215, 'warmup_bias_lr': 0.010000000000000004, 'box': 0.09659218313013274, 'cls': 4, 'cls_pw': 0.8660525584460326, 'obj': 0.5513053722657748, 'obj_pw': 1.7026707384392579, 'iou_t': 0.3110236788877514, 'anchor_t': 4.297665940821455, 'fl_gamma': 1.3170669026460455, 'hsv_h': 0.04555749854374458, 'hsv_s': 0.9, 'hsv_v': 0.1603120726670286, 'degrees': 29.65120510167928, 'translate': 0.40978721707386856, 'scale': 0.4012725212596169, 'shear': 0.0381511756762154, 'perspective': 0.0006953241095787209, 'flipud': 0.6489223391335592, 'fliplr': 0.8843805999769242, 'mosaic': 0.35338938654888724, 'mixup': 0.33934572807513336, 'copy_paste': 0.2963437529070458, 'load_checkpoint_dir': '/meda_data/meda_home/ai0400/nni-experiments/12na0qtr/checkpoint/0/2', 'save_checkpoint_dir': '/meda_data/meda_home/ai0400/nni-experiments/12na0qtr/checkpoint/0/3'}
[2022-06-28 22:15:32] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2022-06-28 22:15:35] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated

nnictl stdout and stderr:

300 epochs completed in 0.333 hours.
Traceback (most recent call last):
  File "train.py", line 743, in <module>
    main(opt)
  File "train.py", line 638, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 486, in train
    strip_optimizer(f)  # strip optimizers
  File "/meda_data/meda_home/ai0400/Experiment/yolov5-master/utils/general.py", line 879, in strip_optimizer
    x = torch.load(f, map_location=torch.device('cpu'))
  File "/meda_home/ai0400/.local/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/meda_home/ai0400/.local/lib/python3.8/site-packages/torch/serialization.py", line 777, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
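
The `AttributeError` in dispatcher.log may be easier to reason about with a toy model. The sketch below is not NNI's code. It assumes, hypothetically, that a tuner keeps running trials in a dict, removes an entry when that trial's final result arrives, and looks the entry up again when the trial ends unsuccessfully. Under that assumption, a trial that reports its final metric and then crashes (as train.py appears to do inside `strip_optimizer`) would produce a `None` lookup. `ToyTuner`, `TrialInfo`, and all method names are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TrialInfo:
    """Hypothetical per-trial record; not NNI's actual class."""
    parameter_id: int
    score: Optional[float] = None


class ToyTuner:
    """Toy stand-in for a population-based tuner (assumed behavior)."""

    def __init__(self) -> None:
        self.running: Dict[int, TrialInfo] = {}
        self.finished: List[TrialInfo] = []

    def start(self, parameter_id: int) -> None:
        self.running[parameter_id] = TrialInfo(parameter_id)

    def receive_result(self, parameter_id: int, value: float) -> None:
        # The final metric arrives: move the trial from running to finished.
        info = self.running.pop(parameter_id)
        info.score = value
        self.finished.append(info)

    def trial_end_unguarded(self, parameter_id: int, success: bool) -> None:
        if success:
            return
        info = self.running.pop(parameter_id, None)
        # If the result was already received, info is None and this raises.
        info.score = float("-inf")
        self.finished.append(info)

    def trial_end_guarded(self, parameter_id: int, success: bool) -> None:
        if success:
            return
        info = self.running.pop(parameter_id, None)
        if info is None:
            # Result already recorded; nothing left to mark as failed.
            return
        info.score = float("-inf")
        self.finished.append(info)


tuner = ToyTuner()
tuner.start(32)
tuner.receive_result(32, 0.24413300195381346)  # metric reported first
try:
    tuner.trial_end_unguarded(32, success=False)  # then the process crashes
except AttributeError as exc:
    print("unguarded:", exc)
tuner.trial_end_guarded(32, success=False)  # guarded variant is a no-op
```

If this guess about the internals holds, the YOLOv5 crash and the dispatcher exit are linked: the trial failure itself is harmless, but a failure that arrives after a reported result is not handled.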

How to reproduce it?: I don't know whether NNI's failure to continue is caused by the error in YOLOv5. What can I do to avoid this bug?
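
One possible workaround on the trial side, sketched under assumptions: skip checkpoint files that are missing or empty (an empty file is what makes pickle raise "EOFError: Ran out of input"), tolerate a failed strip, and send the final metric only after cleanup is done. `finalize_trial` and its parameters are hypothetical names; the `strip` and `report` callables are passed in so the sketch runs without YOLOv5 or NNI installed.

```python
import os
from typing import Callable, Iterable, List


def finalize_trial(
    checkpoints: Iterable[str],
    strip: Callable[[str], None],
    report: Callable[[float], None],
    final_metric: float,
) -> List[str]:
    """Strip usable checkpoints, then report the final metric last.

    Returns the checkpoint paths that were skipped.
    """
    skipped: List[str] = []
    for path in checkpoints:
        # A missing or zero-byte file cannot be unpickled, so skip it up front.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            skipped.append(path)
            continue
        try:
            strip(path)
        except (EOFError, OSError):
            # A truncated file can still fail; do not let that kill the trial.
            skipped.append(path)
    # Reporting last keeps a cleanup crash from happening after the result was sent.
    report(final_metric)
    return skipped
```

In the trial script, `strip` would be YOLOv5's `strip_optimizer` and `report` would be `nni.report_final_result`. The traceback does not show whether train.py reports its metric before or after `strip_optimizer`, so the ordering here is a guess worth checking against the actual script.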

Jun 29 '22 02:06 xiaoerqi
