Error: KeyError: Parameter containing:
First of all, thank you for sharing your source code. I get an error when I try to train MobileNetV2 with a single GPU (i.e. distributed: enable: False in config/mbv2/dmcp.yaml). I used CIFAR10 and modified the loss class in the source code; my config modification is listed below. Command:
$ python main.py --mode train --data ./dataset/ --config config/mbv2/dmcp.yaml --flops 43
Error message:
(base) root@452fa72bec2d:/workspace/hdd/06_model_compression/dmcp# python main.py --mode train --data ./dataset/ --config config/mbv2/dmcp.yaml --flops 43
/workspace/hdd/06_model_compression/dmcp/utils/tools.py:61: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(f)
[2020-07-30 02:19:50,612][ main.py][line: 51][ INFO] {'training': {'epoch': 40, 'sandwich': {'sample_type': 'offset', 'max_width': 1.5, 'min_width': 0.1, 'width_offset': 0.1, 'num_sample': 4}, 'label_smooth': 0.1, 'distillation': {'enable': True, 'temperature': 1, 'loss_weight': 1, 'hard_label': False}}, 'arch': {'target_flops': '43', 'train_freq': 1, 'sample_type': ['max', 'min', 'scheduled_random', 'scheduled_random'], 'floss_type': 'log_l1', 'flop_loss_weight': 0.1, 'num_flops_stats_sample': 3000, 'num_model_sample': 5, 'start_train': 15640}, 'validation': {'width': [1.5], 'calibration': {'enable': True, 'num_batch': 5}}, 'evaluation': {'width': [1.5], 'calibration': {'enable': True, 'num_batch': 5}}, 'model': {'type': 'DMCPMobileNetV2', 'kwargs': {'num_classes': 10, 'input_size': 32, 'width': [0.1, 1.5, 0.1], 'prob_type': 'sigmoid'}, 'runner': {'type': 'DMCPRunner'}}, 'recover': {'enable': False, 'checkpoint': 'None'}, 'distributed': {'enable': False}, 'optimizer': {'momentum': 0.9, 'weight_decay': 4e-05, 'nesterov': True, 'no_wd': True}, 'lr_scheduler': {'base_lr': 0.2, 'warmup_lr': 0.5, 'warmup_steps': 1000, 'min_lr': 0.08, 'max_iter': 31280}, 'arch_lr_scheduler': {'base_lr': 0.5, 'warmup_lr': 0.5, 'min_lr': 0.1, 'max_iter': 31280, 'warmup_steps': 15640}, 'dataset': {'type': 'CIFAR10', 'augmentation': {'test_resize': 32, 'color_jitter': [0.2, 0.2, 0.2, 0.1]}, 'workers': 4, 'batch_size': 64, 'num_classes': 10, 'input_size': 32, 'path': './dataset/'}, 'logging': {'print_freq': 50}, 'random_seed': 0, 'save_path': './results/DMCPMobileNetV2_43_073002'}
[2020-07-30 02:19:50,613][normal_runner.py][line: 159][ INFO] using label_smooth: 0.1
Traceback (most recent call last):
File "main.py", line 75, in
main()
File "main.py", line 54, in main
train(config, runner, loaders, checkpoint, tb_logger)
File "main.py", line 30, in train
runner.train(train_loader, val_loader, optimizer, lr_scheduler, tb_logger)
File "/workspace/hdd/06_model_compression/dmcp/runner/dmcp_runner.py", line 46, in train
self._train_one_batch(x, y, optimizer, lr_scheduler, meters, criterions, end)
File "/workspace/hdd/06_model_compression/dmcp/runner/dmcp_runner.py", line 145, in _train_one_batch
criterions, end)
File "/workspace/hdd/06_model_compression/dmcp/runner/us_runner.py", line 201, in _train_one_batch
out = self.model(x)
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 151, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/root/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 162, in replicate
param_idx = param_indices[param]
KeyError: Parameter containing:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
device='cuda:0', requires_grad=True)
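If I read nn/parallel/replicate.py correctly, it builds a param-to-index map from module.parameters() and then indexes that map with every entry in each submodule's _parameters dict, so the KeyError means a registered parameter (here a 14-element all-zero tensor, which looks like one of the per-layer architecture/gating parameters) is not visible through parameters() when DataParallel replicates the model across GPUs. Below is a minimal, hypothetical sketch that reproduces the same failure mode on a machine with two or more visible GPUs; the Net class and its parameters() override are invented for illustration and are not DMCP's actual code.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Hypothetical module whose parameters() does not report every registered parameter."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)
        # registered in _parameters, so replicate() will try to look it up
        self.alpha = nn.Parameter(torch.zeros(4))

    def parameters(self, recurse=True):
        # invented override: alpha is hidden, so it never makes it into
        # replicate()'s param_indices map and the lookup raises KeyError
        return iter([self.fc.weight, self.fc.bias])

    def forward(self, x):
        return self.fc(x) * torch.sigmoid(self.alpha)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(Net().cuda())
    # raises: KeyError: Parameter containing: tensor([0., 0., 0., 0.], ...)
    model(torch.randn(2, 4).cuda())
```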
dmcp.yaml (mbv2):

training:
  epoch: 40
  sandwich:
    sample_type: offset
    max_width: &max_width 1.5
    min_width: &min_width 0.1
    width_offset: &width_offset 0.1
    num_sample: 4
  label_smooth: 0.1
  distillation:
    enable: true
    temperature: 1
    loss_weight: 1
    hard_label: False

arch:
  target_flops: None
  train_freq: 1
  sample_type: [max, min, scheduled_random, scheduled_random]
  floss_type: log_l1
  flop_loss_weight: 0.1
  num_flops_stats_sample: 3000
  num_model_sample: 5

validation:
  width: [*max_width]
  calibration:
    enable: True
    num_batch: 5

evaluation:
  width: [*max_width]
  calibration:
    enable: True
    num_batch: 5

model:
  type: DMCPMobileNetV2
  kwargs:
    num_classes: &num_classes 10
    input_size: &input_size 32
    width: [*min_width, *max_width, *width_offset]
    prob_type: sigmoid
  runner:
    type: DMCPRunner

recover:
  enable: False
  checkpoint: None

distributed:
  enable: False

optimizer:
  momentum: 0.9
  weight_decay: 0.00004
  nesterov: True
  no_wd: True

lr_scheduler:
  base_lr: 0.2
  warmup_lr: 0.5
  warmup_steps: 1000
  min_lr: 0.08

arch_lr_scheduler:
  base_lr: 0.5
  warmup_lr: 0.5
  min_lr: 0.1

dataset:
  type: CIFAR10
  augmentation:
    test_resize: 32
    color_jitter: [0.2, 0.2, 0.2, 0.1]
  workers: 4
  batch_size: 64
  num_classes: *num_classes
  input_size: *input_size

logging:
  print_freq: 50

random_seed: 0
save_path: ./results
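As a side note, the YAMLLoadWarning at the top of the log comes from utils/tools.py (line 61) calling yaml.load(f) without a Loader. Passing an explicit loader silences it; only the yaml.load call itself is taken from the warning, the surrounding code here is guessed:

```python
import yaml

with open('config/mbv2/dmcp.yaml') as f:
    # safe_load handles plain scalars, lists, anchors and aliases,
    # which is all this config uses
    config = yaml.safe_load(f)
# equivalently: config = yaml.load(f, Loader=yaml.FullLoader)
```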
Hello, I am also running this code and I got the same error as you. Have you solved it yet? Thank you very much.
Parallel problem (GPU num > 1) solution: remove the DataParallel wrapping, i.e. replace model = nn.DataParallel(xxx).to(device) with model = model.to(device).
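Roughly, the change looks like this; build_model() is just a placeholder for however the runner actually constructs the network, and setting CUDA_VISIBLE_DEVICES=0 is an alternative way to keep the process on a single GPU without touching the code:

```python
import torch
import torch.nn as nn

def build_model():
    # placeholder for the real model construction in the runner
    return nn.Linear(8, 2)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# before (fails inside replicate() when more than one GPU is visible):
# model = nn.DataParallel(build_model()).to(device)

# after: keep the model on a single device and skip DataParallel entirely
model = build_model().to(device)

out = model(torch.randn(4, 8, device=device))
```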