DB
Error when training model with MobileNetv3-Small backbone
I've successfully trained a MobileNetv3-Large backbone on ICDAR 2015. (See here for results.) However, I get the error below when trying to train a model with a MobileNetv3-Small backbone. @Microkitty, any suggestions?
Training command and resulting error:
$ CUDA_VISIBLE_DEVICES=0 python train.py experiments/seg_detector/ic15_mobilenet_v3_small_thre.yaml --num_gpus 1
[INFO] [2021-06-10 16:08:49,621] Training epoch 0
Traceback (most recent call last):
  File "train.py", line 70, in <module>
    main()
  File "train.py", line 67, in main
    trainer.train()
  File "/home/mroos/Code/gatekeeper_differentiable_binarization/trainer.py", line 86, in train
    epoch=epoch, step=self.steps)
  File "/home/mroos/Code/gatekeeper_differentiable_binarization/trainer.py", line 109, in train_step
    results = model.forward(batch, training=True)
  File "/home/mroos/Code/gatekeeper_differentiable_binarization/structure/model.py", line 56, in forward
    pred = self.model(data, training=self.training)
  File "/home/mroos/python_envs/env_torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mroos/python_envs/env_torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/mroos/python_envs/env_torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mroos/Code/gatekeeper_differentiable_binarization/structure/model.py", line 19, in forward
    return self.decoder(self.backbone(data), *args, **kwargs)
  File "/home/mroos/python_envs/env_torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mroos/Code/gatekeeper_differentiable_binarization/backbones/mobilenetv3.py", line 211, in forward
    x = self.features[stage](x)
  File "/home/mroos/python_envs/env_torch/lib/python3.6/site-packages/torch/nn/modules/container.py", line 164, in __getitem__
    return self._modules[self._get_abs_string_index(idx)]
  File "/home/mroos/python_envs/env_torch/lib/python3.6/site-packages/torch/nn/modules/container.py", line 154, in _get_abs_string_index
    raise IndexError('index {} is out of range'.format(idx))
IndexError: index 16 is out of range
This is my .yaml file:
import:
    - 'experiments/seg_detector/base_ic15.yaml'
package: []
define:
  - name: 'Experiment'
    class: Experiment
    structure:
        class: Structure
        builder:
            class: Builder
            model: SegDetectorModel
            model_args:
                backbone: mobilenet_v3_small
                decoder: SegDetector
                decoder_args:
                    adaptive: True
                    in_channels: [24, 40, 112, 960]
                    k: 50
                loss_class: L1BalanceCELoss
        representer:
            class: SegDetectorRepresenter
            max_candidates: 1000
        measurer:
            class: QuadMeasurer
        visualizer:
            class: SegDetectorVisualizer
    train:
        class: TrainSettings
        data_loader:
            class: DataLoader
            dataset: ^train_data
            batch_size: 8
            num_workers: 4
        checkpoint:
            class: Checkpoint
            start_epoch: 0
            start_iter: 0
            resume: null
        model_saver:
            class: ModelSaver
            dir_path: model
            save_interval: 1000
            signal_path: save
        scheduler:
            class: OptimizerScheduler
            optimizer: "SGD"
            optimizer_args:
                lr: 0.007
                momentum: 0.9
                weight_decay: 0.0001
            learning_rate:
                class: DecayLearningRate
                epochs: 1200
        epochs: 1200
    validation: &validate
        class: ValidationSettings
        data_loaders:
            icdar2015:
                class: DataLoader
                dataset: ^validate_data
                batch_size: 1
                num_workers: 16
                collect_fn:
                    class: ICDARCollectFN
        visualize: false
        interval: 1000
        exempt: 1
    logger:
        class: Logger
        verbose: true
        level: info
        log_interval: 1000
    evaluation: *validate
I've discovered part of the problem, but my attempt at fixing it still results in an error. The backbone layers we need to draw from in the Small model are different from those in the Large model.
I made this change to the .yaml file:
in_channels: [24, 40, 96, 576]
For reference, see these values from Table 2 of the original publication, as they appear in mobilenetv3.py:
elif mode == 'small':
    # refer to Table 2 in paper
    mobile_setting = [
        # k, exp, c,  se,     nl,  s,
        [3, 16,  16, True,  'RE', 2],
        [3, 72,  24, False, 'RE', 2],
        [3, 88,  24, False, 'RE', 1],  ### 3
        [5, 96,  40, True,  'HS', 2],
        [5, 240, 40, True,  'HS', 1],
        [5, 240, 40, True,  'HS', 1],  ### 6
        [5, 120, 48, True,  'HS', 1],
        [5, 144, 48, True,  'HS', 1],
        [5, 288, 96, True,  'HS', 2],  ### 9
        [5, 576, 96, True,  'HS', 1],
        [5, 576, 96, True,  'HS', 1],
    ]
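As a sanity check on the new in_channels values, here is a minimal sketch that reads the channel width at each tapped index straight from the table. It assumes (the thread does not confirm this) that features[0] is the stem conv, that features[1] onward hold one block per mobile_setting row, and that the module right after the last block is a 1x1 conv with 576 output channels. SMALL_SETTING, LAST_CONV_CHANNELS and channels_at are illustrative names, not part of the repository.

```python
# Rows copied from the 'small' mobile_setting in the post: k, exp, c, se, nl, s
SMALL_SETTING = [
    [3, 16, 16, True, 'RE', 2],
    [3, 72, 24, False, 'RE', 2],
    [3, 88, 24, False, 'RE', 1],
    [5, 96, 40, True, 'HS', 2],
    [5, 240, 40, True, 'HS', 1],
    [5, 240, 40, True, 'HS', 1],
    [5, 120, 48, True, 'HS', 1],
    [5, 144, 48, True, 'HS', 1],
    [5, 288, 96, True, 'HS', 2],
    [5, 576, 96, True, 'HS', 1],
    [5, 576, 96, True, 'HS', 1],
]
LAST_CONV_CHANNELS = 576  # assumption: width of the 1x1 conv after the last block


def channels_at(setting, index, last_conv_channels):
    """Output channels of features[index], under the layout assumed above."""
    if 1 <= index <= len(setting):
        return setting[index - 1][2]   # block rows start at features[1]
    if index == len(setting) + 1:
        return last_conv_channels      # assumed 1x1 conv right after the blocks
    raise IndexError(f"no channel information for features[{index}]")


taps = (3, 6, 9, 12)
in_channels = [channels_at(SMALL_SETTING, i, LAST_CONV_CHANNELS) for i in taps]
print(in_channels)  # [24, 40, 96, 576]
```

Under these assumptions the widths at indices 3, 6, 9 and 12 agree with the edited in_channels list. The sketch only checks channel counts; it says nothing about spatial resolution.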
I then saved the mode ('small' or 'large') as a MobileNetV3 instance attribute, and used it in .forward() to select different backbone output layers for the Small and Large models.
def forward(self, x):
    '''x = self.features(x)
    x = x.mean(3).mean(2)
    x = self.classifier(x)
    return x'''
    if self.mode == 'large':
        x2, x3, x4, x5 = None, None, None, None
        for stage in range(17):  # https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.1/ppocr/modeling/backbones/det_mobilenet_v3.py
            x = self.features[stage](x)
            if stage == 3:  # if s == 2 and start_idx > 3
                x2 = x
            elif stage == 6:
                x3 = x
            elif stage == 12:
                x4 = x
            elif stage == 16:
                x5 = x
        return x2, x3, x4, x5
    elif self.mode == 'small':
        x2, x3, x4, x5 = None, None, None, None
        for stage in range(13):  # https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.1/ppocr/modeling/backbones/det_mobilenet_v3.py
            x = self.features[stage](x)
            if stage == 3:  # if s == 2 and start_idx > 3
                x2 = x
            elif stage == 6:
                x3 = x
            elif stage == 9:
                x4 = x
            elif stage == 12:
                x5 = x
        return x2, x3, x4, x5
    else:
        raise NotImplementedError
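The two branches above differ only in their tap indices, so the same logic could be written once with a lookup table. The following is a minimal sketch of that idea using plain Python callables as stand-ins for the real layers; TAP_STAGES and forward_taps are hypothetical names, and the indices are copied unchanged from the code above. It only restructures the loop and does not change which stages are tapped.

```python
# Tap indices copied from the forward() in the post, keyed by mode
TAP_STAGES = {
    'large': (3, 6, 12, 16),
    'small': (3, 6, 9, 12),
}


def forward_taps(features, mode, x):
    """Run `features` in order up to the last tap and return the tapped outputs."""
    if mode not in TAP_STAGES:
        raise NotImplementedError(mode)
    taps = TAP_STAGES[mode]
    outputs = []
    for stage in range(taps[-1] + 1):
        x = features[stage](x)
        if stage in taps:
            outputs.append(x)
    return tuple(outputs)


# Stand-in "layers": each one appends its own index to a list, so the
# returned values show exactly which stages ran before each tap.
features = [lambda x, i=i: x + [i] for i in range(17)]
x2, x3, x4, x5 = forward_taps(features, 'small', [])
print(x2)  # [0, 1, 2, 3]
print(x5)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

In the real module, `features` would be `self.features` and `mode` would be `self.mode`.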
But now I get an error indicating that one of the feature maps is 2x larger than expected at an upsample-and-sum step.
$ CUDA_VISIBLE_DEVICES=0 python train.py experiments/seg_detector/ic15_mobilenet_v3_small_thre.yaml --num_gpus 1
[INFO] [2021-06-10 17:29:43,453] Training epoch 0
Traceback (most recent call last):
  File "train.py", line 70, in <module>
    main()
  File "train.py", line 67, in main
    trainer.train()
  File "/home/mroos/Code/gatekeeper_differentiable_binarization/trainer.py", line 86, in train
    epoch=epoch, step=self.steps)
  File "/home/mroos/Code/gatekeeper_differentiable_binarization/trainer.py", line 109, in train_step
    results = model.forward(batch, training=True)
  File "/home/mroos/Code/gatekeeper_differentiable_binarization/structure/model.py", line 56, in forward
    pred = self.model(data, training=self.training)
  File "/home/mroos/python_envs/env_torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mroos/python_envs/env_torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 159, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/mroos/python_envs/env_torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mroos/Code/gatekeeper_differentiable_binarization/structure/model.py", line 19, in forward
    return self.decoder(self.backbone(data), *args, **kwargs)
  File "/home/mroos/python_envs/env_torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mroos/Code/gatekeeper_differentiable_binarization/decoders/seg_detector.py", line 124, in forward
    out4 = self.up5(in5) + in4 # 1/16
RuntimeError: The size of tensor a (40) must match the size of tensor b (20) at non-singleton dimension 3
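The failing line adds an upsampled in5 to in4, which suggests the decoder expects each tapped feature map to be half the resolution of the previous one. A minimal sketch of how to check that from the stride column of the 'small' mobile_setting follows. It assumes features[1] onward hold one block per row and that any module after the blocks (such as the final 1x1 conv) has stride 1; relative_downsampling is a hypothetical helper, not part of the repository.

```python
# Rows copied from the 'small' mobile_setting in the post: k, exp, c, se, nl, s
SMALL_SETTING = [
    [3, 16, 16, True, 'RE', 2],
    [3, 72, 24, False, 'RE', 2],
    [3, 88, 24, False, 'RE', 1],
    [5, 96, 40, True, 'HS', 2],
    [5, 240, 40, True, 'HS', 1],
    [5, 240, 40, True, 'HS', 1],
    [5, 120, 48, True, 'HS', 1],
    [5, 144, 48, True, 'HS', 1],
    [5, 288, 96, True, 'HS', 2],
    [5, 576, 96, True, 'HS', 1],
    [5, 576, 96, True, 'HS', 1],
]


def relative_downsampling(setting, taps):
    """Downsampling factor of each tap relative to the tap before it."""
    # Stride applied by features[i] for the block indices 1..len(setting).
    # Indices past the blocks are assumed to have stride 1.
    stride = {i + 1: row[5] for i, row in enumerate(setting)}
    factors = []
    for prev, cur in zip(taps, taps[1:]):
        factor = 1
        for i in range(prev + 1, cur + 1):
            factor *= stride.get(i, 1)
        factors.append(factor)
    return factors


print(relative_downsampling(SMALL_SETTING, [3, 6, 9, 12]))  # [2, 2, 1]
```

Under these assumptions, stages 9 and 12 come out at the same resolution. If up5 upsamples by 2, its output would then be twice as wide as in4, which is consistent with the 40 versus 20 in the error message. Note that dimension 3 of an NCHW tensor is width, not channels.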
The Large model is downsampled by 8 after step 3, whereas the Small model is downsampled by 4. You need to fix either "mobile_setting" or the seg_decoder "forward".
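A related option, not proposed in the thread and untested against this repository, is to leave mobile_setting and the decoder alone and instead tap backbone stages that each halve the resolution of the previous tap. The sketch below picks such stages by walking backwards through the stride column and tapping the stage just before each stride-2 block. It assumes features[1] onward hold one block per row and that features[12] is a stride-1 1x1 conv with 576 output channels; pick_taps and tap_channels are hypothetical helpers.

```python
# Rows copied from the 'small' mobile_setting in the post: k, exp, c, se, nl, s
SMALL_SETTING = [
    [3, 16, 16, True, 'RE', 2],
    [3, 72, 24, False, 'RE', 2],
    [3, 88, 24, False, 'RE', 1],
    [5, 96, 40, True, 'HS', 2],
    [5, 240, 40, True, 'HS', 1],
    [5, 240, 40, True, 'HS', 1],
    [5, 120, 48, True, 'HS', 1],
    [5, 144, 48, True, 'HS', 1],
    [5, 288, 96, True, 'HS', 2],
    [5, 576, 96, True, 'HS', 1],
    [5, 576, 96, True, 'HS', 1],
]
LAST_CONV_CHANNELS = 576  # assumption: width of the 1x1 conv after the last block


def pick_taps(setting, num_taps=4):
    """Tap the final stage, then the stage just before each stride-2 block."""
    last = len(setting) + 1              # assumed index of the final 1x1 conv
    taps = [last]
    for i in range(len(setting), 0, -1):  # block indices, last to first
        if len(taps) == num_taps:
            break
        if setting[i - 1][5] == 2:        # features[i] halves the resolution
            taps.append(i - 1)            # so features[i - 1] ends a scale
    return taps[::-1]


def tap_channels(setting, taps, last_conv_channels):
    """Channel width at each tap, under the layout assumed above."""
    return [
        setting[i - 1][2] if i <= len(setting) else last_conv_channels
        for i in taps
    ]


taps = pick_taps(SMALL_SETTING)
print(taps)                                                   # [1, 3, 8, 12]
print(tap_channels(SMALL_SETTING, taps, LAST_CONV_CHANNELS))  # [16, 24, 48, 576]
```

With these taps, in_channels in the .yaml file would have to become [16, 24, 48, 576] to match. Whether this trains to a useful accuracy is not something the thread establishes.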