I followed the "Training on ScanNet" instructions in the README.
In phase 1 (the first 0-20 epochs), which trains on single fragments with MODEL.FUSION.FUSION_ON=False and MODEL.FUSION.FULL=False, training fails with the following error:
Traceback (most recent call last):
File "main.py", line 301, in
train()
File "main.py", line 205, in train
loss, scalar_outputs = train_sample(sample)
File "main.py", line 281, in train_sample
outputs, loss_dict = model(sample)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/neuralrecon/neucon/models/neuralrecon.py", line 87, in forward
outputs, loss_dict = self.neucon_net(features, inputs, outputs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/neuralrecon/neucon/models/neucon_network.py", line 158, in forward
feat = self.sp_convs[i](feat)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/neuralrecon/neucon/models/modules.py", line 156, in forward
x2 = self.stage2(x1)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/neuralrecon/neucon/models/modules.py", line 25, in forward
out = self.net(x)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torchsparse/nn/modules/norm.py", line 13, in forward
return fapply(input, super().forward)
File "/opt/conda/lib/python3.7/site-packages/torchsparse/nn/utils/apply.py", line 12, in fapply
feats = fn(input.feats, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2054, in batch_norm
_verify_batch_size(input.size())
File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2037, in _verify_batch_size
raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 32])
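For context on where the crash comes from: the ValueError is raised by PyTorch's batch-norm size check, not by NeuralRecon itself. In training mode BatchNorm refuses to compute statistics from a single value per channel, and torchsparse's norm layer simply forwards input.feats, here a single row of 32 features, into nn.BatchNorm1d. A standalone illustration of the failing check (plain PyTorch, independent of this repo):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(32)      # same channel count as in the traceback
bn.train()                   # the size check only runs in training mode

feats = torch.randn(1, 32)   # a single feature row, i.e. torch.Size([1, 32])
try:
    bn(feats)
except ValueError as e:
    print(e)                 # Expected more than 1 value per channel when training, ...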
I had the same problem. Increasing the batch size when training without GRU fusion solved it for me.
@bityigoss did you solve it? @florinhegedus I increased the batch size, but I still hit the problem.
To clarify what my problem was: after one or two epochs of training I got the error on one specific data point (training stopped at the same iteration step every time I tried). I was training with batch_size=1, and changing it to batch_size=2 let me avoid the problem; I haven't encountered it since and am currently 10 epochs into training. I don't really know what the root cause was, or whether it is related to the batch size at all.
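That is consistent with the batch-norm check above: it only needs more than one value per channel, so the workaround most likely helps because with batch_size=2 the voxel features of both fragments in the batch are concatenated before the norm layers, and a degenerate fragment no longer arrives alone. A small sketch of that difference (the single-row fragment is an assumption based on the torch.Size([1, 32]) in the error):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(32).train()

frag_a = torch.randn(1, 32)    # degenerate fragment: a single occupied voxel (assumed)
frag_b = torch.randn(57, 32)   # normal fragment with many occupied voxels

# alone, frag_a fails the size check; batched with a second fragment it passes
batched = torch.cat([frag_a, frag_b], dim=0)
print(bn(batched).shape)       # torch.Size([58, 32])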
@florinhegedus thanks, maybe my situation is a little different from yours.
Sorry, I didn't find a solution
Regarding "I was training with batch_size=1 and changing it to batch_size=2": correct me if I am wrong, but I guess this problem is caused by an insufficient number of fragments during training in the second phase. We probably need to ensure that FUSION_ON works in forward().
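If that guess is right and some fragment's sparse volume collapses to a single occupied voxel, one defensive option is to filter such fragments out before they reach the sparse convolutions. This is a hypothetical sketch, not code from the repo; the .feats attribute is the same one torchsparse's norm layer reads in the traceback above, but where exactly the guard belongs (and what tensor type sp_convs receives) depends on your torchsparse version:

def has_enough_voxels(sparse_tensor, min_rows=2):
    # BatchNorm in training mode needs more than one value per channel,
    # so a sparse tensor with a single occupied voxel is guaranteed to fail.
    return sparse_tensor.feats.shape[0] >= min_rows

# hypothetical placement, e.g. in neucon_network.py right before feat = self.sp_convs[i](feat):
# if not has_enough_voxels(feat):
#     continue   # skip this fragment/scale instead of crashing inside BatchNorm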