
Error when training on ScanNet dataset

Open bityigoss opened this issue 3 years ago • 6 comments

I followed the "Training on ScanNet" section of the README. In phase 1 (the first 0-20 epochs), which trains on single fragments with MODEL.FUSION.FUSION_ON=False and MODEL.FUSION.FULL=False, training fails with the following error.

Traceback (most recent call last):
  File "main.py", line 301, in <module>
    train()
  File "main.py", line 205, in train
    loss, scalar_outputs = train_sample(sample)
  File "main.py", line 281, in train_sample
    outputs, loss_dict = model(sample)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/neuralrecon/neucon/models/neuralrecon.py", line 87, in forward
    outputs, loss_dict = self.neucon_net(features, inputs, outputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/neuralrecon/neucon/models/neucon_network.py", line 158, in forward
    feat = self.sp_convs[i](...)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/neuralrecon/neucon/models/modules.py", line 156, in forward
    x2 = self.stage2(x1)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/neuralrecon/neucon/models/modules.py", line 25, in forward
    out = self.net(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torchsparse/nn/modules/norm.py", line 13, in forward
    return fapply(input, super().forward)
  File "/opt/conda/lib/python3.7/site-packages/torchsparse/nn/utils/apply.py", line 12, in fapply
    feats = fn(input.feats, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 136, in forward
    self.weight, self.bias, bn_training, exponential_average_factor, self.eps)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2054, in batch_norm
    _verify_batch_size(input.size())
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2037, in _verify_batch_size
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 32])

bityigoss avatar Jan 04 '22 12:01 bityigoss
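For context on the trace above: the exception comes from an ordinary BatchNorm check. In training mode BatchNorm cannot compute statistics from a single row, and torchsparse's BatchNorm applies a regular torch BatchNorm to the sparse tensor's feature matrix (input.feats in the trace), so a fragment that ends up with only one active voxel at some scale yields a [1, 32] feature matrix and exactly this error. A minimal reproduction outside NeuralRecon:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(32)   # same channel count as in the error message
bn.train()                # the check only applies in training mode

bn(torch.randn(4, 32))    # fine: 4 feature rows ("voxels"), 32 channels
bn(torch.randn(1, 32))    # ValueError: Expected more than 1 value per channel when training,
                          #             got input size torch.Size([1, 32])
```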

I have the same problem. Increase the batch size when training without GRU fusion.

florinhegedus avatar Feb 23 '22 07:02 florinhegedus

@bityigoss did you solve it? @florinhegedus I increased the batch size, but I still hit the problem.

MrCrazyCrab avatar Feb 28 '22 14:02 MrCrazyCrab

To clarify what my problem was: after one or two epochs of training I got the error on a specific data point (training stopped at the same iteration step every time I tried). I was training with batch_size=1, and changing it to batch_size=2 helped me avoid the problem; I haven't encountered it since. I'm currently 10 epochs into training. I don't really know what the underlying problem was, or whether it is actually related to the batch size.

> @bityigoss did you solve it? @florinhegedus I increased the batch size, but I still hit the problem.

florinhegedus avatar Feb 28 '22 15:02 florinhegedus
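To add some detail to the batch-size observation: batching two samples together makes it much less likely that the sparse feature matrix reaching a BatchNorm layer has only a single row, which is why batch_size=2 hides the crash. If you would rather not rely on the batch size, one hypothetical guard (a sketch only, not how NeuralRecon handles it) is a BatchNorm variant that falls back to its running statistics when it sees a degenerate single-row batch; skipping or filtering the offending fragment may well be the cleaner fix:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SafeBatchNorm1d(nn.BatchNorm1d):
    """Sketch of a workaround: use running statistics instead of batch
    statistics when the incoming feature matrix has a single row, so a
    fragment with one active voxel does not abort training."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and x.shape[0] <= 1:
            # Degenerate batch: normalize with the running estimates.
            return F.batch_norm(
                x, self.running_mean, self.running_var,
                self.weight, self.bias,
                training=False, momentum=self.momentum, eps=self.eps)
        return super().forward(x)
```

Since torchsparse's BatchNorm just calls a torch BatchNorm on input.feats (visible in the traceback), wiring something like this in would mean swapping the norm layers built in models/modules.py, so treat it as an illustration of the idea rather than a drop-in patch.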

@florinhegedus thanks, maybe the situation is a little different from mine.

MrCrazyCrab avatar Mar 01 '22 00:03 MrCrazyCrab

> @bityigoss did you solve it? @florinhegedus I increased the batch size, but I still hit the problem.

Sorry, I didn't find a solution

bityigoss avatar Mar 03 '22 08:03 bityigoss

> I was training with batch_size=1 and changing it to batch_size=2

Correct me if I am wrong, but I guess this problem is caused by an insufficient number of fragments during training in the second phase. It is probably worth making sure that FUSION_ON actually takes effect in forward().

HaFred avatar May 30 '22 03:05 HaFred
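One way to check that hypothesis without waiting for the crash is to flag the exact layer and iteration where a single-row feature matrix shows up. The sketch below is hypothetical debugging code (plain PyTorch, not part of this repo): it registers forward pre-hooks on every BatchNorm layer, including torchsparse's, which per the traceback subclasses torch's BatchNorm, and prints a warning when a degenerate batch is about to be normalized, so you can see which fragment or phase triggers it and whether FUSION_ON is in effect at that point:

```python
import torch.nn as nn

def add_degenerate_batch_probes(model: nn.Module) -> None:
    """Hypothetical debugging aid: warn whenever a BatchNorm layer in
    training mode is about to receive a feature matrix with <= 1 row."""

    def make_hook(layer_name):
        def hook(module, inputs):
            x = inputs[0]
            # torchsparse tensors carry their features in .feats;
            # plain tensors are used as-is.
            feats = getattr(x, "feats", x)
            if module.training and feats.shape[0] <= 1:
                print(f"[degenerate batch] {layer_name}: "
                      f"feature matrix of shape {tuple(feats.shape)}")
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.register_forward_pre_hook(make_hook(name))
```

Calling add_degenerate_batch_probes(model) once after the model is built, and logging the current fragment/scene id in the training loop, should show whether the failing iterations correspond to the under-populated fragments suspected above.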