
About HR_size and CUDA memory

Open stdinR opened this issue 1 year ago • 0 comments

Hello, if I train on another dataset (HR size = 480*480) that differs from yours, should I change "HR_size": 128 in train_spsr.json? When I change 128 to 480, the training code no longer runs. It fails with the following:

23-06-17 14:54:34.882 - INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
  File "train.py", line 182, in <module>
    main()
  File "train.py", line 105, in main
    model.optimize_parameters(current_step)
  File "/home//SPSR-master/code/models/SPSR_model.py", line 282, in optimize_parameters
    pred_g_fake = self.netD(self.fake_H)
  File "/home//anaconda3/envs/RRSGAN37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(input, **kwargs)
  File "/home//anaconda3/envs/RRSGAN37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home//anaconda3/envs/RRSGAN37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home//anaconda3/envs/RRSGAN37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home//anaconda3/envs/RRSGAN37/lib/python3.7/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home//anaconda3/envs/RRSGAN37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(input, **kwargs)
  File "/home//anaconda3/envs/RRSGAN37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(input, **kwargs)
  File "/home//SPSR-master/code/models/modules/architecture.py", line 247, in forward
    x = self.classifier(x)
  File "/home//anaconda3/envs/RRSGAN37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(input, **kwargs)
  File "/home//anaconda3/envs/RRSGAN37/lib/python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home//anaconda3/envs/RRSGAN37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(input, **kwargs)
  File "/home//anaconda3/envs/RRSGAN37/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x115200 and 8192x100)

Note that the leading 1 in '1x115200' is my batch size. With the default batch size I get a CUDA out-of-memory error, so I had to set batch_size to 1.
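For what it's worth, the two numbers in the error message line up with simple arithmetic. The sketch below assumes the discriminator halves the spatial size five times and ends with 512 channels before the linear layer; both values are inferred from the shapes in the error (8192 = 512 * 4 * 4), not confirmed from architecture.py, and `flattened_features` is a hypothetical helper, not a function in the SPSR code.

```python
def flattened_features(hr_size, channels=512, num_halvings=5):
    """Size of the flattened feature vector fed to the first linear layer.

    ASSUMPTION: the discriminator halves height and width `num_halvings`
    times and outputs `channels` feature maps. These defaults are guesses
    that reproduce the numbers in the error message.
    """
    side = hr_size
    for _ in range(num_halvings):
        side //= 2  # each stride-2 stage halves the spatial size
    return channels * side * side


# The linear layer expects 8192 inputs, which matches a 128x128 input.
print(flattened_features(128))  # 8192
# A 480x480 input gives 115200 features instead, hence the shape mismatch.
print(flattened_features(480))  # 115200
```

If that reading is right, the linear layer is sized for 128x128 inputs, so changing HR_size alone leaves the discriminator's first linear layer with the wrong input dimension.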

If I leave HR_size at its default of 128, the training code runs. But I don't know whether training that way is appropriate for my dataset (HR size = 480*480).
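My guess is that, with HR_size left at 128, the data loader crops random 128x128 patches out of each 480x480 image rather than feeding whole images to the network. That is an inference from the fact that training runs at all, not something I have confirmed in the SPSR data loader. A minimal sketch of what such a crop would look like, using a hypothetical `random_crop_box` helper:

```python
import random


def random_crop_box(img_h, img_w, patch, rng=random):
    """Pick a random patch x patch window inside an img_h x img_w image.

    Hypothetical helper for illustration only; it is not taken from the
    SPSR code. Returns (top, left, bottom, right), where bottom and right
    are exclusive bounds.
    """
    if patch > img_h or patch > img_w:
        raise ValueError("patch size must not exceed the image size")
    top = rng.randint(0, img_h - patch)    # randint is inclusive on both ends
    left = rng.randint(0, img_w - patch)
    return top, left, top + patch, left + patch


# Example: one 128x128 window from a 480x480 image.
top, left, bottom, right = random_crop_box(480, 480, 128, random.Random(0))
print(bottom - top, right - left)  # 128 128
```

Under that assumption every training sample would still be 128x128, which would explain why the default setting runs while 480 does not.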

I'm looking forward to your reply. Thanks!

stdinR • Jun 17 '23 07:06