Questions, Training with your own dataset..

Open edwardcho opened this issue 2 years ago • 6 comments

Hello Sir,

I am interested in image-to-image translation.

Question 1. I trained your code on your sample data (Cityscapes) without using 'train_inst'. The results were strange: the images were blurry, and I couldn't reproduce your results. How can I get your results?

Question 2. I want to train on my own dataset. My images are 512 x 512 (grayscale). I tried to train following your script (script/train_512p.sh), but I got the following error.

...
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [347,0,0], thread: [62,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [347,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "train.py", line 71, in <module>
    Variable(data['image']), Variable(data['feat']), infer=save_fake)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data1/TESTBOARD/additional_networks/generation/pix2pixHD_NVIDIA/models/pix2pixHD_model.py", line 165, in forward
    fake_image = self.netG.forward(input_concat)
  File "/data1/TESTBOARD/additional_networks/generation/pix2pixHD_NVIDIA/models/networks.py", line 211, in forward
    return self.model(input)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 440, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f55de791a22 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10aa3 (0x7f55de9f2aa3 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f55de9f4147 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f55de77b5a4 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa2f382 (0x7f56835a0382 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa2f421 (0x7f56835a0421 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #19: __libc_start_main + 0xe7 (0x7f569705fb97 in /lib/x86_64-linux-gnu/libc.so.6)

What should I check to solve this problem?

Thanks. Edward Cho.

edwardcho avatar Feb 03 '22 05:02 edwardcho

I have the same question....

tonylin52 avatar May 06 '22 02:05 tonylin52

@edwardcho I have the same problem. Did you solve this?

daeunni avatar Aug 01 '22 09:08 daeunni

This error may be caused by one of the following reasons:

Insufficient GPU memory. If the GPU memory is not enough, it may cause CUDA internal errors. Make sure your GPU memory is sufficient, and try reducing batch size or input image resolution.

Incompatible CUDA version. Make sure your CUDA version is compatible with your PyTorch version. If the CUDA version is incompatible, it may cause CUDA internal errors. You can check the PyTorch documentation to determine which CUDA versions are compatible with which PyTorch versions.

Installation issues. Make sure you have installed PyTorch and related dependencies correctly. You can try reinstalling PyTorch and related dependencies, or try installing using conda or pip.

Model or dataset issues. This error may be caused by issues in the model or dataset. Make sure your model and dataset are correct, and there are no missing files or folders.
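To narrow down which of these causes applies, one option is to follow the hint printed in the trace (CUDA_LAUNCH_BLOCKING=1) and to print the installed versions. The following is only a minimal sketch: the helper name describe_environment is hypothetical, and the torch import is optional so that the snippet also runs on a machine without PyTorch.

```python
import os

# Must be set before CUDA is initialized, so do it before importing torch.
# With blocking launches, the Python stack trace points at the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch
except ImportError:  # allow the sketch to run without PyTorch installed
    torch = None


def describe_environment():
    """Collect the settings that are relevant to the checks listed above."""
    info = {"CUDA_LAUNCH_BLOCKING": os.environ.get("CUDA_LAUNCH_BLOCKING")}
    if torch is not None:
        info["torch"] = torch.__version__
        info["cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built with
        info["cuda_available"] = torch.cuda.is_available()
    return info


print(describe_environment())
```

If the versions look consistent and the error persists, the "index out of bounds" assertion at the top of the trace is probably the more useful lead than the later cuDNN message.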

HyperSimon avatar Apr 11 '23 07:04 HyperSimon

Hello, I ran into this issue and solved it. The official documentation says: "If your input is not a label map, please just specify --label_nc 0 which will directly use the RGB colors as input. The folders should then be named train_A, train_B instead of train_label, train_img, where the goal is to translate images from A to B." This means the label values must be integers. If your labels are not integer label maps, use --label_nc 0.
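The "index out of bounds" assertion in the trace is consistent with label pixels that fall outside the range [0, label_nc). A quick way to test this on your own data is sketched below. It is an illustration only: find_invalid_labels is a hypothetical helper, the pixel list is made-up sample data, and label_nc=35 is just an example value. In practice you would pass the flattened pixel values of one of your label images.

```python
def find_invalid_labels(pixels, label_nc):
    """Return the sorted pixel values that cannot serve as class indices.

    A value is valid only if it is a whole number in the range [0, label_nc).
    """
    invalid = set()
    for value in pixels:
        is_whole_number = float(value).is_integer()
        if not is_whole_number or not (0 <= value < label_nc):
            invalid.add(value)
    return sorted(invalid)


# Hypothetical flattened label image: an ordinary grayscale picture
# typically contains values up to 255, far beyond a small class count.
sample_pixels = [0, 1, 7, 34, 35, 128, 255]
print(find_invalid_labels(sample_pixels, label_nc=35))  # prints [35, 128, 255]
```

If this check reports any values for your images, they are ordinary images rather than label maps, and --label_nc 0 with train_A / train_B folders is the setting described in the quoted documentation.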

Forgetmypass avatar Jun 01 '23 09:06 Forgetmypass

If I want to add a label map to the translation from image A to image B, what do I need to do? Looking forward to your response.

masonghao1 avatar Nov 08 '23 10:11 masonghao1

I met the same problem. I then found that my dataset consisted entirely of JPEG images, so I set --label_nc 0, which solved it.
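JPEG is a lossy format, so it cannot reliably store integer class IDs. Finding JPEG files in a label folder is therefore a quick sign that the inputs are plain images. A minimal sketch of such a check follows. The helper name find_lossy_label_files is hypothetical, and the demo uses a temporary folder with empty files instead of a real dataset.

```python
import tempfile
from pathlib import Path

LOSSY_SUFFIXES = {".jpg", ".jpeg"}


def find_lossy_label_files(label_dir):
    """Return the sorted names of files in label_dir that have a JPEG extension."""
    return sorted(
        path.name
        for path in Path(label_dir).iterdir()
        if path.suffix.lower() in LOSSY_SUFFIXES
    )


# Demo on a throwaway folder; point label_dir at your own label folder instead.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.png", "b.JPG", "c.jpeg"):
        (Path(tmp) / name).touch()
    print(find_lossy_label_files(tmp))  # prints ['b.JPG', 'c.jpeg']
```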

PuKuang avatar Mar 02 '24 11:03 PuKuang