Palette-Image-to-Image-Diffusion-Models
`Caught IndexError in DataLoader worker process 0` using `pip` installations
Setup
Running on Windows Subsystem for Linux 2 (WSL2).
git clone https://github.com/Janspiry/Palette-Image-to-Image-Diffusion-Models.git
cd Palette-Image-to-Image-Diffusion-Models
conda create -n pip-palette python==3.9.*
conda activate pip-palette
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113
pip install -r requirements.txt
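As an optional sanity check after installation (not part of the original steps), you can confirm that the CUDA-enabled wheels were picked up before running anything else:

import torch

# Expect a version string like 1.x+cu113, and True if the GPU is visible to WSL2.
print(torch.__version__)
print(torch.cuda.is_available())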
Config
Same as #21
Directory Structure
Same as #21
Terminal
(pip-palette) sgbaird@Dell-G7:~/GitHub/Palette-Image-to-Image-Diffusion-Models$ cd /home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models ; /usr/bin/env /home/sgbaird/miniconda3/envs/palette/bin/python /home/sgbaird/.vscode-server/extensions/ms-python.python-2022.8.0/pythonFiles/lib/python/debugpy/launcher 36177 -- /home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py -p train -c config/inpainting_celebahq_dummy.json --debug
export CUDA_VISIBLE_DEVICES=0
/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py:28: UserWarning: You have chosen to use cudnn for accleration. torch.backends.cudnn.enabled=True
warnings.warn('You have chosen to use cudnn for accleration. torch.backends.cudnn.enabled=True')
(pip-palette) sgbaird@Dell-G7:~/GitHub/Palette-Image-to-Image-Diffusion-Models$ cd /home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models ; /usr/bin/env /home/sgbaird/miniconda3/envs/pip-palette/bin/python /home/sgbaird/.vscode-server/extensions/ms-python.python-2022.8.0/pythonFiles/lib/python/debugpy/launcher 41379 -- /home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py -p train -c config/inpainting_celebahq_dummy.json --debug
export CUDA_VISIBLE_DEVICES=0
/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py:28: UserWarning: You have chosen to use cudnn for accleration. torch.backends.cudnn.enabled=True
warnings.warn('You have chosen to use cudnn for accleration. torch.backends.cudnn.enabled=True')
0%| | 0/16 [00:00<?, ?it/s]
Close the Tensorboard SummaryWriter.
Error
Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/dataset.py", line 471, in __getitem__
return self.dataset[self.indices[idx]]
File "/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/data/dataset.py", line 54, in __getitem__
path = self.imgs[index]
IndexError: list index out of range
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
raise exception
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
return self._process_data(data)
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/models/model.py", line 106, in train_step
for train_data in tqdm.tqdm(self.phase_loader):
File "/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/core/base_model.py", line 45, in train
train_log = self.train_step()
File "/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py", line 58, in main_worker
model.train()
File "/home/sgbaird/GitHub/Palette-Image-to-Image-Diffusion-Models/run.py", line 92, in <module>
main_worker(0, 1, opt)
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/runpy.py", line 268, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sgbaird/miniconda3/envs/pip-palette/lib/python3.9/runpy.py", line 197, in _run_module_as_main (Current frame)
return _run_code(code, main_globals, None,
https://stackoverflow.com/a/62550189/13697228 mentions that the data length needs to be divisible by batch_size. I changed batch_size to 1 everywhere, but the same issue occurs.
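One quick way to rule out partial-batch problems without touching the repo (a minimal sketch using a toy dataset, not the repo's actual loader): torch.utils.data.DataLoader accepts drop_last=True, which discards the final incomplete batch so every batch a worker fetches has exactly batch_size samples.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset with 10 samples; batch_size=4 would leave a partial batch of 2.
dataset = TensorDataset(torch.arange(10))

# drop_last=True discards the trailing partial batch; num_workers=0 keeps
# exceptions in the main process, which makes IndexErrors easier to trace.
loader = DataLoader(dataset, batch_size=4, drop_last=True, num_workers=0)
print([batch[0].tolist() for batch in loader])  # [[0, 1, 2, 3], [4, 5, 6, 7]]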
Here's the log:
22-06-09 23:28:39.190 - INFO: Create the log file in directory experiments/debug_inpainting_celebahq_220609_232838.
22-06-09 23:28:39.259 - INFO: Dataset [InpaintDataset() form data.dataset] is created.
22-06-09 23:28:39.260 - INFO: Dataset for train have 48 samples.
22-06-09 23:28:39.260 - INFO: Dataset for val have 2 samples.
22-06-09 23:28:39.780 - INFO: Network [Network() form models.network] is created.
22-06-09 23:28:39.781 - INFO: Network [Network] weights initialize using [kaiming] method.
22-06-09 23:28:40.080 - WARNING: Config is a str, converts to a dict {'name': 'mae'}
22-06-09 23:28:40.459 - INFO: Metric [mae() form models.metric] is created.
22-06-09 23:28:40.459 - WARNING: Config is a str, converts to a dict {'name': 'mse_loss'}
22-06-09 23:28:40.468 - INFO: Loss [mse_loss() form models.loss] is created.
22-06-09 23:28:45.991 - INFO: Beign loading pretrained model [Network] ...
22-06-09 23:28:45.992 - WARNING: Pretrained model in [experiments/train_inpainting_celebahq_220426_233652/checkpoint/190_Network.pth] is not existed, Skip it
22-06-09 23:28:45.992 - INFO: Beign loading pretrained model [Network_ema] ...
22-06-09 23:28:45.992 - WARNING: Pretrained model in [experiments/train_inpainting_celebahq_220426_233652/checkpoint/190_Network_ema.pth] is not existed, Skip it
22-06-09 23:28:46.007 - INFO: Beign loading training states
22-06-09 23:28:46.007 - WARNING: Training state in [experiments/train_inpainting_celebahq_220426_233652/checkpoint/190.state] is not existed, Skip it
22-06-09 23:28:46.018 - INFO: Model [Palette() form models.model] is created.
22-06-09 23:28:46.019 - INFO: Begin model train.
Feel free to reopen the issue if there are any questions.
@Janspiry if you close the issue, the person who originally opened it can't reopen it.
How do you suggest I fix the error `Caught IndexError in DataLoader worker process 0` so that I can actually run the code in this repository? My colleague @hasan-sayeed and I haven't been able to get Palette running at all, despite spending many hours debugging one issue after another.
Sorry for the error; I thought you guys had fixed it. Since the message says Caught IndexError, I suspect that the self.imgs list the dataset is reading may be incorrect. You can try printing this variable. Also, can you show me the file directory and the contents of train.flist?
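For anyone else hitting this, a small standalone check along those lines (a sketch: the flist path below is hypothetical and should be pointed at your actual train.flist, and it assumes each line of the flist is one image path, possibly relative to the repo root):

import os

flist = "datasets/celebahq/flist/train.flist"  # hypothetical path, adjust to yours

with open(flist) as f:
    paths = [line.strip() for line in f if line.strip()]

print(len(paths), "entries in", flist)
print("first few entries:", paths[:3])

# If the flist has fewer entries than the configured train+val split
# (48 + 2 in the log above), the subset indices can run past the end of
# self.imgs, which matches the IndexError in the traceback. Missing files
# would fail later at load time rather than with this IndexError, but are
# worth catching here too.
missing = [p for p in paths if not os.path.isfile(p)]
print(len(missing), "entries do not exist on disk")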
@Janspiry thanks for the response. Will take another look and post back.
Hi @Janspiry @sgbaird. I am facing a similar issue when running the test script. Maybe they are related because of the way in which data indexing is implemented.
92%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 12/13 [1:00:07<05:00, 300.59s/it]
Close the Tensorboard SummaryWriter.
Traceback (most recent call last):
File "/scratch/aniruddha/Palette-Image-to-Image-Diffusion-Models/run.py", line 92, in <module>
main_worker(0, 1, opt)
File "/scratch/aniruddha/Palette-Image-to-Image-Diffusion-Models/run.py", line 60, in main_worker
model.test()
File "/scratch/aniruddha/Palette-Image-to-Image-Diffusion-Models/models/model.py", line 190, in test
self.writer.save_images(self.save_current_results())
File "/scratch/aniruddha/Palette-Image-to-Image-Diffusion-Models/models/model.py", line 87, in save_current_results
ret_path.append('GT_{}'.format(self.path[idx]))
IndexError: list index out of range
I am running the test on 100 images with a batch size of 8. As you can see from the logs, there are 13 batches (12 batches of 8 images and a last batch of 4). The run fails only on the last batch: the line linked below indexes 8 images (the batch size) in the last batch even though there are only 4.
https://github.com/Janspiry/Palette-Image-to-Image-Diffusion-Models/blob/main/models/model.py#L86
The test script runs fine when I use a multiple of 8 images. Could you let me know the easiest fix for this? Thanks.
I was able to solve the problem by getting the number of images in the batch explicitly.
temp_batch_size = len(self.path)  # number of samples actually in this batch
for idx in range(temp_batch_size):
    ret_path.append('GT_{}'.format(self.path[idx]))
    ret_result.append(self.gt_image[idx].detach().float().cpu())
    ret_path.append('Process_{}'.format(self.path[idx]))
    ret_result.append(self.visuals[idx::temp_batch_size].detach().float().cpu())
    ret_path.append('Out_{}'.format(self.path[idx]))
    ret_result.append(self.visuals[idx-temp_batch_size].detach().float().cpu())
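For what it's worth, the reason this works seems to be that len(self.path) reflects how many samples are actually in the current batch, while the original loop bound was the configured batch size; the two differ only on the final partial batch, which is exactly where the out-of-range index appeared.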
@ani0075, thanks for suggesting this. I will fix it asap.
Sorry to bother you, but did you manage to reproduce this code in the end?