[CLI]: PIL.UnidentifiedImageError using wandb logger with pytorch lightning

Open denisbeslic opened this issue 2 years ago • 3 comments

Describe the bug

Using wandb as the logger for my PyTorch Lightning trainer causes a PIL.UnidentifiedImageError, although the images are not corrupted. I added a manual check for corrupted images, but the error still occurs. It sometimes happens at the start of training and sometimes only after 20 epochs.

This function is called during the validation loop:

import os

import PIL
from PIL import Image


def generate_validation_plots(self, prediction, targets, data, bs=12):
    if bs > targets.shape[0]:
        bs = targets.shape[0]
    targets = targets[:bs, :]
    prediction = prediction[:bs, :]
    data = data[:bs, :]

    log_dir = "logs-" + self.config["log_name"]
    
    batch_dir = os.path.join(log_dir, f"epoch_{self.current_epoch}")

    # Check if images are corrupted
    for filename in os.listdir(batch_dir):
        try:
            image = Image.open(os.path.join(batch_dir, filename))
        except PIL.UnidentifiedImageError as e:
            print(f"Error in file {filename}: {e}")
            os.remove(os.path.join(batch_dir, filename))
            print(f"Removed file {filename}")

    images_l = os.listdir(batch_dir)
    images_path = [os.path.join(batch_dir, img) for img in images_l]

    # For now deactivate since it caused problems
    self.logger.log_image(
        key="sample_images",
        images=images_path,
        caption=[
            i.split(".png")[0] for i in images_l
        ],
    )


Sanity Checking: |          | 0/? [00:00<?, ?it/s]seq2signal INFO 09:11:17: True Validation dataset size 1054710
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
  File "xx/Project-003-Nanopore-Simulator/Nanopore-Simulator/SSS-newmodel/src/seq2signal/seq2signal.py", line 263, in <module>
    main()
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx.conda/envs/seq2signal/lib/python3.11/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "xx.conda/envs/seq2signal/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx.conda/envs/seq2signal/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx.conda/envs/seq2signal/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/seq2signal.py", line 142, in train
    train_run(npy_dir, config, model)
  File "xx/train.py", line 70, in train_run
    trainer.fit(model, poredata)
  File "xx/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "xx/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
    self._run_sanity_check()
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1062, in _run_sanity_check
    val_loop.run()
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 134, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 391, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 402, in validation_step
    return self._forward_redirection(self.model, self.lightning_module, "validation_step", *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 633, in __call__
    wrapper_output = wrapper_module(*args, **kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx.conda/envs/seq2signal/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/beslicd/.conda/envs/seq2signal/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 626, in wrapped_forward
    out = method(*_args, **_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/model.py", line 164, in validation_step
    generate_validation_plots(self, prediction, targets, data)
  File "xx/utils.py", line 159, in generate_validation_plots
    self.logger.log_image(
  File "xx/lib/python3.11/site-packages/lightning_utilities/core/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/loggers/wandb.py", line 482, in log_image
    metrics = {key: [wandb.Image(img, **kwarg) for img, kwarg in zip(images, kwarg_list)]}
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/pytorch_lightning/loggers/wandb.py", line 482, in <listcomp>
    metrics = {key: [wandb.Image(img, **kwarg) for img, kwarg in zip(images, kwarg_list)]}
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/wandb/sdk/data_types/image.py", line 177, in __init__
    self._initialize_from_path(data_or_path)
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/wandb/sdk/data_types/image.py", line 276, in _initialize_from_path
    self._image = pil_image.open(path)
                  ^^^^^^^^^^^^^^^^^^^^
  File "xx/.conda/envs/seq2signal/lib/python3.11/site-packages/PIL/Image.py", line 3280, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file 'logs-LargeTrain150-16DNA300Signal-07-DILATE/epoch_0/batch_11.png'
wandb: 🚀 View run LargeTrain150-16DNA300Signal-07-DILATE at: xx
wandb: ️⚡ View job at xx
wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/xx

Additional Files

No response

Environment

WandB version: wandb==0.16.2

OS: Ubuntu 20.04.5 LTS (Focal Fossa)

Python version: 3.11.5

Pytorch: 2.1.2

Pytorch Lightning: 2.1.3

Additional Context

No response

denisbeslic • Feb 06 '24 18:02

I noticed that this error is only thrown when I use multiple GPUs, so I guess it is caused by the interaction between the GPU processes. Is there a way to restrict the logging part, or this whole function, to a single GPU?
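
One possible explanation for the multi-GPU dependence (an assumption, not something confirmed in this thread): under DDP every process runs the validation step, so another rank may still be writing a PNG into the shared epoch directory while rank 0 lists and opens it. A minimal Python sketch of one way to rule that out is to write each file under a temporary name and rename it into place only once it is complete. The helper names `write_bytes_atomically` and `list_finished_pngs` are hypothetical and not part of wandb or Lightning.

```python
import os
import tempfile


def write_bytes_atomically(path: str, payload: bytes) -> None:
    """Write payload to path so that readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the target directory so that os.replace
    # stays on one filesystem, where the rename is atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if the write or the rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def list_finished_pngs(directory: str) -> list:
    """Return finished PNG paths only, skipping in-progress .tmp files."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".png")
    )
```

With this approach the directory listing passed to `log_image` would come from `list_finished_pngs` rather than a bare `os.listdir`, so temporary files are never handed to the logger. How the plots are rendered to bytes is not shown in the issue and is left out here.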

denisbeslic • Feb 15 '24 08:02

Hi @denisbeslic, I think you might be looking for the single-process method described in the distributed training docs.
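
The single-process approach amounts to letting only one rank touch the image directory and the logger. The traceback above passes through `lightning_utilities/core/rank_zero.py`, which suggests that `log_image` itself already runs on rank 0 only, so the guard would probably need to cover the file writing, cleanup, and listing as well. Inside a LightningModule, `self.trainer.is_global_zero` can serve as the condition. The sketch below instead reads the rank from the environment so that it runs without Lightning installed. It assumes the launcher exports `RANK` or `LOCAL_RANK`, and the helper names `global_rank` and `run_on_rank_zero` are hypothetical.

```python
import os
from typing import Any, Callable, Mapping, Optional


def global_rank(env: Optional[Mapping[str, str]] = None) -> int:
    """Best-effort process rank; 0 if no rank variable is set.

    Assumption: the launcher exports RANK (global) or LOCAL_RANK
    (per node). On multi-node runs LOCAL_RANK is 0 once per node.
    """
    env = os.environ if env is None else env
    for key in ("RANK", "LOCAL_RANK"):
        value = env.get(key)
        if value is not None:
            return int(value)
    return 0


def run_on_rank_zero(
    fn: Callable[..., Any],
    *args: Any,
    env: Optional[Mapping[str, str]] = None,
    **kwargs: Any,
) -> Any:
    """Call fn only on rank 0; every other rank returns None."""
    if global_rank(env) != 0:
        return None
    return fn(*args, **kwargs)
```

In `validation_step`, the call would then look like `run_on_rank_zero(generate_validation_plots, self, prediction, targets, data)`, so that only one process writes, checks, and logs the images.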

nate-wandb • Feb 21 '24 20:02

Hey @denisbeslic, were you able to resolve the issue by following @nate-wandb's comment?

ayulockin • Mar 29 '24 16:03