Error during evaluation with 1 GPU, and error when training with multiple GPUs
Hi, thanks for this contribution! As a small exercise I am training SD2 on the pokemon dataset. I precomputed the latents, and training starts fine on one GPU. However, at evaluation time I get the following error:
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2814, in _eval_loop
self.state.outputs = self._original_model.eval_forward(self.state.batch)
File "/fsx_vfx/users/csegalin/code/diffusion/diffusion/models/stable_diffusion.py", line 255, in eval_forward
gen_images = self.generate(tokenized_prompts=prompts,
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/fsx_vfx/users/csegalin/code/diffusion/diffusion/models/stable_diffusion.py", line 464, in generate
pred = self.unet(latent_model_input,
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 934, in forward
sample = self.conv_in(sample)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (162 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size`
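For what it's worth, if I read the error right: conv_in in the SD2 UNet uses padding=1, so a padded size of (162 x 2) implies the unpadded input was 160 x 0, i.e. the tensor reaching the UNet has an empty width dimension rather than a valid (B, 4, H/8, W/8) latent. A minimal repro of just the conv failure (the shapes are my inference, not taken from the repo):

```python
import torch
import torch.nn as nn

# The SD2 UNet's conv_in is Conv2d(4, 320, kernel_size=3, padding=1).
conv = nn.Conv2d(4, 320, kernel_size=3, padding=1)

# A padded size of (162 x 2) with padding=1 on each side implies an
# unpadded input of (160 x 0), i.e. the width dimension is empty.
x = torch.randn(1, 4, 160, 0)
conv(x)  # RuntimeError: ... Kernel size can't be greater than actual input size
```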
This is my configuration:
```yaml
name: trial0  # Insert wandb run name
project: pokemon_sd2_256  # Insert wandb project name
seed: 17
eval_first: false
algorithms:
  low_precision_groupnorm:
    attribute: unet
    precision: amp_fp16
  low_precision_layernorm:
    attribute: unet
    precision: amp_fp16
model:
  _target_: diffusion.models.models.stable_diffusion_2
  pretrained: false
  precomputed_latents: true
  encode_latents_in_fp16: true
  fsdp: true
  val_metrics:
    - _target_: torchmetrics.MeanSquaredError
    - _target_: torchmetrics.image.fid.FrechetInceptionDistance
      normalize: true
  val_guidance_scales: [3, 7]
  # val_guidance_scales: []
  loss_bins: []
dataset:
  train_batch_size: 1  # Global training batch size
  eval_batch_size: 1  # Global evaluation batch size
  train_dataset:
    _target_: diffusion.datasets.pokemon.pokemon.build_streaming_dataloader
    # Path to object store bucket(s)
    local: /fsx_vfx/users/csegalin/data/pokemon/latents2_train
    # Path to corresponding local dataset(s)
    mode: 0
    version: 2
    drop_last: False
    shuffle: true
    prefetch_factor: 2
    num_workers: 8
    persistent_workers: true
    pin_memory: true
  eval_dataset:
    _target_: diffusion.datasets.pokemon.pokemon.build_streaming_dataloader
    local: /fsx_vfx/users/csegalin/data/pokemon/latents2_eval  # Path to local dataset cache
    prefetch_factor: 2
    num_workers: 8
    persistent_workers: True
    pin_memory: True
    mode: 0
    version: 2
optimizer:
  _target_: torch.optim.AdamW
  lr: 1.0e-5
  weight_decay: 0.01
scheduler:
  _target_: composer.optim.LinearWithWarmupScheduler
  t_warmup: 1000ba
  alpha_f: 1.0
logger:
  comet-ml:
    _target_: composer.loggers.cometml_logger.CometMLLogger
    name: ${name}
    project_name: ${project}
callbacks:
  speed_monitor:
    _target_: composer.callbacks.speed_monitor.SpeedMonitor
    window_size: 10
  lr_monitor:
    _target_: composer.callbacks.lr_monitor.LRMonitor
  memory_monitor:
    _target_: composer.callbacks.memory_monitor.MemoryMonitor
  runtime_estimator:
    _target_: composer.callbacks.runtime_estimator.RuntimeEstimator
  optimizer_monitor:
    _target_: composer.callbacks.OptimizerMonitor
  image_monitor:
    _target_: diffusion.callbacks.log_diffusion_images.LogDiffusionImages
    prompts:  # add any prompts you would like to visualize
      - cute dragon creature
    size: 256  # generated image resolution
    guidance_scale: 3
trainer:
  _target_: composer.Trainer
  device: gpu
  max_duration: 550000ba
  eval_interval: 1000ba
  device_train_microbatch_size: 1
  run_name: ${name}
  seed: ${seed}
  save_folder: trained_model  # Insert path to save folder or bucket
  save_interval: 3000ba
  save_overwrite: true
  autoresume: false
  # fsdp_config:
  #   sharding_strategy: "SHARD_GRAD_OP"
```
I think this is related to the FID metric, since if I remove it everything works.
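From the traceback, FID appears to be what triggers the image-generation path in eval_forward (with MSE alone there is no crash, as noted above). The metric itself seems fine in isolation; here is a minimal sketch of how I understand torchmetrics' FID with normalize=true, just to exercise the API:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# With normalize=True the inputs are floats in [0, 1] of shape (N, 3, H, W);
# without it, uint8 images in [0, 255] are expected.
fid = FrechetInceptionDistance(normalize=True)

real = torch.rand(8, 3, 256, 256)
fake = torch.rand(8, 3, 256, 256)
fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute())  # value is meaningless for 8 random images; this only checks shapes/dtypes
```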
When I try to train on a multi-GPU machine (setting fsdp back to true, uncommenting the last two lines of the config, and adjusting the batch size accordingly), I get this error:
```
ValueError: The world_size(2) > 1 but dataloader does not use DistributedSampler. This will cause all ranks to train on the same data, removing any benefit from multi-GPU training. To resolve this, create a Dataloader with DistributedSampler. For example, DataLoader(..., sampler=composer.utils.dist.get_sampler(...)). Alternatively, the process group can be instantiated with composer.utils.dist.instantiate_dist(...) and DistributedSampler can directly be created with DataLoader(..., sampler=DistributedSampler(...)). For more information, see https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler.
```
I don't see a DistributedSampler being created in the laion or coco dataloader builder functions either.
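If I follow the error message, the fix would be to pass Composer's distributed sampler when building the dataloader, something along these lines (the TensorDataset is a stand-in for whatever the builder function actually constructs):

```python
import torch
from composer.utils import dist
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; the real one would be the streaming pokemon dataset.
dataset = TensorDataset(torch.randn(16, 4, 32, 32))

# dist.get_sampler returns a DistributedSampler wired to the current
# world size / rank, so each rank sees a distinct shard of the data.
sampler = dist.get_sampler(dataset, shuffle=True, drop_last=False)

dataloader = DataLoader(
    dataset,
    batch_size=1,
    sampler=sampler,
    num_workers=8,
    pin_memory=True,
    persistent_workers=True,
)
```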