medicaldiffusion

Warning: data is not aligned! This can lead to a speed loss

Open WhenMelancholy opened this issue 2 years ago • 3 comments

During the training process, I encountered the following warning outputs:

Sanity Checking DataLoader 0:   0%|                                                                                                     | 0/2 [00:00<?, ?it/s][swscaler @ 0x641c700] Warning: data is not aligned! This can lead to a speed loss
[swscaler @ 0x743a880] Warning: data is not aligned! This can lead to a speed loss
Epoch 0:   0%|                                                                                                                        | 0/565 [00:00<?, ?it/s][swscaler @ 0x59d9700] Warning: data is not aligned! This can lead to a speed loss
[swscaler @ 0x6c7f880] Warning: data is not aligned! This can lead to a speed loss

Although it did not affect training, I am unclear about the cause. My training command is as follows:

CUDA_VISIBLE_DEVICES=2 PL_TORCH_DISTRIBUTED_BACKEND=gloo PYTHONPATH=.:$PYTHONPATH python train/train_vqgan.py dataset=mrnet dataset.root_dir="~/github/medicaldiffusion/data/MRNet-v1.0/" model=vq_gan_3d model.gpus=1 model.default_root_dir="~/github/medicaldiffusion/when/checkpoints/vq_gan" model.default_root_dir_postfix="mrnet" model.precision=16 model.embedding_dim=8 model.n_hiddens=16 model.downsample=[4,4,4] model.num_workers=32 model.gradient_clip_val=1.0 model.lr=3e-4 model.discriminator_iter_start=10000 model.perceptual_weight=4 model.image_gan_weight=1 model.video_gan_weight=1 model.gan_feat_weight=4 model.batch_size=2 model.n_codes=16384 model.accumulate_grad_batches=1 

This command is taken from train_vqgan.sh.

Thank you in advance!

WhenMelancholy, Aug 14 '23 08:08

@WhenMelancholy This happened for me as well. As far as I know, this indicates that the number of images in your training data is not evenly divisible by the number of CUDA devices you're training on. This should have only a negligible impact on training as long as you're training on a single server. I believe this is a warning from PyTorch Lightning.

benearnthof, Aug 16 '23 16:08

@benearnthof This happened for me as well. Could you please tell me how to debug it? Is it because the dataset is not divisible by 16?

xiexing0916, Sep 05 '23 15:09

There is no reason to debug anything, as this warning just indicates some minor inefficiencies when scaling images. My prior statement may be incorrect: this most likely stems from one of the image dimensions not being divisible by 16. This should not impact the model, however.
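
For anyone who wants to check this guess against their own data, here is a minimal sketch. The `[swscaler @ ...]` prefix in the log is the tag used by FFmpeg's libswscale, so the message is printed while frames are being scaled or converted rather than by the training loop itself. The multiple of 16 is the assumption made in this comment, not something confirmed in this thread, and the helper names (`is_aligned`, `padded_size`, `report_alignment`) and the example frame size are hypothetical.

```python
def is_aligned(size: int, multiple: int = 16) -> bool:
    """Return True if `size` is an exact multiple of `multiple`."""
    return size % multiple == 0


def padded_size(size: int, multiple: int = 16) -> int:
    """Round `size` up to the next multiple of `multiple`."""
    return -(-size // multiple) * multiple  # ceiling division, then scale back


def report_alignment(height: int, width: int, multiple: int = 16) -> dict:
    """Summarise which frame dimensions are unaligned and what padding would fix them."""
    return {
        "height_aligned": is_aligned(height, multiple),
        "width_aligned": is_aligned(width, multiple),
        "padded_height": padded_size(height, multiple),
        "padded_width": padded_size(width, multiple),
    }


if __name__ == "__main__":
    # Hypothetical frame size, not taken from the dataset in this issue.
    print(report_alignment(height=250, width=256))
```

If one of the dimensions turns out not to be a multiple of 16, padding or resizing frames to the reported size before they are written out may silence the warning, but since it only signals a slower code path, leaving it alone is also fine.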

benearnthof, Sep 06 '23 08:09