denoising-diffusion-pytorch

local GPU can't run at full utilization in small-sample cases

Open leiqianstat opened this issue 2 years ago • 5 comments

My data is a 32×320 matrix with 32 samples and 320 dimensions. Locally, on an RTX 4090, each iteration takes 20 s, with CPU usage at 99% and GPU usage at 1%. When I increase the sample size to 1,000 or 10,000, I get 20 iterations per second, with both CPU and GPU at 99%. When I ran the same example with n=32 and p=320 on a Kaggle P100, it ran at 3 iterations per second, again with CPU at 99% and GPU at 99%. I don't understand why the local GPU is so much slower than Kaggle at n=32. Hopefully this can be fixed; here is my code.

import torch
from denoising_diffusion_pytorch import Unet1D, GaussianDiffusion1D, Trainer1D, Dataset1D

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Unet1D(
    dim=64,
    dim_mults=(1, 2, 4, 8),
    channels=1
).to(device)

diffusion = GaussianDiffusion1D(
    model,
    seq_length=320,
    timesteps=100,
    objective='pred_v'
).to(device)


data = torch.randn(32, 320)               # 32 samples, 320 dimensions (placeholder data)
training_seq = data.unsqueeze(1).float()  # reshape to (32, 1, 320): (samples, channels, seq_length)
dataset = Dataset1D(training_seq)

trainer = Trainer1D(
    diffusion,
    dataset=dataset,
    train_batch_size=64,
    train_lr=8e-5,
    train_num_steps=500,         # total training steps
    gradient_accumulate_every=2,    # gradient accumulation steps
    ema_decay=0.995,                # exponential moving average decay
)

trainer.train()
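
A minimal diagnostic sketch (reusing diffusion, training_seq, and device from the snippet above) that can help separate a data-loading bottleneck from a compute bottleneck: time forward/backward passes directly on the GPU, bypassing the Trainer's DataLoader. If this loop is fast while trainer.train() is slow, the time is being spent on the CPU side (worker processes, batching) rather than in the model.

import time

x = training_seq.to(device)            # move the whole (32, 1, 320) batch to the GPU once

torch.cuda.synchronize()
start = time.time()
for _ in range(10):
    loss = diffusion(x)                # GaussianDiffusion1D's forward returns the training loss
    loss.backward()
torch.cuda.synchronize()
print(f"{(time.time() - start) / 10:.3f} s per iteration, model only")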

leiqianstat avatar Oct 20 '23 09:10 leiqianstat

Hi @lucidrains, could you help me see what the problem is?

leiqianstat avatar Oct 29 '23 09:10 leiqianstat

I have a similar question: I tried with a 2080 Ti (12 GB), but it went OOM immediately. When I reduced the dataset to 10 images it at least started to train, but very slowly, and it did not use much CPU or GPU at all. Do we know what hardware, image counts, and batch sizes are needed to utilize the hardware properly?
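
If memory is the limit, one common workaround (a sketch using the Trainer API from this repo's README; the numbers are illustrative, not recommendations) is to shrink the per-step batch and compensate with gradient accumulation, optionally with mixed precision:

trainer = Trainer(
    diffusion,
    'path/to/your/images',
    train_batch_size = 4,            # small per-step batch to fit in limited VRAM
    gradient_accumulate_every = 8,   # 4 * 8 = effective batch size of 32
    train_lr = 8e-5,
    train_num_steps = 700000,
    ema_decay = 0.995,
    amp = True                       # mixed precision, roughly halves activation memory
)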

reinterpret-cast avatar Apr 03 '24 07:04 reinterpret-cast

I can't seem to run Unet1D on a local GPU either. Unet2D picks up the GPU properly through Accelerate. Even though the device is set to "cuda:0", it only uses the CPU after a few seconds of GPU usage.
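
For what it's worth, a quick check of what PyTorch and Accelerate actually resolve to (a minimal sketch; assumes the default Accelerate configuration):

import torch
from accelerate import Accelerator

print(torch.cuda.is_available())       # should be True on a working CUDA install
print(torch.cuda.get_device_name(0))   # name of the local GPU

accelerator = Accelerator()            # the Trainer also places the model via Accelerate
print(accelerator.device)              # expect cuda:0 here, not cpu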

kidintwo3 avatar Apr 24 '24 12:04 kidintwo3

I found one reason for the slowness/idling: on Windows, the DataLoader does not work well when configured with parallelism. It keeps spawning new processes that live only for a short period. Windows seems to be the killer here. It would be nice if a warning were printed for Windows users.
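
For context, PyTorch DataLoader workers on Windows are started with the "spawn" method, which re-imports the main script in every worker process; the usual guard (a generic sketch of the pattern, not code from this repo) is to keep the training call out of module scope:

# train.py
import torch
from denoising_diffusion_pytorch import Unet1D, GaussianDiffusion1D, Trainer1D, Dataset1D

def main():
    model = Unet1D(dim=64, dim_mults=(1, 2, 4, 8), channels=1)
    diffusion = GaussianDiffusion1D(model, seq_length=320, timesteps=100, objective='pred_v')
    dataset = Dataset1D(torch.randn(32, 1, 320))
    trainer = Trainer1D(diffusion, dataset=dataset, train_batch_size=64, train_lr=8e-5,
                        train_num_steps=500, gradient_accumulate_every=2, ema_decay=0.995)
    trainer.train()

if __name__ == '__main__':   # without this guard, spawned DataLoader workers can re-run the script on Windows
    main()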

reinterpret-cast avatar Apr 24 '24 17:04 reinterpret-cast

> I found one reason for the slowness/idling: on Windows, the DataLoader does not work well when configured with parallelism. It keeps spawning new processes that live only for a short period. Windows seems to be the killer here. It would be nice if a warning were printed for Windows users.

Removing num_workers seems to fix it:

https://github.com/lucidrains/denoising-diffusion-pytorch/blob/9c9e403969433b2dc477cb8005d3b9f3b4117487/denoising_diffusion_pytorch/denoising_diffusion_pytorch_1d.py#L767
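
If editing the library locally, one way to express that fix (a sketch of the idea, not the repo's actual code; dataset and train_batch_size stand for the trainer's own variables at that line) is to skip worker processes on Windows only, where the per-epoch spawn overhead dominates on small datasets:

import os
import platform
from torch.utils.data import DataLoader

# multiple workers are fine on Linux/macOS (fork), but on Windows every short epoch
# re-spawns all of them; num_workers=0 keeps loading in the main process instead
num_workers = 0 if platform.system() == 'Windows' else os.cpu_count()

dl = DataLoader(dataset, batch_size=train_batch_size, shuffle=True,
                pin_memory=True, num_workers=num_workers)

Alternatively, passing persistent_workers=True to the DataLoader keeps workers alive across epochs, which avoids the constant re-spawning that is especially costly when the dataset is only a few batches long.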

kidintwo3 avatar Apr 24 '24 17:04 kidintwo3