denoising-diffusion-pytorch
GPU utilization
With the unoptimized implementation at github.com/openai/guided-diffusion, a GPU utilization of about 20 to 30% can be achieved during training. What percentage can be achieved with this implementation?
Based on my experiments, the utilization is very low: most of the time the GPU is not computing anything, just waiting. For an input of batch size 6 and image size 128×128×3, one forward training step takes 18 s on a V100. I first thought this was due to an inefficient implementation, but maybe it is a common pitfall of diffusion models. Can someone explain this?
I am seeing the same thing: GPU utilization is 0% most of the time and only periodically spikes when the GPU is actually used. I suspect a lot of the run time is spent retrieving images from disk, or in the various Python-side calls, rather than in the forward pass through the UNet that actually uses the GPU.
I observed this for both single- and multi-GPU training, so the overhead of synchronizing data batches across multiple GPUs when using Accelerate is not the cause (I also switched to DataParallel and saw the same thing). Adjusting the EMA update frequency doesn't make a difference either.
I think the reason is slow data loading: many image-preprocessing operations are applied every time an image is loaded from the dataset.
When I used these same transforms on other datasets such as LSUN and CelebA in previous work, I never ran into this issue. Perhaps the use of Image.open here, which reads each image directly from disk, is the inefficient part. Datasets like CelebA and CIFAR seem to use different, PyTorch-specific representations that could be more efficient.
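One way to check whether the dataloader really is the bottleneck is to time batch fetches separately from the model step. A minimal stdlib-only sketch (the dataset, the sleep durations, and profile_epoch are stand-ins I made up to illustrate the measurement, not the repo's actual classes):

```python
import time

class SlowDataset:
    """Stand-in for an image dataset whose __getitem__ does slow decoding."""
    def __init__(self, n, decode_time=0.01):
        self.n = n
        self.decode_time = decode_time

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        time.sleep(self.decode_time)  # simulate Image.open + transforms
        return i

def profile_epoch(dataset, step_time=0.002):
    """Return (seconds spent fetching data, seconds spent in 'compute')."""
    fetch = compute = 0.0
    for i in range(len(dataset)):
        t0 = time.perf_counter()
        _ = dataset[i]
        fetch += time.perf_counter() - t0

        t0 = time.perf_counter()
        time.sleep(step_time)  # simulate the forward/backward pass
        compute += time.perf_counter() - t0
    return fetch, compute

fetch, compute = profile_epoch(SlowDataset(20))
print(f"data: {fetch:.3f}s  compute: {compute:.3f}s")
```

If the data time dominates the compute time like it does here, the GPU idles between steps no matter how fast the UNet is, and prefetching (e.g. more DataLoader workers) or caching is the fix.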
I got about a 20% speedup by caching the Image.open results, but my PC is still mostly idle:
from PIL import Image

# in the Dataset's __init__: decode every image once, up front
self.imgs = {}
for path in self.paths:
    img = Image.open(path)
    img.load()  # force PIL's lazy read to happen now, not in __getitem__
    self.imgs[path] = img

def __len__(self):
    return len(self.paths)

def __getitem__(self, index):
    path = self.paths[index]
    img = self.imgs[path]  # served from memory, no disk read
    return self.transform(img)
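The same idea can also be done lazily, so only images that are actually requested get decoded and kept. A hedged stdlib sketch using functools.lru_cache (load_decoded is a hypothetical stand-in for Image.open plus img.load(); it returns a string here just to keep the example self-contained):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def load_decoded(path):
    # Stand-in for Image.open(path) followed by img.load();
    # with lru_cache, each path is decoded only on first access.
    return f"decoded:{path}"

# First access per path pays the cost; repeats hit the cache.
paths = ["a.png", "b.png", "a.png"]
imgs = [load_decoded(p) for p in paths]
print(load_decoded.cache_info())  # hits=1, misses=2
```

Note that either variant (eager dict or lru_cache) keeps decoded images in RAM, which may not fit for large datasets, and with DataLoader num_workers > 0 each worker process holds its own copy of the cache.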