denoising-diffusion-pytorch
GPU utilization
With the unoptimized implementation at github.com/openai/guided-diffusion, a GPU utilization of about 20 to 30% can be achieved during training. What percentage can be achieved with this implementation?
Based on my experiments, the utilization is very low: most of the time the GPU is not computing anything, just waiting. For an input of batch size 6 and image size 128×128×3, one forward training step takes 18 s on a V100. I first thought this was due to an inefficient implementation, but maybe it is a common pitfall of diffusion models. Can someone explain this?
I am seeing the same thing: GPU utilization is 0% most of the time and only periodically spikes when the GPU is actually used. I suspect a lot of the run time is spent retrieving images from disk, or in the various Python-side calls, rather than in the forward pass through the UNet that actually uses the GPU.
I observed this for both single- and multi-GPU training, so the overhead of synchronizing data batches across multiple GPUs when using Accelerate is not the cause (I also switched to DataParallel and saw the same thing). Adjusting the EMA update frequency doesn't make a difference either.
I think the reason is slow data loading: many image-preprocessing operations are applied every time an image is loaded from the dataset.
When I used these same transforms on other datasets such as LSUN and CelebA in previous work, I never ran into this issue. Perhaps the use of Image.open here, which reads each image directly from disk, is the inefficient part. Datasets like CelebA and CIFAR seem to use different, PyTorch-specific representations that could be more efficient.
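One way to check whether the dataloader really is the bottleneck is to time batch fetches separately from the model step. A minimal stdlib-only sketch (the dataset, the sleep durations, and profile_epoch are stand-ins I made up to illustrate the measurement, not the repo's actual classes):

```python
import time

class SlowDataset:
    """Stand-in for an image dataset whose __getitem__ does slow decoding."""
    def __init__(self, n, decode_time=0.01):
        self.n = n
        self.decode_time = decode_time

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        time.sleep(self.decode_time)  # simulate Image.open + transforms
        return i

def profile_epoch(dataset, step_time=0.002):
    """Return (seconds spent fetching data, seconds spent in 'compute')."""
    fetch = compute = 0.0
    for i in range(len(dataset)):
        t0 = time.perf_counter()
        _ = dataset[i]
        fetch += time.perf_counter() - t0

        t0 = time.perf_counter()
        time.sleep(step_time)  # simulate the forward/backward pass
        compute += time.perf_counter() - t0
    return fetch, compute

fetch, compute = profile_epoch(SlowDataset(20))
print(f"data: {fetch:.3f}s  compute: {compute:.3f}s")
```

If the data time dominates the compute time like it does here, the GPU idles between steps no matter how fast the UNet is, and prefetching (e.g. more DataLoader workers) or caching is the fix.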
I got about a 20% speedup by caching the Image.open results, but my PC is still mostly idle:
from PIL import Image

# in the Dataset's __init__: decode every image once, up front
self.imgs = {}
for path in self.paths:
    img = Image.open(path)
    img.load()  # force PIL's lazy read to happen now, not in __getitem__
    self.imgs[path] = img

def __len__(self):
    return len(self.paths)

def __getitem__(self, index):
    path = self.paths[index]
    img = self.imgs[path]  # served from memory, no disk read
    return self.transform(img)
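The same idea can also be done lazily, so only images that are actually requested get decoded and kept. A hedged stdlib sketch using functools.lru_cache (load_decoded is a hypothetical stand-in for Image.open plus img.load(); it returns a string here just to keep the example self-contained):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def load_decoded(path):
    # Stand-in for Image.open(path) followed by img.load();
    # with lru_cache, each path is decoded only on first access.
    return f"decoded:{path}"

# First access per path pays the cost; repeats hit the cache.
paths = ["a.png", "b.png", "a.png"]
imgs = [load_decoded(p) for p in paths]
print(load_decoded.cache_info())  # hits=1, misses=2
```

Note that either variant (eager dict or lru_cache) keeps decoded images in RAM, which may not fit for large datasets, and with DataLoader num_workers > 0 each worker process holds its own copy of the cache.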