Memory leak on GaussianBlur
🐛 Describe the bug
Hello. When using num_workers > 0 in the DataLoader together with a GaussianBlur placed BEFORE the Resize in the transforms (the images in the dataset have different sizes), a memory leak appears. The more workers are used, the larger the leak (I ran out of 128 GB of RAM within 300 iterations with batch_size=32 and num_workers=16). To reproduce (initialize `images` with a list of file paths to images):
```python
import torch
from torchvision import transforms
from PIL import Image


class FramesDataset(torch.utils.data.Dataset):
    def __init__(self, images):
        self.images = images
        self.init_base_transform()

    def __len__(self):
        return len(self.images)

    def init_base_transform(self):
        self.tr_aug = transforms.Compose([
            transforms.GaussianBlur(7, (1, 5)),
            transforms.Resize((256, 256), antialias=True),
            transforms.ToTensor(),
            transforms.Normalize([0.5] * 3, [0.5] * 3),
        ])

    def __getitem__(self, idx):
        img = Image.open(self.images[idx]).convert('RGB')
        out = self.tr_aug(img)
        return out


dataset = FramesDataset(images)
dl = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=8, pin_memory=False)

while True:
    for batch in dl:
        pass
```
Versions
torch: 1.12.0+cu116, torchvision: 0.13.0+cu116, PIL: 9.0.0, Ubuntu: 20.04.4 LTS
cc @vfdev-5 @datumbox
@GLivshits thanks for reporting this.
What's special about GaussianBlur is that it doesn't natively handle PIL images: it converts from PIL to Tensor and back. We'll have to check whether there is a leak somewhere during that conversion, but this is hard and it might not be an issue on our side. It would help a lot if you could help us narrow this down a bit. Can you replace the PIL read with something like:
```python
img = torchvision.io.read_image(self.images[idx])
```
You won't need the ToTensor() call in your transforms; everything else remains the same. Do you still observe a memory leak?
Replaced PIL.Image.open with torchvision.io.read_image (it outputs a uint8 tensor); it still leaks.
```python
import torch
import torchvision
from torchvision import transforms


class TestDataset(torch.utils.data.Dataset):
    def __init__(self, images):
        self.images = images
        self.init_base_transform()

    def __len__(self):
        return len(self.images)

    def init_base_transform(self):
        self.tr_aug = transforms.Compose([
            transforms.GaussianBlur(7, (1, 5)),
            transforms.Resize((256, 256)),
            transforms.Normalize([0.5] * 3, [0.5] * 3),
        ])

    def __getitem__(self, idx):
        img = torchvision.io.read_image(self.images[idx]).type(torch.float32).div(255.)
        out = self.tr_aug(img)
        return out


dataset = TestDataset(images)
dl = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=8, pin_memory=False)

while True:
    for batch in dl:
        pass
```
@GLivshits I tried to reproduce the issue by measuring memory usage with your script plus psutil:
```python
import os
import psutil
import torch
from torchvision import transforms
from PIL import Image


class FramesDataset(torch.utils.data.Dataset):
    def __init__(self, images):
        self.images = images
        self.init_base_transform()

    def __len__(self):
        return len(self.images)

    def init_base_transform(self):
        self.tr_aug = transforms.Compose([
            transforms.GaussianBlur(7, (1, 5)),
            transforms.Resize((256, 256), antialias=True),
            transforms.ToTensor(),
            transforms.Normalize([0.5] * 3, [0.5] * 3),
        ])

    def __getitem__(self, idx):
        img = Image.open(self.images[idx]).convert('RGB')
        out = self.tr_aug(img)
        return out


images = ["test-image.jpg" for _ in range(1000)]
dataset = FramesDataset(images)
dl = torch.utils.data.DataLoader(dataset, batch_size=16, num_workers=8, pin_memory=False)

p = psutil.Process(os.getpid())
epoch = 0
while epoch < 100:
    mem_usage = p.memory_info().rss / 1024 / 1024
    print(epoch, "- mem_usage:", mem_usage)
    for batch in dl:
        pass
    epoch += 1
```
I did 2 experiments:

- with GaussianBlur, as in the code above:
Output (abridged):
```
0 - mem_usage: 204.5625
1 - mem_usage: 208.80078125
2 - mem_usage: 208.8359375
3 - mem_usage: 208.85546875
4 - mem_usage: 208.8671875
...
30 - mem_usage: 209.0
...
60 - mem_usage: 209.04296875
...
97 - mem_usage: 209.0625
98 - mem_usage: 209.0625
99 - mem_usage: 209.0625
```
- without GaussianBlur, i.e. with:
```python
self.tr_aug = transforms.Compose([
    transforms.Resize((256, 256), antialias=True),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])
```
Output (abridged):
```
0 - mem_usage: 204.5703125
1 - mem_usage: 209.0546875
2 - mem_usage: 209.09765625
3 - mem_usage: 209.109375
4 - mem_usage: 209.1171875
...
30 - mem_usage: 209.265625
...
60 - mem_usage: 209.33203125
...
97 - mem_usage: 209.33984375
98 - mem_usage: 209.33984375
99 - mem_usage: 209.34375
```
My pytorch and torchvision versions: '1.13.0.dev20220704+cpu', '0.14.0a0'.
I see that in both logs memory consumption is growing. Can you detail how you identified that it is GaussianBlur causing the memory leak?
@vfdev-5 I'm just watching htop. The thing is that if the images are all of the same size, everything works fine (I also tried loading a dataset of a single image, and there is no memory leak). But if there are images of multiple sizes, the leak appears. It seems that some memory is reserved for the blurring operation in each worker, and whenever a tensor of a new size comes in, more memory leaks.
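For scripted measurement, the htop view (which includes the workers) can be approximated by summing RSS over the dataloader's child processes; the script above only measures the parent. A sketch using psutil (the helper name is my own):

```python
import os
import psutil


def total_rss_mb():
    """Return RSS of this process plus all child processes, in MiB."""
    p = psutil.Process(os.getpid())
    rss = p.memory_info().rss
    for child in p.children(recursive=True):   # dataloader workers
        try:
            rss += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass                               # worker exited mid-scan
    return rss / 1024 / 1024
```

Printing `total_rss_mb()` once per epoch instead of the parent's RSS should make worker-side growth visible.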
If a single image is used at different sizes, there is still a 4 GB RAM overhead.
@vfdev-5 I've excluded every other augmentation and swapped the order of blur and resize, and found the matching leaky configuration. The code provided is already a minimized version of the problem.