DALI
pipe rebuild has a memory leak?
If a bad picture is encountered, the DALI pipeline becomes unavailable and has to be rebuilt. When this happens repeatedly, the memory keeps growing.
Is this a DALI bug?
Hi @ronzhou18. To better answer this question, I would need a minimal reproduction example. If you could attach a small code snippet and perhaps an "image" file that causes the error, we could reproduce it on our end and analyze what is happening with the memory. Thank you
OK, thank you for your reply; I am preparing a minimal reproduction example. In addition, I only use DALI for mixed-backend image decoding. The DALI pipe becomes invalid when DALI decodes one damaged image among many, and then I rebuild the pipe. After many repetitions, the system memory usage becomes larger and larger.
I can imagine that the pipeline is not properly deleted for some reason. When you share the code we'll be able to check it and give a solution. Thanks
import os
import gc
from tqdm import trange
import numpy as np
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

class GetIterTensor:
    def __init__(self, batch_size: int, tensor_list: list):
        self.batch_size = batch_size
        self.tensor_list = tensor_list

    def __iter__(self):
        self.len = len(self.tensor_list)
        self.index = 0
        return self

    def __next__(self):
        if self.index == 0:
            self.len = len(self.tensor_list)
        batch_size = self.len if self.len < self.batch_size else self.batch_size
        batch = self.tensor_list[self.index:batch_size + self.index]
        print(len(batch))
        self.index += batch_size
        self.len -= batch_size
        if self.len <= 0:
            self.index = 0
        return batch

@pipeline_def(batch_size=16, num_threads=4, device_id=0, prefetch_queue_depth=1)
def my_pipeline(source):
    jpegs = fn.external_source(source, device="cpu", name="DALI_INPUT_0", dtype=types.UINT8)
    images_ = fn.decoders.image(jpegs, output_type=types.RGB, device="mixed")
    raw_size = 640
    images = fn.resize(images_, resize_longer=raw_size)
    crop_image = fn.crop_mirror_normalize(images,
                                          crop_w=raw_size,
                                          crop_h=raw_size,
                                          crop_pos_x=0.5,
                                          crop_pos_y=0.5,
                                          scale=1 / 255.0,
                                          dtype=types.FLOAT,
                                          fill_values=0,
                                          out_of_bounds_policy="pad",
                                          output_layout="CHW")
    return crop_image

def build(iter_func):
    my_pipe = my_pipeline(source=iter_func)
    my_pipe.build()
    return my_pipe

if __name__ == "__main__":
    path = "test_img_error"
    images = [os.path.join(path, image) for image in os.listdir(path)]
    batch_size = 16
    tensor_list = []
    get_data = GetIterTensor(batch_size=16, tensor_list=tensor_list)
    pipe = build(get_data)
    while True:
        for batch_id in trange(int(len(images) / batch_size)):
            batch = images[batch_id * batch_size:(batch_id + 1) * batch_size]
            for img in batch:
                tensor_list.append(np.fromfile(img, dtype=np.uint8))
            try:
                pipe.run()
                tensor_list.clear()
            except Exception as e:
                tensor_list.clear()
                del pipe
                gc.collect()
                pipe = build(get_data)
Image set: https://pan.baidu.com/s/1j8Jcbo6Gl3K-aRXc29foaw (password: lm5p)
Thanks
I have tried the code you shared and I found that the host memory and the GPU memory remain stable after restarting the pipeline many times. Looking at your snippet, it is properly deleting the pipeline on failure, so there should not be any leaks.
I am having problems accessing the data, since it requires creating an account and it doesn't allow me to use a phone number with country code. What I did was to mix some valid JPEG files with some GIF files (not supported) to recreate the issue. I saw that you have some GIFs in the directory, so I am guessing that this is the source of the exception.
If you can, please upload the images somewhere else that I can access more easily and I will try with your images. If you can, please also print the exception message and share it here, just in case you are hitting a different error.
Can you also give me more details on the memory that keeps growing? Is it the GPU memory or the host memory? How quickly does it grow?
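For reference, a minimal sketch of how the failure branch in the loop above could print the exception and log the host memory around the rebuild. The names pipe, build, get_data and tensor_list come from the snippet above; psutil is assumed to be installed, and this is an illustrative addition rather than part of the original reproduction:

import gc
import psutil

proc = psutil.Process()  # handle to the current process for memory queries

try:
    pipe.run()
    tensor_list.clear()
except Exception as e:
    # Report the exact error and the resident memory before and after the rebuild.
    print("pipeline error:", type(e).__name__, e)
    print("host RSS after failure: %.1f MiB" % (proc.memory_info().rss / 2**20))
    tensor_list.clear()
    del pipe
    gc.collect()
    pipe = build(get_data)
    print("host RSS after rebuild: %.1f MiB" % (proc.memory_info().rss / 2**20))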
Thank you for your reply.
I tried another way to reproduce this problem, by creating the image data programmatically. You can try it. If you increase "times" in the code, the host memory usage will grow.
import os
import gc
import cv2
import random
from tqdm import trange
import numpy as np
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

class GetIterTensor:
    def __init__(self, batch_size: int, tensor_list: list):
        self.batch_size = batch_size
        self.tensor_list = tensor_list

    def __iter__(self):
        self.len = len(self.tensor_list)
        self.index = 0
        return self

    def __next__(self):
        if self.index == 0:
            self.len = len(self.tensor_list)
        batch_size = self.len if self.len < self.batch_size else self.batch_size
        batch = self.tensor_list[self.index:batch_size + self.index]
        self.index += batch_size
        self.len -= batch_size
        if self.len <= 0:
            self.index = 0
        return batch

@pipeline_def(batch_size=16, num_threads=4, device_id=0, prefetch_queue_depth=1)
def my_pipeline(source):
    jpegs = fn.external_source(source, device="cpu", name="DALI_INPUT_0", dtype=types.UINT8)
    images_ = fn.decoders.image(jpegs, output_type=types.RGB, device="mixed")
    raw_size = 640
    images = fn.resize(images_, resize_longer=raw_size)
    crop_image = fn.crop_mirror_normalize(images,
                                          crop_w=raw_size,
                                          crop_h=raw_size,
                                          crop_pos_x=0.5,
                                          crop_pos_y=0.5,
                                          scale=1 / 255.0,
                                          dtype=types.FLOAT,
                                          fill_values=0,
                                          out_of_bounds_policy="pad",
                                          output_layout="CHW")
    return crop_image

def build(iter_func):
    my_pipe = my_pipeline(source=iter_func)
    my_pipe.build()
    return my_pipe

def create_fake_img():
    img = np.random.randn(random.randint(1080, 2500), random.randint(1920, 2500), 3)
    img = np.array(img, dtype=np.uint8)
    encoded = cv2.imencode(".jpg", img)[1]
    return encoded

if __name__ == "__main__":
    times = 10
    encodeds = []
    for time in range(times):
        encoded = [create_fake_img() for i in range(8)]
        encodeds.append(encoded)
    # images = [os.path.join(path, image) for image in os.listdir(path)]
    # batch_size = 1
    tensor_list = []
    get_data = GetIterTensor(batch_size=16, tensor_list=tensor_list)
    pipe = build(get_data)
    index = 0
    while True:
        corrupted = [np.delete(encodeds[index][i],
                               np.random.randint(i, len(encodeds[index][i]) - 1, size=random.randint(1, 1080)).tolist(), 0)
                     for i in range(8)]
        images = encodeds[index] + corrupted
        for img in images:
            tensor_list.append(img)
        try:
            pipe.run()
            tensor_list.clear()
            print("succ")
        except Exception as e:
            print(e)
            tensor_list.clear()
            del pipe
            gc.collect()
            pipe = build(get_data)
        index += 1
        if index == times:
            index = 0
I've tried your snippet, and I can see the memory utilization stabilizing. I measured with top and with

import psutil
print('RAM memory % used:', psutil.virtual_memory()[2])

on every pipe.run().
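As a side note, psutil.virtual_memory()[2] is the system-wide RAM usage percentage; when chasing a suspected leak in a single process it can be clearer to log that process's resident set size instead. A minimal sketch, assuming psutil is available:

import psutil

# Resident set size of the current process, in MiB; printing this after every
# pipe.run() makes per-iteration growth easier to spot than a system-wide percentage.
print("process RSS: %.1f MiB" % (psutil.Process().memory_info().rss / 2**20))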
OK, with this reproduction method the memory may stop increasing once it has grown to a certain level. But if I run with a large number of pictures, I find that it keeps rising and does not stabilize (it grew to 7.5 GB in an hour). I don't understand why. I'll try other ways to create data to simulate it.
In addition, I want to ask why, when I don't add the corrupted pictures, the host memory usage is about 1.2 GB, while with the previous code snippet, after adding the corrupted picture data (and therefore repeatedly calling the "build" method), the host memory usage is about 2 GB to 3 GB. The modified code that uses 1.2 GB of host memory is as follows:
if __name__ == "__main__":
    times = 10
    encodeds = []
    encoded = [create_fake_img() for i in range(8)]
    # images = [os.path.join(path, image) for image in os.listdir(path)]
    # batch_size = 1
    tensor_list = []
    get_data = GetIterTensor(batch_size=16, tensor_list=tensor_list)
    pipe = build(get_data)
    index = 0
    while True:
        images = encoded * 2
        for img in images:
            tensor_list.append(img)
        try:
            pipe.run()
            tensor_list.clear()
            print("succ")
        except Exception as e:
            print(e)
            tensor_list.clear()
            del pipe
            gc.collect()
            pipe = build(get_data)
        index += 1
        if index == times:
            index = 0
Test environment: OS: CentOS 7, CUDA: 11.6, DALI: 1.16, GPU: 2080 Ti
I can confirm that the memory consumption is higher when we use the corrupted images. We will investigate why.
@ronzhou18 I've run extensive tests on an example and while the working set of the process indeed grows, I couldn't observe any growth coming from nvjpeg. I'll run some more tests to include the fallback to host decoder (which will also fail) to see if there are any leaks there.
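For illustration, a variant of the reproduction pipeline that forces the host decoder (no nvJPEG) could look like the sketch below. my_cpu_pipeline is a hypothetical name, the operator arguments are copied from the scripts above, and the only change is the decoder's device switching from "mixed" to "cpu":

@pipeline_def(batch_size=16, num_threads=4, device_id=0, prefetch_queue_depth=1)
def my_cpu_pipeline(source):
    jpegs = fn.external_source(source, device="cpu", name="DALI_INPUT_0", dtype=types.UINT8)
    # device="cpu" selects the host decoder instead of the GPU-accelerated "mixed" one
    images = fn.decoders.image(jpegs, output_type=types.RGB, device="cpu")
    images = fn.resize(images, resize_longer=640)
    return fn.crop_mirror_normalize(images,
                                    crop_w=640, crop_h=640,
                                    crop_pos_x=0.5, crop_pos_y=0.5,
                                    scale=1 / 255.0, dtype=types.FLOAT,
                                    fill_values=0, out_of_bounds_policy="pad",
                                    output_layout="CHW")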
@mzient OK, thank you very much!
I wrote a C++ equivalent of the test program to exclude any leaks in Python and ran it through valgrind. It shows exactly the same state of the heap after 10 and 100 iterations - down to a single byte. Even the Python program sees very little difference in heap usage. Interestingly, the working set (or "resident" memory, in Linux terms) increases slightly over time, but it's still well below the system's RAM capacity. At present, I have no evidence to attribute this behavior to anything other than the internal operation of the heap.
@ronzhou18 Did you experience any low-memory condition due to the effect, or is it just an observation of an otherwise healthy program? Also, I think the link you provided at the beginning of this issue no longer works (it's hard to tell, I don't read Chinese). I'd like to have a more realistic reproduction, because the values you reported for your dataset are much larger than anything we've observed. In order to confirm that the issue really exists, I'd need a reproduction that's closer to your original problem than the one with artificial dataset, on which we've been testing - and also, it'd have to happen regardless of memory consumption in the system.
I uploaded some of the problematic pictures to a Google Drive folder. You can try them.
https://drive.google.com/drive/folders/1k2dcjB2dCnXBhpi7F4vNwo0H0gElGrpJ?usp=sharing
@mzient In both mixed mode and CPU mode, the dataset at the link above keeps causing memory leaks when we run the program in a loop.
@ronzhou18 @KrisChou I can confirm a memory leak with the dataset provided. I will publish a fix soon.