
Pipeline rebuild has a memory leak?

ronzhou18 opened this issue 2 years ago • 16 comments

If a corrupted image is encountered, the DALI pipeline becomes unusable and needs to be rebuilt. Each time this happens, the memory keeps growing.

Is this a DALI bug?

ronzhou18 avatar Jul 20 '22 03:07 ronzhou18

Hi @ronzhou18. To better answer this question, I would need a minimal reproduction example. If you could attach a small code snippet and perhaps an "image" file that causes the error, we could reproduce it on our end and analyze what is happening with the memory. Thank you.

jantonguirao avatar Jul 20 '22 07:07 jantonguirao

OK, thank you for your reply; I am preparing a minimal reproduction example. For context, I only use DALI for mixed image decoding. The pipeline becomes invalid when DALI decodes one damaged image among many valid ones, and then I rebuild the pipeline. After many repetitions, the system memory usage grows larger and larger.

ronzhou18 avatar Jul 20 '22 08:07 ronzhou18

I can imagine that the pipeline is not properly deleted for some reason. When you share the code, we'll be able to check it and give a solution. Thanks.

jantonguirao avatar Jul 20 '22 09:07 jantonguirao

import os
import gc
from tqdm import trange
import numpy as np
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

class GetIterTensor:
    def __init__(self, batch_size: int, tensor_list: list):
        self.batch_size = batch_size
        self.tensor_list = tensor_list

    def __iter__(self):
        self.len = len(self.tensor_list)
        self.index = 0
        return self

    def __next__(self):
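        # On a new pass (index == 0), refresh the length from the externally mutated tensor_list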
        if self.index == 0:
            self.len = len(self.tensor_list)

        batch_size = self.len if self.len < self.batch_size else self.batch_size
        batch = self.tensor_list[self.index:batch_size + self.index]
        print(len(batch))

        self.index += batch_size
        self.len -= batch_size
        if self.len <= 0:
            self.index = 0
        return batch

@pipeline_def(batch_size=16, num_threads=4, device_id=0, prefetch_queue_depth=1)
def my_pipeline(source):
    jpegs = fn.external_source(source, device="cpu", name="DALI_INPUT_0", dtype=types.UINT8)
    images_ = fn.decoders.image(jpegs, output_type=types.RGB, device="mixed")
    raw_size = 640
    images = fn.resize(images_, resize_longer=raw_size)
    crop_image = fn.crop_mirror_normalize(images,
                                          crop_w=raw_size,
                                          crop_h=raw_size,
                                          crop_pos_x=0.5,
                                          crop_pos_y=0.5,
                                          scale=1 / 255.0,
                                          dtype=types.FLOAT,
                                          fill_values=0,
                                          out_of_bounds_policy="pad",
                                          output_layout="CHW")
    return crop_image

def build(iter_func):
    my_pipe = my_pipeline(source=iter_func)
    my_pipe.build()
    return my_pipe

if __name__ == "__main__":
    path = "test_img_error"
    images = [os.path.join(path, image) for image in os.listdir(path)]
    batch_size = 16
    tensor_list = []
    get_data = GetIterTensor(batch_size=16, tensor_list=tensor_list)
    pipe = build(get_data)

    while True:
        for batch_id in trange(int(len(images)/batch_size)):
            batch = images[batch_id*batch_size:(batch_id+1)*batch_size]
            for img in batch:
                tensor_list.append(np.fromfile(img, dtype=np.uint8))

            try:
                pipe.run()
                tensor_list.clear()
            except Exception as e:
                tensor_list.clear()
                del pipe
                gc.collect()
                pipe = build(get_data)


Image set: https://pan.baidu.com/s/1j8Jcbo6Gl3K-aRXc29foaw (password: lm5p)

Thanks

ronzhou18 avatar Jul 20 '22 09:07 ronzhou18

I have tried the code you shared and I found that the host memory and the GPU memory remain stable after restarting the pipeline many times. Looking at your snippet, it is properly deleting the pipeline on failure, so there should not be any leaks.

I am having problems accessing the data, since it requires creating an account and it doesn't allow me to use a phone number with a country code. To recreate the issue, I mixed some valid JPEG files with some GIF files (which are not supported). I saw that you have some GIFs in the directory, so I am guessing that this is the source of the exception.

If you can, please upload the images somewhere else that I can access more easily, and I will try with your images. If you can, please also print the exception message and share it here, just in case you are hitting a different error.

Can you also give me more details on the memory that keeps growing? Is it the GPU memory or the host memory? How quickly does it grow?
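
For reference, here is a rough sketch of how you could log both after every pipe.run() (it assumes the pynvml package is installed and that device 0 is the one the pipeline uses):

import psutil
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # device_id=0, as in your pipeline

def log_memory():
    # Host side: resident set size of this process
    rss_mib = psutil.Process().memory_info().rss / 2**20
    # GPU side: used memory on the device (includes other processes)
    gpu_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20
    print(f"host RSS: {rss_mib:.1f} MiB, GPU used: {gpu_mib:.1f} MiB")

Logging this after each iteration would show which of the two grows, and roughly how fast.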

jantonguirao avatar Jul 20 '22 13:07 jantonguirao

Thank you for your reply.

I tried another way to reproduce this problem, by generating image data; you can try it. If you increase "times" in the code, the host memory usage grows.

import os
import gc
import cv2
import random
from tqdm import trange
import numpy as np
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def

class GetIterTensor:
    def __init__(self, batch_size: int, tensor_list: list):
        self.batch_size = batch_size
        self.tensor_list = tensor_list

    def __iter__(self):
        self.len = len(self.tensor_list)
        self.index = 0
        return self

    def __next__(self):
        if self.index == 0:
            self.len = len(self.tensor_list)

        batch_size = self.len if self.len < self.batch_size else self.batch_size
        batch = self.tensor_list[self.index:batch_size + self.index]

        self.index += batch_size
        self.len -= batch_size
        if self.len <= 0:
            self.index = 0
        return batch

@pipeline_def(batch_size=16, num_threads=4, device_id=0, prefetch_queue_depth=1)
def my_pipeline(source):
    jpegs = fn.external_source(source, device="cpu", name="DALI_INPUT_0", dtype=types.UINT8)
    images_ = fn.decoders.image(jpegs, output_type=types.RGB, device="mixed")
    raw_size = 640
    images = fn.resize(images_, resize_longer=raw_size)
    crop_image = fn.crop_mirror_normalize(images,
                                          crop_w=raw_size,
                                          crop_h=raw_size,
                                          crop_pos_x=0.5,
                                          crop_pos_y=0.5,
                                          scale=1 / 255.0,
                                          dtype=types.FLOAT,
                                          fill_values=0,
                                          out_of_bounds_policy="pad",
                                          output_layout="CHW")
    return crop_image

def build(iter_func):
    my_pipe = my_pipeline(source=iter_func)
    my_pipe.build()
    return my_pipe

def create_fake_img():
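    # Generate a noise image of random size and JPEG-encode it in memory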
    img = np.random.randn(random.randint(1080,2500), random.randint(1920,2500), 3)
    img = np.array(img, dtype=np.uint8)
    encoded = cv2.imencode(".jpg", img)[1]
    return encoded

if __name__ == "__main__":

    times = 10
    encodeds = []
    for time in range(times):
        encoded = [create_fake_img() for i in range(8)]
        encodeds.append(encoded)

    # images = [os.path.join(path, image) for image in os.listdir(path)]
    # batch_size = 1
    tensor_list = []
    get_data = GetIterTensor(batch_size=16, tensor_list=tensor_list)
    pipe = build(get_data)

    index = 0
    while True:
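        # Corrupt each encoded JPEG by deleting random bytes from its buffer,
        # so that decoding fails for half of the batch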
        corrupted = [np.delete(encodeds[index][i],
                               np.random.randint(i, len(encodeds[index][i]) - 1,
                                                 size=random.randint(1, 1080)).tolist(), 0)
                     for i in range(8)]

        images = encodeds[index] + corrupted
        for img in images:
            tensor_list.append(img)

        try:
            pipe.run()
            tensor_list.clear()
            print("succ")
        except Exception as e:
            print(e)
            tensor_list.clear()
            del pipe
            gc.collect()
            pipe = build(get_data)
        index += 1
        if index == times:
            index = 0


ronzhou18 avatar Jul 21 '22 03:07 ronzhou18

I've tried your snippet, and I can see the memory utilization stabilizing. I measured with top and with

import psutil
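# virtual_memory()[2] is the "percent" field (overall system RAM usage)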
print('RAM memory % used:', psutil.virtual_memory()[2])

on every pipe.run() call.

jantonguirao avatar Jul 21 '22 11:07 jantonguirao

OK, with this reproduction method the memory may stop growing once it reaches a certain level. But when I run with a large number of real pictures, it keeps rising and never stabilizes (it grew to 7.5 GB in an hour). I don't understand why. I'll try other ways of creating data to simulate it.

In addition, I want to ask: when I don't add the corrupted pictures, the host memory usage is 1.2 GB. But with the previous code snippet, after adding the corrupted picture data, the repeated "build" calls raise the host memory usage to about 2 GB to 3 GB. The modified code that occupies 1.2 GB of host memory is as follows:

if __name__ == "__main__":

    times = 10
    encodeds = []
    encoded = [create_fake_img() for i in range(8)]
    # images = [os.path.join(path, image) for image in os.listdir(path)]
    # batch_size = 1
    tensor_list = []
    get_data = GetIterTensor(batch_size=16, tensor_list=tensor_list)
    pipe = build(get_data)

    index = 0
    while True:
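        # Every batch contains only valid images, so run() never fails and the pipeline is never rebuilt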
        images = encoded*2
        for img in images:
            tensor_list.append(img)

        try:
            pipe.run()
            tensor_list.clear()
            print("succ")
        except Exception as e:
            print(e)
            tensor_list.clear()
            del pipe
            gc.collect()
            pipe = build(get_data)
        index += 1
        if index == times:
            index = 0


Test environment: OS: CentOS 7, CUDA: 11.6, DALI: 1.16, GPU: 2080 Ti

ronzhou18 avatar Jul 22 '22 09:07 ronzhou18

I can confirm that the memory consumption is higher when we use the corrupted images. We will investigate why.

jantonguirao avatar Jul 22 '22 12:07 jantonguirao

@ronzhou18 I've run extensive tests on an example, and while the working set of the process indeed grows, I couldn't observe any growth coming from nvJPEG. I'll run some more tests that include the fallback to the host decoder (which will also fail) to see if there are any leaks there.
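
For reference, the host decoder path can also be exercised directly with a CPU-only variant of the pipeline. A minimal sketch (the same pipeline with the decoder moved to device="cpu"; my_cpu_pipeline is just an illustrative name):

@pipeline_def(batch_size=16, num_threads=4, device_id=0, prefetch_queue_depth=1)
def my_cpu_pipeline(source):
    jpegs = fn.external_source(source, device="cpu", name="DALI_INPUT_0", dtype=types.UINT8)
    # Decoding runs entirely on the CPU, so corrupted images fail in the host decoder
    images = fn.decoders.image(jpegs, output_type=types.RGB, device="cpu")
    return fn.resize(images, resize_longer=640)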

mzient avatar Jul 28 '22 08:07 mzient

@mzient OK, thank you very much!

ronzhou18 avatar Jul 29 '22 01:07 ronzhou18

I wrote a C++ equivalent of the test program to exclude any leaks in Python and ran it through valgrind. It shows exactly the same state of the heap after 10 and 100 iterations, down to a single byte. Even the Python program sees very little difference in heap usage. Interestingly, the working set (or "resident" memory, in Linux terms) increases slightly over time, but it's still well below the system's RAM capacity. At present, I have no evidence to attribute this behavior to anything other than the internal operation of the heap.

mzient avatar Jul 29 '22 07:07 mzient

@ronzhou18 Did you actually run into a low-memory condition because of this effect, or is it just an observation about an otherwise healthy program? Also, I think the link you provided at the beginning of this issue no longer works (it's hard to tell; I don't read Chinese). I'd like a more realistic reproduction, because the values you reported for your dataset are much larger than anything we've observed. To confirm that the issue really exists, I'd need a reproduction that's closer to your original problem than the artificial dataset we've been testing on, and it would also have to happen regardless of the overall memory consumption in the system.

mzient avatar Aug 01 '22 14:08 mzient

I uploaded some of the problematic pictures to Google Drive. You can try them.

https://drive.google.com/drive/folders/1k2dcjB2dCnXBhpi7F4vNwo0H0gElGrpJ?usp=sharing

ronzhou18 avatar Aug 03 '22 03:08 ronzhou18

@mzient In both mixed and CPU mode, the dataset in the link above keeps causing memory leaks when we run the program in a loop.

ToTheMonn avatar Aug 03 '22 07:08 ToTheMonn

@ronzhou18 @KrisChou I can confirm a memory leak with the dataset provided. I will publish a fix soon.

jantonguirao avatar Aug 04 '22 15:08 jantonguirao

Hi @ronzhou18,

DALI v1.16.1 with the fix for the reported issue has been released.

stiepan avatar Aug 26 '22 15:08 stiepan