
OOM occurs

Open zhanghang-cv opened this issue 2 years ago • 7 comments

Hello, this task is video classification training. The training set is 240,000 4-second video clips, and DALI is used for video loading. The training host has 500 GB of RAM. During training, memory usage keeps climbing until it stabilizes at about 96%. If the training set is increased to 1 million clips, OOM occurs immediately. Why does DALI occupy so much memory during training, and how can the OOM be avoided? Below is the memory usage curve during training on the 240,000-clip dataset.

zhanghang-cv avatar Sep 15 '22 02:09 zhanghang-cv

[image: memory usage curve during training on the 240,000-clip dataset]

zhanghang-cv avatar Sep 15 '22 02:09 zhanghang-cv

from nvidia.dali import fn, pipeline_def, types
from nvidia.dali.plugin import pytorch


class DALILoader():
    def __init__(self, batch_size, file_list, sequence_length, step, stride,
                 crop_size, device_id, mode):
        if mode == 'train':
            self.pipeline = self._create_video_reader_pipeline_train(batch_size=batch_size,
                                                                     device_id=device_id,
                                                                     num_threads=8,
                                                                     file_list=file_list,
                                                                     sequence_length=sequence_length,
                                                                     step=step,
                                                                     stride=stride,
                                                                     crop_size=crop_size)
        else:
            self.pipeline = self._create_video_reader_pipeline_infer(batch_size=batch_size,
                                                                     device_id=device_id,
                                                                     num_threads=8,
                                                                     file_list=file_list,
                                                                     sequence_length=sequence_length,
                                                                     step=step,
                                                                     stride=stride,
                                                                     crop_size=crop_size)
        self.pipeline.build()
        self.epoch_size = self.pipeline.epoch_size("Reader")
        self.dali_iterator = pytorch.DALIGenericIterator(self.pipeline,
                                                         ["data", "label"],
                                                         reader_name="Reader",
                                                         auto_reset=True,
                                                         last_batch_policy=pytorch.LastBatchPolicy.FILL,
                                                         last_batch_padded=False)

    @pipeline_def
    def _create_video_reader_pipeline_train(self, file_list, sequence_length, step, stride, crop_size):
        # GPU video reader: decodes sequences of `sequence_length` frames per sample.
        images, labels = fn.readers.video(device="gpu",
                                          file_list=file_list,
                                          sequence_length=sequence_length,
                                          step=step,
                                          stride=stride,
                                          normalized=False,
                                          random_shuffle=True,
                                          image_type=types.RGB,
                                          dtype=types.FLOAT,
                                          initial_fill=1024,
                                          pad_last_batch=True,
                                          name="Reader")

        # images = fn.resize(images, resize_x=398, resize_y=224)
        # Randomized horizontal crop position for training.
        images = fn.crop(images, crop=crop_size, dtype=types.FLOAT,
                         crop_pos_x=fn.random.uniform(range=(0.1, 0.9)),
                         crop_pos_y=1)

        return images, labels

    @pipeline_def
    def _create_video_reader_pipeline_infer(self, file_list, sequence_length, step, stride, crop_size):
        images, labels = fn.readers.video(device="gpu",
                                          file_list=file_list,
                                          sequence_length=sequence_length,
                                          step=step,
                                          stride=stride,
                                          normalized=False,
                                          random_shuffle=True,
                                          image_type=types.RGB,
                                          dtype=types.FLOAT,
                                          initial_fill=1024,
                                          pad_last_batch=True,
                                          name="Reader")

        # images = fn.resize(images, resize_x=398, resize_y=224)
        # Fixed crop position for inference.
        images = fn.crop(images, crop=crop_size, dtype=types.FLOAT,
                         crop_pos_x=0.5,
                         crop_pos_y=1)

        return images, labels

    def __len__(self):
        return int(self.epoch_size)

    def __iter__(self):
        return self.dali_iterator.__iter__()
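For completeness, a minimal sketch of how a loader like this might be driven from PyTorch. The file list name, hyper-parameters, and training loop below are illustrative placeholders, not the values used in this issue; the file list is assumed to hold one "<video path> <label>" pair per line, which is the format fn.readers.video expects for file_list.

    # Hypothetical usage of the DALILoader above; all values are placeholders.
    loader = DALILoader(batch_size=8,
                        file_list="train_list.txt",
                        sequence_length=16,
                        step=16,
                        stride=1,
                        crop_size=(224, 224),
                        device_id=0,
                        mode='train')

    for epoch in range(10):
        for batch in loader:
            # DALIGenericIterator yields a list with one dict per pipeline.
            frames = batch[0]["data"]    # float tensor on the GPU, layout (N, F, H, W, C)
            labels = batch[0]["label"]
            # ... forward / backward pass on `frames` ...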

zhanghang-cv avatar Sep 15 '22 02:09 zhanghang-cv

Hi @zhanghang-cv,

For now, DALI creates a libavformat context for each video in the dataset - see https://github.com/NVIDIA/DALI/issues/2220 for more details. So in your case, with 1 million videos this alone can consume about 10 GB of CPU RAM (roughly 10 KB of bookkeeping per video). We can think about the tradeoff between recreating the context each time a video is needed and keeping it around to speed up later decoding. Creating a context is not free and can hurt the overall decoding speed, given that DALI composes batches from sequences randomly picked from any video in the dataset. The only solution that comes to my mind is to cache only N contexts and free the least recently used ones.
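As an illustration of that last idea only (not DALI's actual implementation), an LRU cache that keeps at most N per-file contexts alive could look roughly like this:

    from collections import OrderedDict

    class ContextCache:
        """Illustrative LRU cache: keep at most `capacity` per-file decoder contexts alive."""
        def __init__(self, capacity, open_context):
            self.capacity = capacity          # max number of contexts kept in RAM
            self.open_context = open_context  # callable that (re)creates a context for a file
            self.cache = OrderedDict()

        def get(self, filename):
            if filename in self.cache:
                self.cache.move_to_end(filename)  # mark as most recently used
                return self.cache[filename]
            ctx = self.open_context(filename)     # pay the creation cost only on a miss
            self.cache[filename] = ctx
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)    # evict the least recently used context
            return ctx

This trades peak RAM (bounded by `capacity`) against extra context re-creation cost whenever the reader jumps to a video that has already been evicted.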

JanuszL avatar Sep 15 '22 05:09 JanuszL

Thank you for your reply. What I did not mention before is that this task uses 8 pipelines for data loading, so the large memory usage is understandable. Below is an up-to-date memory usage curve (ending in OOM). As I understand it, the process has three stages: in the first stage the pipelines are initialized; in the second stage data is preloaded, and at its end the first batch is returned; in the third stage we loop over batches for training. We observed that each time a batch is loaded in the third stage, memory slowly increases (which eventually causes the OOM). What is the reason for this? Is each batch kept in memory after it is loaded?

[image: memory usage curve ending in OOM]
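One way to check whether that growth belongs to the training process itself, rather than to system-wide usage, is to log the process RSS every few iterations. A minimal sketch, assuming psutil is available and `loader` is the DALILoader shown above:

    import os
    import psutil

    proc = psutil.Process(os.getpid())
    for i, batch in enumerate(loader):
        if i % 100 == 0:
            rss_gib = proc.memory_info().rss / 2 ** 30
            print(f"iter {i}: process RSS = {rss_gib:.2f} GiB")
        # ... training step ...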

zhanghang-cv avatar Sep 15 '22 07:09 zhanghang-cv

Hi @zhanghang-cv,

How exactly do you measure memory consumption? Could the increase come from the OS caching the data in RAM as it is read from the drive?
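One quick check on a Linux host, assuming psutil: the page cache is reclaimable, so if the growth lives there rather than in the process, `available` should stay roughly flat even as `cached` climbs.

    import psutil

    # System-wide view: "cached" is the OS page cache (reclaimable), while
    # "available" is what the kernel can still hand out without swapping.
    vm = psutil.virtual_memory()
    print(f"total:     {vm.total / 2 ** 30:.1f} GiB")
    print(f"cached:    {vm.cached / 2 ** 30:.1f} GiB")
    print(f"available: {vm.available / 2 ** 30:.1f} GiB")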

JanuszL avatar Sep 15 '22 07:09 JanuszL

That is also possible. I want to confirm: after loading each batch of data, does DALI keep that data in memory for a later call?

zhanghang-cv avatar Sep 15 '22 07:09 zhanghang-cv

DALI doesn't use host RAM to store decoded video when the GPU video decoder is used, so that doesn't seem to be the reason.

JanuszL avatar Sep 15 '22 07:09 JanuszL