
Memory leak in instance segmentation validation part

Open · aisensiy opened this issue · 4 comments

🐛 Bug

Describe the bug

After PR https://github.com/airctic/icevision/pull/1095 the validation part finally works, but after several epochs validation runs out of memory.

To Reproduce

Use the code below:

from icevision.all import *

selection = 0

if selection == 0:
    model_type = models.mmdet.mask_rcnn
    backbone = model_type.backbones.resnet50_fpn_1x

if selection == 1:
    model_type = models.mmdet.mask_rcnn
    backbone = model_type.backbones.mask_rcnn_swin_t_p4_w7_fpn_1x_coco

if selection == 2:
    model_type = models.mmdet.yolact
    backbone = model_type.backbones.r101_1x8_coco


# Loading Data
# Create the parser
# Change this to any dataset that uses the COCO data format
parser = parsers.COCOMaskParser(
    annotations_filepath="/input0/annotation/train.json",
    img_dir="/input0/train2017",
)

train_rs, valid_rs = parser.parse(RandomSplitter([0.2, 0.8], seed=42))
image_size = 640
train_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(image_size), tfms.A.Normalize()])
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(image_size), tfms.A.Normalize()])

train_ds = Dataset(train_rs, train_tfms)
valid_ds = Dataset(valid_rs, valid_tfms)

train_dl = model_type.train_dl(train_ds, batch_size=4, num_workers=6, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=4, num_workers=6, shuffle=False)

model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(parser.class_map)) 


metrics = [COCOMetric(metric_type=COCOMetricType.mask, print_summary=True)]

class LightModel(model_type.lightning.ModelAdapter):
    def configure_optimizers(self):
        return Adam(self.parameters(), lr=5e-4)
    
light_model = LightModel(model, metrics=metrics)
trainer = pl.Trainer(max_epochs=5, gpus=1)
# If we train without passing valid_dl, there is no memory leak.
trainer.fit(light_model, train_dl, valid_dl) 

Memory usage grows steadily and becomes very high after several epochs.

Expected behavior

Memory used during validation is released after each validation epoch, so overall memory usage stays roughly constant across epochs.


Additional context

I think this part of the code has the memory leak:

https://github.com/airctic/icevision/blob/743cb7df0dae7eb1331fc2bb66fc9ca09db496cd/icevision/models/mmdet/common/mask/prediction.py#L105-L152
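
For a sense of scale: this code stacks dense boolean masks at the padded image resolution with np.vstack, so the raw masks for a single image can already be tens of MiB before thresholding. A rough back-of-the-envelope sketch (the instance count of 100 is an assumption, not a number taken from the issue):

import numpy as np

# Dense boolean masks at the 640x640 resolution used in the repro above,
# assuming up to 100 instances survive per image.
n_instances, h, w = 100, 640, 640
mask_stack = np.zeros((n_instances, h, w), dtype=bool)
print(f"{mask_stack.nbytes / 1024 / 1024:.1f} MiB per image")  # ~39.1 MiB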

I used a simple debugging method; the added lines below assume Process is imported from psutil and that the module's logger supports loguru-style {} formatting:

     # When no prediction was made for a class the mask will be empty which creates problems when using np.vstack. So we fill empty predictions with empty masks
-    empty_mask = np.full((0, sample["img"].shape[-2], sample["img"].shape[-1]), False)
-    filled_raw_masks = [mask if mask != [] else empty_mask for mask in raw_masks]
+    # empty_mask = np.full((0, sample["img"].shape[-2], sample["img"].shape[-1]), False)
+    filled_raw_masks = [mask for mask in raw_masks if mask != []]
 
+    usage = Process().memory_info().rss / 1024 / 1024
+    logger.info("memory usage before:\t{}", usage)
     keep_mask = scores > detection_threshold
     keep_scores = scores[keep_mask]
     keep_labels = labels[keep_mask]
     keep_bboxes = [BBox.from_xyxy(*o) for o in bboxes[keep_mask]]
-    keep_masks = MaskArray(np.vstack(filled_raw_masks)[keep_mask])
+    logger.info("shape of keep_mask:        {}", keep_mask.shape)
+    result = np.vstack(filled_raw_masks)[keep_mask]
+    logger.info("shape of keep_mask result: {}", result.shape)
+    keep_masks = MaskArray(result)
 
     keep_labels = convert_background_from_last_to_zero(
         label_ids=keep_labels, class_map=record.detection.class_map
@@ -142,6 +149,9 @@ def convert_raw_prediction(
     pred.detection.set_bboxes(keep_bboxes)
     pred.detection.set_mask_array(keep_masks)
     pred.above_threshold = keep_mask
+    
+    usage = Process().memory_info().rss / 1024 / 1024
+    logger.info("memory usage after:\t{}", usage)
 
     if keep_image:
         image = mmdet_tensor_to_image(sample["img"])
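
As an aside, the same measurement can be made without patching icevision's source, for example with a tiny context manager like the sketch below (rss_delta is a made-up name; the only requirement is psutil):

import contextlib
import psutil


@contextlib.contextmanager
def rss_delta(label):
    # Print the change in this process's resident set size around a block of code.
    proc = psutil.Process()
    before = proc.memory_info().rss / 1024 / 1024
    yield
    after = proc.memory_info().rss / 1024 / 1024
    print(f"{label}: {before:.1f} MiB -> {after:.1f} MiB ({after - before:+.1f} MiB)")


# Example:
# with rss_delta("one validation pass"):
#     trainer.validate(light_model, valid_dl)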

With this logging in place, I found that memory usage increases quickly during validation. That would be acceptable if the memory were released after every validation epoch. However, after the following code runs

https://github.com/airctic/icevision/blob/743cb7df0dae7eb1331fc2bb66fc9ca09db496cd/icevision/metrics/coco_metric/coco_metric.py#L41-L43

the memory as reported from Python appears to be released, yet the actual memory usage of the process (RSS) stays very high. After several epochs the training runs out of memory.
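
To watch this growth across epochs without editing library code, a minimal PyTorch Lightning callback along these lines can be attached to the Trainer from the repro script. This is only a diagnostic sketch; MemoryUsageLogger is a made-up name and the snippet assumes psutil is installed.

import psutil
import pytorch_lightning as pl


class MemoryUsageLogger(pl.Callback):
    # Print the process RSS after every validation epoch; a slow leak shows up
    # as a value that keeps growing from epoch to epoch.
    def on_validation_epoch_end(self, trainer, pl_module):
        rss_mb = psutil.Process().memory_info().rss / 1024 / 1024
        print(f"epoch {trainer.current_epoch}: RSS = {rss_mb:.1f} MiB")


# Attach it in the repro script, e.g.:
# trainer = pl.Trainer(max_epochs=5, gpus=1, callbacks=[MemoryUsageLogger()])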

aisensiy · May 05 '22 03:05

Can anyone give some help with this? I think it is not a big problem; the key part is in the metrics. It seems that after _reset the memory is not properly released.
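
One experiment that might help narrow this down (a sketch under an assumption, not a confirmed fix): if the accumulated records really are freed on the Python side after _reset, the high RSS could simply be the C allocator keeping freed pages in its arenas instead of returning them to the OS. On glibc-based Linux you can ask it to release them and check whether RSS drops:

import ctypes
import gc


def trim_freed_memory():
    # Run a GC pass, then ask glibc to hand free heap pages back to the OS.
    # Linux/glibc only; loading libc.so.6 will fail on other platforms.
    gc.collect()
    ctypes.CDLL("libc.so.6").malloc_trim(0)


# e.g. call trim_freed_memory() right after the metric's _reset() and compare
# the RSS readings from the logging above.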

aisensiy · May 10 '22 08:05

Hey, I don't have much time at the moment, but I will have a look at it when I have time again.

fstroth · May 10 '22 09:05

> Hey, I don't have much time at the moment, but I will have a look at it when I have time again.

Ok, thanks. I will add comments here if I make more progress.

aisensiy · May 11 '22 11:05

A fix would also be very helpful for my team. Thanks for your work.

jlvahldiek · Jun 09 '22 08:06