Memory leak in instance segmentation validation part
🐛 Bug
Describe the bug
After PR https://github.com/airctic/icevision/pull/1095 the validation step finally works, but after several epochs validation runs out of memory.
To Reproduce
Use the code below:
from icevision.all import *
selection = 0

if selection == 0:
    model_type = models.mmdet.mask_rcnn
    backbone = model_type.backbones.resnet50_fpn_1x
if selection == 1:
    model_type = models.mmdet.mask_rcnn
    backbone = model_type.backbones.mask_rcnn_swin_t_p4_w7_fpn_1x_coco
if selection == 2:
    model_type = models.mmdet.yolact
    backbone = model_type.backbones.r101_1x8_coco
# Loading Data
# Create the parser
# Change this to any datasets which have the COCO dataformat
parser = parsers.COCOMaskParser(
    annotations_filepath="/input0/annotation/train.json",
    img_dir="/input0/train2017",
)
train_rs, valid_rs = parser.parse(RandomSplitter([0.2, 0.8], seed=42))
image_size = 640
train_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(image_size), tfms.A.Normalize()])
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(image_size), tfms.A.Normalize()])
train_ds = Dataset(train_rs, train_tfms)
valid_ds = Dataset(valid_rs, valid_tfms)
train_dl = model_type.train_dl(train_ds, batch_size=4, num_workers=6, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=4, num_workers=6, shuffle=False)
model = model_type.model(backbone=backbone(pretrained=True), num_classes=len(parser.class_map))
metrics = [COCOMetric(metric_type=COCOMetricType.mask, print_summary=True)]
class LightModel(model_type.lightning.ModelAdapter):
    def configure_optimizers(self):
        return Adam(self.parameters(), lr=5e-4)
light_model = LightModel(model, metrics=metrics)
trainer = pl.Trainer(max_epochs=5, gpus=1)
# If training runs without the valid_dl part, there is no memory leak anymore.
trainer.fit(light_model, train_dl, valid_dl)
Memory usage grows steadily and becomes very high after several epochs.
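To quantify the growth without touching the library code, a small PyTorch Lightning callback like the sketch below can log the process RSS after every validation epoch. This is only an illustration: MemoryLogger is a made-up name, and it assumes psutil is installed.

import psutil
import pytorch_lightning as pl

class MemoryLogger(pl.Callback):
    # Logs the resident set size (RSS) of the training process after each
    # validation epoch, so growth across epochs is easy to see.
    def on_validation_epoch_end(self, trainer, pl_module):
        rss_mb = psutil.Process().memory_info().rss / 1024 / 1024
        print(f"epoch {trainer.current_epoch}: RSS = {rss_mb:.0f} MiB")

# e.g. trainer = pl.Trainer(max_epochs=5, gpus=1, callbacks=[MemoryLogger()])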
Expected behavior
Memory usage should stay roughly constant across epochs; validation should not run out of memory.
Additional context
I think this part of the code has the memory leak:
https://github.com/airctic/icevision/blob/743cb7df0dae7eb1331fc2bb66fc9ca09db496cd/icevision/models/mmdet/common/mask/prediction.py#L105-L152
I used a simple debugging method: I patched convert_raw_prediction to log the process RSS (Process is psutil.Process) before and after the masks are stacked:
     # When no prediction was made for a class the mask will be empty, which creates
     # problems when using np.vstack, so we fill empty predictions with empty masks.
-    empty_mask = np.full((0, sample["img"].shape[-2], sample["img"].shape[-1]), False)
-    filled_raw_masks = [mask if mask != [] else empty_mask for mask in raw_masks]
+    # empty_mask = np.full((0, sample["img"].shape[-2], sample["img"].shape[-1]), False)
+    filled_raw_masks = [mask for mask in raw_masks if mask != []]
+    usage = Process().memory_info().rss / 1024 / 1024
+    logger.info("memory usage before:\t{}", usage)

     keep_mask = scores > detection_threshold
     keep_scores = scores[keep_mask]
     keep_labels = labels[keep_mask]
     keep_bboxes = [BBox.from_xyxy(*o) for o in bboxes[keep_mask]]
-    keep_masks = MaskArray(np.vstack(filled_raw_masks)[keep_mask])
+    logger.info("shape of keep_mask: {}", keep_mask.shape)
+    result = np.vstack(filled_raw_masks)[keep_mask]
+    logger.info("shape of keep_mask result: {}", result.shape)
+    keep_masks = MaskArray(result)

     keep_labels = convert_background_from_last_to_zero(
         label_ids=keep_labels, class_map=record.detection.class_map
@@ -142,6 +149,9 @@ def convert_raw_prediction(
     pred.detection.set_bboxes(keep_bboxes)
     pred.detection.set_mask_array(keep_masks)
     pred.above_threshold = keep_mask
+
+    usage = Process().memory_info().rss / 1024 / 1024
+    logger.info("memory usage after:\t{}", usage)

     if keep_image:
         image = mmdet_tensor_to_image(sample["img"])
I found that memory usage increases quickly during validation. That would be fine if the memory were released after each validation run, but after
https://github.com/airctic/icevision/blob/743cb7df0dae7eb1331fc2bb66fc9ca09db496cd/icevision/metrics/coco_metric/coco_metric.py#L41-L43
the memory reported from inside Python is released, while the actual memory usage of the process stays very high. After several epochs, training runs out of memory.
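One way to test whether this is just the allocator holding on to freed pages (rather than objects still being referenced) is to force a garbage collection and, on Linux/glibc, ask malloc to return free heap pages to the OS. This is only a diagnostic sketch under that assumption, not a fix; trim_memory is a made-up helper name.

import ctypes
import gc

import psutil

def trim_memory():
    # Free unreachable Python objects first.
    gc.collect()
    # glibc-specific: return free heap pages to the OS (skipped elsewhere).
    try:
        ctypes.CDLL("libc.so.6").malloc_trim(0)
    except OSError:
        pass
    return psutil.Process().memory_info().rss / 1024 / 1024

# Call this e.g. right after the metric has finalized and compare the returned
# RSS with the value logged before validation started.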
Can anyone help with this? I don't think it is a big problem: the key part is in the metrics. It seems that after _reset the memory is not properly released.
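To narrow down whether the accumulated predictions in the metric are what keeps the memory alive, one could wrap COCOMetric and log RSS around its public accumulate/finalize calls. This is just a sketch assuming the accumulate(preds)/finalize() interface; it does not rely on the metric's internal attribute names, and LoggedCOCOMetric is only an illustrative name.

import psutil
from icevision.all import *

def _rss_mb():
    return psutil.Process().memory_info().rss / 1024 / 1024

class LoggedCOCOMetric(COCOMetric):
    # Illustrative wrapper: logs process memory when predictions are
    # accumulated and again after finalize(), i.e. after the metric has
    # reported its results and cleared its state.
    def accumulate(self, preds):
        super().accumulate(preds)
        print(f"RSS after accumulate: {_rss_mb():.0f} MiB")

    def finalize(self):
        result = super().finalize()
        print(f"RSS after finalize: {_rss_mb():.0f} MiB")
        return result

metrics = [LoggedCOCOMetric(metric_type=COCOMetricType.mask, print_summary=True)]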
Hey, I don't have much time at the moment, but I will have a look at it when I have time again.
Ok, thanks. I will post more comments if I make further progress.
A fix would also be very helpful for my team. Thanks for your work.