
Some Questions regarding inference time and current setup

karaposu opened this issue 1 year ago · 8 comments

Hello @wiktorlazarski ,

A couple of days ago I finished installing and running the repo on a Linux VM with GPU support. I have been inspecting the code for a while, and I want to say I am learning a lot just by reading it. It is so good that I want my personal project to have a similarly clean and configurable structure. Thanks again for creating this work.

Having said that, I do have some questions, and your insights would be highly appreciated.

Before I delve into them, let me give you a brief overview of my understanding of how the head-segmentation repo operates, and kindly rectify any inaccuracies.

For the model architectures, this repo depends on the segmentation_models.pytorch repo (https://github.com/qubvel/segmentation_models.pytorch). It sources pretrained encoder weights, specifically resnet34 or mobilenet_v2, from that library. These encoders are then integrated into a standard U-Net, turning it into a segmentation model.

The current model uses a fine-tuned resnet34, and the mobilenet_v2 model weights have been lost.
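
If my understanding is correct, the construction would look roughly like the snippet below. This is only a minimal sketch on my side using segmentation_models.pytorch, not the repo's actual code; the argument values (classes, encoder_weights) are my assumptions.

import segmentation_models_pytorch as smp

# Sketch: build a U-Net whose encoder is a pretrained backbone.
# The values below are my guesses, not the repo's actual configuration.
model = smp.Unet(
    encoder_name="resnet34",      # or "mobilenet_v2"
    encoder_weights="imagenet",   # pretrained encoder weights
    in_channels=3,
    classes=2,                    # assumed: head vs. background
)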

My main focus is on optimizing inference time. To break it down, inference time comprises:

  • Preprocessing duration
  • Transfer time of the image to the GPU
  • Time taken for the model to process the image
  • Time to transfer results back to the CPU
  • Postprocessing duration

My primary interest lies in the third point, although I've also looked into the others for a comprehensive understanding.
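
As a side note on measurement: CUDA kernels launch asynchronously, so a fair timing of the third point likely needs a warm-up run and explicit synchronization. Below is a minimal sketch of what I have in mind (the helper is mine, not part of the repo):

import torch
from time import time

def time_forward(model, batch, device, warmup=3, iters=10):
    # Hypothetical helper: times only the forward pass, excluding
    # preprocessing and host<->device transfers.
    model = model.to(device).eval()
    batch = batch.to(device)
    with torch.no_grad():
        for _ in range(warmup):          # warm-up to exclude one-time CUDA init costs
            model(batch)
        if device.type == "cuda":
            torch.cuda.synchronize()     # make sure queued kernels are done before starting the clock
        start = time()
        for _ in range(iters):
            model(batch)
        if device.type == "cuda":
            torch.cuda.synchronize()     # wait for the last kernel before stopping the clock
    return (time() - start) / iters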

------ Let's start with the currently available pipeline ------

Here is my code to check inference time:

from time import time
import cv2
import torch
import head_segmentation.segmentation_pipeline as seg_pipeline
from prettytable import PrettyTable


print("----Loading Test images----")
#img path for one of the original celebA images (1024x1024)
image_path= "/home/enes/lab/head-segmentation/processed_dataset/test/images/1000.jpg"

image = cv2.imread(str(image_path), cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
print("test_img shape", image.shape)

image_512 = cv2.resize(image, (512, 512), interpolation=cv2.INTER_AREA)
image_256 = cv2.resize(image, (256, 256), interpolation=cv2.INTER_AREA)
print("resized_test_img (512,512) shape", image_512.shape)
print("resized_test_img (256,256) shape", image_256.shape)

print("----    ----")
print("  ")

print("----Check if GPU is available----")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("device:",device)
print("----    ----")
print("  ")


segmentation_pipeline = seg_pipeline.HumanHeadSegmentationPipeline()
segmentation_pipeline_GPU = seg_pipeline.HumanHeadSegmentationPipeline(device=device)

t0=time()
predicted_segmap = segmentation_pipeline.predict(image)
t1=time()
predicted_segmap = segmentation_pipeline.predict(image_512)
t2=time()
predicted_segmap = segmentation_pipeline.predict(image_256)
t3=time()
predicted_segmap = segmentation_pipeline_GPU.predict(image)
t4=time()
predicted_segmap = segmentation_pipeline_GPU.predict(image_512)
t5=time()
predicted_segmap = segmentation_pipeline_GPU.predict(image_256)
t6=time()



myTable = PrettyTable(["Image Size", "CPU", "GPU", ])

myTable.add_row(["1024", str(round(t1-t0,2))+" sec", str(round(t4-t3,2))+" sec"])
myTable.add_row(["512", str(round(t2-t1,2))+" sec", str(round(t5-t4,2))+" sec"])
myTable.add_row(["256", str(round(t3-t2,2))+" sec", str(round(t6-t5,2))+" sec"])

print(myTable)

And here are the outputs:

[Screenshots: measured CPU and GPU inference times]

Question1: Why do you think the GPU performance drops significantly for images with a resolution of 1024x1024? Could it be because the model was originally trained on 512x512 images, making larger inputs less efficient for the GPU to process?

Question2: Another intriguing observation is the near-stagnant inference time on the CPU, regardless of the considerable reduction in image size. Going from a 1024x1024 image to a 256x256 one is a 16-fold decrease in pixel count (1024² / 256² = 16), yet the inference time improves by a mere 0.03 seconds.

One of my objectives is to develop a swift CPU-only version for head segmentation. Hence, these results took me by surprise.

As an initial step, I aimed to replicate the inference times above to make sure I'm not overlooking anything crucial. For this, I trained the network with the resnet34 architecture, limiting it to just 3 epochs. The image size specified in the config yaml file remained unchanged at 512x512. After training, I loaded the latest checkpoint and reran the experiment described above. Below is the relevant code:


from time import time
import cv2
import torch
import head_segmentation.segmentation_pipeline as seg_pipeline
from prettytable import PrettyTable
import numpy as np

class CustomHeadSegmentationPipeline(seg_pipeline.HumanHeadSegmentationPipeline):
    def predict(self, image: np.ndarray, name) -> np.ndarray:
        t0=time()
        preprocessed_image = self._preprocess_image(image)
        t1 = time()
        preprocessed_image = preprocessed_image.to(self.device)
        t2 = time()
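        # Note: CUDA kernels launch asynchronously, so without a torch.cuda.synchronize()
        # before taking t3, part of the GPU compute time can end up attributed to the
        # .cpu() transfer (t4 - t3) instead of the forward pass (t3 - t2).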
        mdl_out = self._model(preprocessed_image)
        t3 = time()
        mdl_out = mdl_out.cpu()
        t4 = time()
        pred_segmap = self._postprocess_model_output(mdl_out, original_image=image)
        t5= time()

        print(" ")
        print("Test details for :", name)
        print(" ")

        print("preprocessing",round(t1-t0,3))
        print("to cpu/gpu",round(t2-t1,3))
        print("model output",round(t3-t2,3))
        print("to cpu",round(t4-t3,3))
        print("postprocess",round(t5-t4,3))
        print("total",round(t5-t0,3))
        print("-------------")

        return pred_segmap


print("----Loading Test images----")
#img path for one of the original celebA images (1024x1024)
image_path= "/home/enes/lab/head-segmentation/processed_dataset/test/images/1000.jpg"

image = cv2.imread(str(image_path), cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
print("test_img shape", image.shape)

image_512 = cv2.resize(image, (512, 512), interpolation=cv2.INTER_AREA)
image_256 = cv2.resize(image, (256, 256), interpolation=cv2.INTER_AREA)
print("resized_test_img (512,512) shape", image_512.shape)
print("resized_test_img (256,256) shape", image_256.shape)

print("----    ----")
print("  ")

print("----Check if GPU is available----")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("device:",device)
print("----    ----")
print("  ")


model_path_mobilenet_v2= "/home/enes/lab/head-segmentation/training_runs/2023-10-22/00-16/models/last.ckpt"
model_path_resnet34= "/home/enes/lab/head-segmentation/training_runs/2023-10-22/21-22/models/last.ckpt"

model_path=model_path_resnet34



segmentation_pipeline = CustomHeadSegmentationPipeline(model_path=model_path)
segmentation_pipeline_GPU = CustomHeadSegmentationPipeline(device=device, model_path=model_path)

t0=time()
name="1024 + CPU"
predicted_segmap = segmentation_pipeline.predict(image, name)
t1=time()
name="512 + CPU"
predicted_segmap = segmentation_pipeline.predict(image_512, name)
t2=time()
name="216 + CPU"
predicted_segmap = segmentation_pipeline.predict(image_256,name)
t3=time()
name="1024 + GPU"
predicted_segmap = segmentation_pipeline_GPU.predict(image,name)
t4=time()
name="512 + GPU"
predicted_segmap = segmentation_pipeline_GPU.predict(image_512, name)
t5=time()
name="256 + GPU"
predicted_segmap = segmentation_pipeline_GPU.predict(image_256, name)
t6=time()


print("Inference times for resnet34 --pretrained --depth=3 : ")
myTable = PrettyTable(["Image Size", "CPU", "GPU", ])

myTable.add_row(["1024", str(round(t1-t0,2))+" sec", str(round(t4-t3,2))+" sec"])
myTable.add_row(["512", str(round(t2-t1,2))+" sec", str(round(t5-t4,2))+" sec"])
myTable.add_row(["256", str(round(t3-t2,2))+" sec", str(round(t6-t5,2))+" sec"])

print(myTable)

And here are the results for an n1-standard CPU + NVIDIA T4 VM:

[Screenshots: per-stage timing breakdown and measured inference times]

(These results show that the time bottleneck is indeed the model-output part of the process. Total times are slightly different because I obtained the detailed breakdown while running the test on a better CPU, to check whether there would be a big difference.)

Question3: In terms of CPU-based inference time, although I am using the same machine, there is roughly a 4x difference between the model I trained myself and the one installed via pip from the current repo. Can you point out what might differ between the current pipeline's model and the one I trained?
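
For reference, this is the kind of quick sanity check I was thinking of running to compare the two checkpoints. It is only a sketch, assuming both files are standard PyTorch Lightning .ckpt checkpoints; the helper name is mine.

import torch

def describe_checkpoint(ckpt_path):
    # Load the checkpoint on CPU and report its parameter count; a large
    # difference would point to different architectures (e.g. encoder depth).
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)  # Lightning stores weights under "state_dict"
    n_params = sum(t.numel() for t in state_dict.values())
    print(f"{ckpt_path}: {len(state_dict)} tensors, {n_params / 1e6:.1f}M parameters")

describe_checkpoint("/home/enes/lab/head-segmentation/training_runs/2023-10-22/21-22/models/last.ckpt")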

karaposu, Oct 23 '23 10:10