
Some Questions regarding inference time and current setup

karaposu opened this issue 1 year ago · 8 comments

Hello @wiktorlazarski ,

A couple of days ago I finished installing and running the repo on a Linux VM with GPU support. I have been inspecting the code for a while, and I want to say I am learning a lot just by reading it. It is so good that I want my personal project to have a similarly clean and configurable structure. Thanks again for creating this work.

Having said that, I do have some questions, and your insights would be highly appreciated.

Before I delve into them, let me give you a brief overview of my understanding of how the head-segmentation repo operates, and kindly rectify any inaccuracies.

For the model architectures, this repo depends on the segmentation_models.pytorch repo (https://github.com/qubvel/segmentation_models.pytorch). It sources pretrained encoder weights, specifically resnet34 or mobilenet_v2, from that library. These encoders are then integrated into a standard U-Net, turning it into a segmentation model.

The current model uses a fine-tuned resnet34, and the mobilenet_v2 model weights have been lost.
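
If my understanding is correct, the construction would look roughly like the snippet below. This is only a minimal sketch on my side using segmentation_models.pytorch, not the repo's actual code; the argument values (classes, encoder_weights) are my assumptions.

import segmentation_models_pytorch as smp

# Sketch: build a U-Net whose encoder is a pretrained backbone.
# The values below are my guesses, not the repo's actual configuration.
model = smp.Unet(
    encoder_name="resnet34",      # or "mobilenet_v2"
    encoder_weights="imagenet",   # pretrained encoder weights
    in_channels=3,
    classes=2,                    # assumed: head vs. background
)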

My main focus is on optimizing inference time. To break it down, inference time comprises:

  • Preprocessing duration
  • Transfer time of the image to the GPU
  • Time taken for the model to process the image
  • Time to transfer results back to the CPU
  • Postprocessing duration

My primary interest lies in the third point, although I've also looked into the others for a comprehensive understanding.
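
As a side note on measurement: CUDA kernels launch asynchronously, so a fair timing of the third point likely needs a warm-up run and explicit synchronization. Below is a minimal sketch of what I have in mind (the helper is mine, not part of the repo):

import torch
from time import time

def time_forward(model, batch, device, warmup=3, iters=10):
    # Hypothetical helper: times only the forward pass, excluding
    # preprocessing and host<->device transfers.
    model = model.to(device).eval()
    batch = batch.to(device)
    with torch.no_grad():
        for _ in range(warmup):          # warm-up to exclude one-time CUDA init costs
            model(batch)
        if device.type == "cuda":
            torch.cuda.synchronize()     # make sure queued kernels are done before starting the clock
        start = time()
        for _ in range(iters):
            model(batch)
        if device.type == "cuda":
            torch.cuda.synchronize()     # wait for the last kernel before stopping the clock
    return (time() - start) / iters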

------ Let's start with the currently available pipeline ------

Here is my code to check inference time:

from time import time
import cv2
import torch
import head_segmentation.segmentation_pipeline as seg_pipeline
from prettytable import PrettyTable


print("----Loading Test images----")
#img path for one of the original celebA images (1024x1024)
image_path= "/home/enes/lab/head-segmentation/processed_dataset/test/images/1000.jpg"

image = cv2.imread(str(image_path), cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
print("test_img shape", image.shape)

image_512 = cv2.resize(image, (512, 512), interpolation=cv2.INTER_AREA)
image_256 = cv2.resize(image, (256, 256), interpolation=cv2.INTER_AREA)
print("resized_test_img (512,512) shape", image_512.shape)
print("resized_test_img (256,256) shape", image_256.shape)

print("----    ----")
print("  ")

print("----Check if GPU is available----")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("device:",device)
print("----    ----")
print("  ")


segmentation_pipeline = seg_pipeline.HumanHeadSegmentationPipeline()
segmentation_pipeline_GPU = seg_pipeline.HumanHeadSegmentationPipeline(device=device)

t0=time()
predicted_segmap = segmentation_pipeline.predict(image)
t1=time()
predicted_segmap = segmentation_pipeline.predict(image_512)
t2=time()
predicted_segmap = segmentation_pipeline.predict(image_256)
t3=time()
predicted_segmap = segmentation_pipeline_GPU.predict(image)
t4=time()
predicted_segmap = segmentation_pipeline_GPU.predict(image_512)
t5=time()
predicted_segmap = segmentation_pipeline_GPU.predict(image_256)
t6=time()



myTable = PrettyTable(["Image Size", "CPU", "GPU", ])

myTable.add_row(["1024", str(round(t1-t0,2))+" sec", str(round(t4-t3,2))+" sec"])
myTable.add_row(["512", str(round(t2-t1,2))+" sec", str(round(t5-t4,2))+" sec"])
myTable.add_row(["256", str(round(t3-t2,2))+" sec", str(round(t6-t5,2))+" sec"])

print(myTable)

And here are the outputs:

[Screenshots: measured CPU and GPU inference times]

Question1: Why do you think the GPU performance drops significantly for images with a resolution of 1024x1024? Could it be because the model was originally trained on 512x512 images, making larger inputs less efficient for the GPU to process?

Question2: Another intriguing observation is the near-stagnant inference time on the CPU, regardless of the considerable reduction in image size. Going from a 1024x1024 image to a 256x256 one is a 16-fold decrease in pixel count (1024² / 256² = 16), yet the inference time improves by a mere 0.03 seconds.

One of my objectives is to develop a swift CPU-only version for head segmentation. Hence, these results took me by surprise.

As an initial step, I aimed to replicate the inference times above to make sure I'm not overlooking anything crucial. For this, I trained the network with the resnet34 architecture, limiting it to just 3 epochs. The image size specified in the config yaml file remained unchanged at 512x512. After training, I loaded the latest checkpoint and reran the experiment described above. Below is the relevant code:


from time import time
import cv2
import torch
import head_segmentation.segmentation_pipeline as seg_pipeline
from prettytable import PrettyTable
import numpy as np

class CustomHeadSegmentationPipeline(seg_pipeline.HumanHeadSegmentationPipeline):
    def predict(self, image: np.ndarray, name) -> np.ndarray:
        t0=time()
        preprocessed_image = self._preprocess_image(image)
        t1 = time()
        preprocessed_image = preprocessed_image.to(self.device)
        t2 = time()
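        # Note: CUDA kernels launch asynchronously, so without a torch.cuda.synchronize()
        # before taking t3, part of the GPU compute time can end up attributed to the
        # .cpu() transfer (t4 - t3) instead of the forward pass (t3 - t2).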
        mdl_out = self._model(preprocessed_image)
        t3 = time()
        mdl_out = mdl_out.cpu()
        t4 = time()
        pred_segmap = self._postprocess_model_output(mdl_out, original_image=image)
        t5= time()

        print(" ")
        print("Test details for :", name)
        print(" ")

        print("preprocessing",round(t1-t0,3))
        print("to cpu/gpu",round(t2-t1,3))
        print("model output",round(t3-t2,3))
        print("to cpu",round(t4-t3,3))
        print("postprocess",round(t5-t4,3))
        print("total",round(t5-t0,3))
        print("-------------")

        return pred_segmap


print("----Loading Test images----")
#img path for one of the original celebA images (1024x1024)
image_path= "/home/enes/lab/head-segmentation/processed_dataset/test/images/1000.jpg"

image = cv2.imread(str(image_path), cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
print("test_img shape", image.shape)

image_512 = cv2.resize(image, (512, 512), interpolation=cv2.INTER_AREA)
image_256 = cv2.resize(image, (256, 256), interpolation=cv2.INTER_AREA)
print("resized_test_img (512,512) shape", image_512.shape)
print("resized_test_img (256,256) shape", image_256.shape)

print("----    ----")
print("  ")

print("----Check if GPU is available----")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("device:",device)
print("----    ----")
print("  ")


model_path_mobilenet_v2= "/home/enes/lab/head-segmentation/training_runs/2023-10-22/00-16/models/last.ckpt"
model_path_resnet34= "/home/enes/lab/head-segmentation/training_runs/2023-10-22/21-22/models/last.ckpt"

model_path=model_path_resnet34



segmentation_pipeline = CustomHeadSegmentationPipeline(model_path=model_path)
segmentation_pipeline_GPU = CustomHeadSegmentationPipeline(device=device, model_path=model_path)

t0=time()
name="1024 + CPU"
predicted_segmap = segmentation_pipeline.predict(image, name)
t1=time()
name="512 + CPU"
predicted_segmap = segmentation_pipeline.predict(image_512, name)
t2=time()
name="216 + CPU"
predicted_segmap = segmentation_pipeline.predict(image_256,name)
t3=time()
name="1024 + GPU"
predicted_segmap = segmentation_pipeline_GPU.predict(image,name)
t4=time()
name="512 + GPU"
predicted_segmap = segmentation_pipeline_GPU.predict(image_512, name)
t5=time()
name="256 + GPU"
predicted_segmap = segmentation_pipeline_GPU.predict(image_256, name)
t6=time()


print("Inference times for resnet34 --pretrained --depth=3 : ")
myTable = PrettyTable(["Image Size", "CPU", "GPU", ])

myTable.add_row(["1024", str(round(t1-t0,2))+" sec", str(round(t4-t3,2))+" sec"])
myTable.add_row(["512", str(round(t2-t1,2))+" sec", str(round(t5-t4,2))+" sec"])
myTable.add_row(["256", str(round(t3-t2,2))+" sec", str(round(t6-t5,2))+" sec"])

print(myTable)

And here are the results for an n1-standard CPU + NVIDIA T4 VM:

[Screenshots: per-stage timing breakdown and measured inference times]

(These results show that the time bottleneck is indeed the model-output part of the process. Total times are slightly different because I obtained the detailed breakdown while running the test on a better CPU, to check whether there would be a big difference.)

Question3: In terms of CPU-based inference time, although I am using the same machine, there is roughly a 4x difference between the model I trained myself and the one installed via pip from the current repo. Can you point out what might differ between the current pipeline's model and the one I trained?
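
For reference, this is the kind of quick sanity check I was thinking of running to compare the two checkpoints. It is only a sketch, assuming both files are standard PyTorch Lightning .ckpt checkpoints; the helper name is mine.

import torch

def describe_checkpoint(ckpt_path):
    # Load the checkpoint on CPU and report its parameter count; a large
    # difference would point to different architectures (e.g. encoder depth).
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)  # Lightning stores weights under "state_dict"
    n_params = sum(t.numel() for t in state_dict.values())
    print(f"{ckpt_path}: {len(state_dict)} tensors, {n_params / 1e6:.1f}M parameters")

describe_checkpoint("/home/enes/lab/head-segmentation/training_runs/2023-10-22/21-22/models/last.ckpt")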

karaposu, Oct 23 '23 10:10