
Some Questions regarding inference time and current setup

Open karaposu opened this issue 2 years ago • 8 comments

Hello @wiktorlazarski ,

A couple of days ago I finished the installation and ran the repo on a Linux VM with GPU support. I have been inspecting the code for a while, and I want to say I am learning a lot just by reading it. It is so good that I want my personal project to have a similarly clean and configurable structure. Thanks again for creating this work.

Having said that, I do have some questions, and your insights would be highly appreciated.

Before I delve into them, let me give you a brief overview of my understanding of how the head-segmentation repo operates, and kindly rectify any inaccuracies.

For the model architectures, this repo depends on the segmentation_models.pytorch repo (https://github.com/qubvel/segmentation_models.pytorch). It sources pretrained encoder weights, specifically resnet34 or mobilenet_v2, from that library. These encoders are then combined with a standard UNet decoder, turning it into a segmentation model.
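For illustration, here is a minimal sketch of how such a model is typically assembled with segmentation_models.pytorch (my own example, not the repo's exact code; the class count is an assumption):

import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",     # or "mobilenet_v2"
    encoder_weights="imagenet",  # pretrained encoder weights from the library
    in_channels=3,               # RGB input
    classes=1,                   # assumption: a single head/background mask channel
)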

The current model uses a fine-tuned resnet34, and the mobilenet_v2 model weights have been lost.

My main focus is on optimizing inference time. To break it down, inference time comprises:

  • Preprocessing duration
  • Transfer time of the image to the GPU
  • Time taken for the model to process the image
  • Time to transfer results back to the CPU
  • Postprocessing duration

My primary interest lies in the third point, although I've also looked into the others for a comprehensive understanding.

------Let's start with the currently available pipeline-------

Here is my code to check inference time:

from time import time
import cv2
import torch
import head_segmentation.segmentation_pipeline as seg_pipeline
from prettytable import PrettyTable


print("----Loading Test images----")
# path to one of the original CelebA images (1024x1024)
image_path= "/home/enes/lab/head-segmentation/processed_dataset/test/images/1000.jpg"

image = cv2.imread(str(image_path), cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
print("test_img shape", image.shape)

image_512 = cv2.resize(image, (512, 512), interpolation=cv2.INTER_AREA)
image_256 = cv2.resize(image, (256, 256), interpolation=cv2.INTER_AREA)
print("resized_test_img (512,512) shape", image_512.shape)
print("resized_test_img (256,256) shape", image_256.shape)

print("----    ----")
print("  ")

print("----Check if GPU is available----")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("device:",device)
print("----    ----")
print("  ")


segmentation_pipeline = seg_pipeline.HumanHeadSegmentationPipeline()
segmentation_pipeline_GPU = seg_pipeline.HumanHeadSegmentationPipeline(device=device)

t0=time()
predicted_segmap = segmentation_pipeline.predict(image)
t1=time()
predicted_segmap = segmentation_pipeline.predict(image_512)
t2=time()
predicted_segmap = segmentation_pipeline.predict(image_256)
t3=time()
predicted_segmap = segmentation_pipeline_GPU.predict(image)
t4=time()
predicted_segmap = segmentation_pipeline_GPU.predict(image_512)
t5=time()
predicted_segmap = segmentation_pipeline_GPU.predict(image_256)
t6=time()



myTable = PrettyTable(["Image Size", "CPU", "GPU", ])

myTable.add_row(["1024", str(round(t1-t0,2))+" sec", str(round(t4-t3,2))+" sec"])
myTable.add_row(["512", str(round(t2-t1,2))+" sec", str(round(t5-t4,2))+" sec"])
myTable.add_row(["256", str(round(t3-t2,2))+" sec", str(round(t6-t5,2))+" sec"])

print(myTable)

And here are the outputs:

[Screenshots: inference-time tables for CPU and GPU]

Question1: Why do you think the GPU performance drops significantly for images with a resolution of 1024x1024? Could it be due to the fact that the model was originally trained on 512x512 images, making it inefficient for the GPU to optimize larger images?

Question2: Another intriguing observation is the near-stagnant inference time on the CPU, regardless of the considerable reduction in image size. Transitioning from a 1024-sized image to a 256-sized one represents a 16-fold decrease in the input data volume. Yet, the inference time improvement is a mere 0.03 seconds.

One of my objectives is to develop a swift CPU-only version for head segmentation. Hence, these results took me by surprise.
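As a side note, here is a generic sketch of CPU-inference knobs I plan to experiment with; this is standard PyTorch, not code from this repo, and the tiny model is just a runnable stand-in:

import torch
import torch.nn as nn

torch.set_num_threads(4)                       # match physical cores; oversubscription can hurt
model = nn.Conv2d(3, 1, kernel_size=3).eval()  # stand-in for the real segmentation model
x = torch.randn(1, 3, 512, 512)                # dummy 512x512 input
with torch.inference_mode():                   # skip autograd bookkeeping during inference
    out = model(x)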

As an initial step, I aimed to replicate the aforementioned inference times to ascertain I'm not overlooking any crucial aspects. For this, I trained the network employing the resnet34 architecture, limiting it to just 3 epochs. The image size, as specified in the config yaml file, remained unchanged at 512x512. Post-training, I loaded the latest checkpoint and retried the experiment described earlier. Below is the relevant code:


from time import time
import cv2
import torch
import head_segmentation.segmentation_pipeline as seg_pipeline
from prettytable import PrettyTable
import numpy as np

class CustomHeadSegmentationPipeline(seg_pipeline.HumanHeadSegmentationPipeline):
    def predict(self, image: np.ndarray, name: str) -> np.ndarray:
        t0=time()
        preprocessed_image = self._preprocess_image(image)
        t1 = time()
        preprocessed_image = preprocessed_image.to(self.device)
        t2 = time()
        mdl_out = self._model(preprocessed_image)
        t3 = time()
        mdl_out = mdl_out.cpu()
        t4 = time()
        pred_segmap = self._postprocess_model_output(mdl_out, original_image=image)
        t5= time()

        print(" ")
        print("Test details for :", name)
        print(" ")

        print("preprocessing",round(t1-t0,3))
        print("to cpu/gpu",round(t2-t1,3))
        print("model output",round(t3-t2,3))
        print("to cpu",round(t4-t3,3))
        print("postprocess",round(t5-t4,3))
        print("total",round(t5-t0,3))
        print("-------------")

        return pred_segmap


print("----Loading Test images----")
# path to one of the original CelebA images (1024x1024)
image_path= "/home/enes/lab/head-segmentation/processed_dataset/test/images/1000.jpg"

image = cv2.imread(str(image_path), cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
print("test_img shape", image.shape)

image_512 = cv2.resize(image, (512, 512), interpolation=cv2.INTER_AREA)
image_256 = cv2.resize(image, (256, 256), interpolation=cv2.INTER_AREA)
print("resized_test_img (512,512) shape", image_512.shape)
print("resized_test_img (256,256) shape", image_256.shape)

print("----    ----")
print("  ")

print("----Check if GPU is available----")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("device:",device)
print("----    ----")
print("  ")


model_path_mobilenet_v2= "/home/enes/lab/head-segmentation/training_runs/2023-10-22/00-16/models/last.ckpt"
model_path_resnet34= "/home/enes/lab/head-segmentation/training_runs/2023-10-22/21-22/models/last.ckpt"

model_path=model_path_resnet34



segmentation_pipeline = CustomHeadSegmentationPipeline(model_path=model_path)
segmentation_pipeline_GPU = CustomHeadSegmentationPipeline(device=device, model_path=model_path)

t0=time()
name="1024 + CPU"
predicted_segmap = segmentation_pipeline.predict(image, name)
t1=time()
name="512 + CPU"
predicted_segmap = segmentation_pipeline.predict(image_512, name)
t2=time()
name="216 + CPU"
predicted_segmap = segmentation_pipeline.predict(image_256,name)
t3=time()
name="1024 + GPU"
predicted_segmap = segmentation_pipeline_GPU.predict(image,name)
t4=time()
name="512 + GPU"
predicted_segmap = segmentation_pipeline_GPU.predict(image_512, name)
t5=time()
name="256 + GPU"
predicted_segmap = segmentation_pipeline_GPU.predict(image_256, name)
t6=time()


print("Inference times for resnet34 --pretrained --depth=3 : ")
myTable = PrettyTable(["Image Size", "CPU", "GPU", ])

myTable.add_row(["1024", str(round(t1-t0,2))+" sec", str(round(t4-t3,2))+" sec"])
myTable.add_row(["512", str(round(t2-t1,2))+" sec", str(round(t5-t4,2))+" sec"])
myTable.add_row(["256", str(round(t3-t2,2))+" sec", str(round(t6-t5,2))+" sec"])

print(myTable)

And here you see the results for an n1-standard CPU + NVIDIA T4 VM:

[Screenshots: per-stage timing printouts and the inference-time table]

(This image shows that the time bottleneck is indeed the model-output part of the process. Total times are slightly different because I obtained these detailed results while running the test on a better CPU, to check whether there would be a big difference.)

Question3: So, in terms of CPU-based inference time, although I am using the same machine, there is a 4x difference between the model I trained and the one installed via pip from the current repo. Can you point out what might differ between the current pipeline model and the one I trained?

karaposu avatar Oct 23 '23 10:10 karaposu

Hello @karaposu , I think it's not the image size that affects the inference time, but loading the model onto the GPU, which consumes more time. Can you change the image order and test again? Besides, whatever size you resize the image to, it will be resized to 512x512, which is the model's input size. You can find this in "head_segmentation/segmentation_pipeline.py".
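For example, something like this rough sketch (a dummy image and one warm-up call; torch.cuda.synchronize() makes sure queued GPU work has finished before reading the clock):

from time import time

import numpy as np
import torch
import head_segmentation.segmentation_pipeline as seg_pipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline = seg_pipeline.HumanHeadSegmentationPipeline(device=device)
image = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)  # dummy RGB image

_ = pipeline.predict(image)   # warm-up: CUDA context init, weight transfer, first kernels
if device.type == "cuda":
    torch.cuda.synchronize()  # wait for queued GPU work before timing

t0 = time()
_ = pipeline.predict(image)
if device.type == "cuda":
    torch.cuda.synchronize()  # make sure the GPU actually finished
t1 = time()
print(f"steady-state inference: {t1 - t0:.3f} s")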

9527-csroad avatar Oct 26 '23 09:10 9527-csroad

Hello @9527-csroad , I will look into what you said in more detail. But as you mentioned, model loading happens in head_segmentation/segmentation_pipeline.py, and here is how I measure inference time in my code:

segmentation_pipeline_GPU = CustomHeadSegmentationPipeline(device=device, model_path=model_path)
# first I initialize the segmentation pipeline

t4=time()
name="512 + GPU"
predicted_segmap = segmentation_pipeline_GPU.predict(image_512, name)
t5=time()

# and then I measure the inference time.

I first initialize segmentation_pipeline and only then measure the inference time.

Are you talking about something else when you mention loading the model to the GPU? (Maybe there is a torch.load(..., map_location="cuda") which I am not seeing?) But I would assume this would also happen during initialization and wouldn't affect my measurement.

And about the image size issue, why are we seeing such a huge inference-time improvement with the GPU? I will change the order and paste the results here, but I am not clear about what different results you expect. Do you expect the same outputs (the first image tried always being slow, regardless of its size)?
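(For reference, the standard PyTorch way to map a checkpoint onto a device at load time is map_location, roughly as below; the path is a placeholder and I don't know whether the repo does exactly this.)

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load("last.ckpt", map_location=device)  # tensors land on `device` at load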

karaposu avatar Oct 26 '23 13:10 karaposu

@9527-csroad I ran the test and I think you are right.

[Screenshots: results after changing the image order]

Can you also comment on question 3 in the first post? I suspected the input size of the currently available model was actually 256, but as you showed, this is not the case. So why does my trained version of the current model run 3 times slower on CPU?

Also, can I get your contact email/Discord?

karaposu avatar Oct 26 '23 17:10 karaposu

Hey, just fyi, I'm currently swamped with work. I'll try to carve out some time this weekend to answer all the questions.

Best regards, Wiktor

wiktorlazarski avatar Oct 26 '23 18:10 wiktorlazarski

Hello Wiktor, Still swamped?

karaposu avatar Nov 02 '23 06:11 karaposu

Hey guys,

Sorry for the late response, I'm currently traveling and don't have too much time for open-source duties 😅. Sorry for that. @karaposu, I can see that together with @9527-csroad you are on the right track to resolving this issue. I'd say you do you, and if you think something can be improved, please don't hesitate to make a PR with the improvements and an explanation of what changed and how it affects the repo.

Best regards, Wiktor

wiktorlazarski avatar Nov 03 '23 13:11 wiktorlazarski

@wiktorlazarski sure, Wiktor. Still, about question 3 I need some kind of guidance from you. Once I have an answer for that, I can close this issue.

Question3: So, in terms of CPU-based inference time, although I am using the same machine, there is a 4x difference between the model I trained and the one installed via pip from the current repo. Can you point out what might differ between the current pipeline model and the one I trained?

karaposu avatar Nov 05 '23 16:11 karaposu

[Screenshot] I just learned that the depth parameter controls how many times the encoder downsamples, and this might cause the difference I mentioned.
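If I understand correctly, this maps to encoder_depth in segmentation_models.pytorch, roughly as in the sketch below (my assumption; decoder_channels must match the depth). Fewer downsampling stages keep the decoder feature maps larger, which could explain slower CPU inference despite fewer layers:

import segmentation_models_pytorch as smp

# Library default: 5 downsampling stages.
default_model = smp.Unet("resnet34", encoder_weights="imagenet")

# Shallower: 3 stages; decoder_channels must then have exactly 3 entries.
shallow_model = smp.Unet(
    "resnet34",
    encoder_weights="imagenet",
    encoder_depth=3,
    decoder_channels=(128, 64, 32),
)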

karaposu avatar Nov 17 '23 17:11 karaposu