Some Questions regarding inference time and current setup
Hello @wiktorlazarski ,
A couple of days ago I finished the installation and ran the repo on a Linux VM with GPU support. I have been inspecting the code for a while, and I want to say I am learning a lot just by reading it. It is so good that I want my personal project to have a very similar clean and configurable structure. Thanks again for creating this work.
Having said that, I do have some questions, and your insights would be highly appreciated.
Before I delve into them, let me give you a brief overview of my understanding of how the head-segmentation repo operates, and kindly rectify any inaccuracies.
For the model architectures, this repo depends on the segmentation_models.pytorch library (https://github.com/qubvel/segmentation_models.pytorch). It sources pretrained encoder weights, specifically resnet34 or mobilenet_v2, from that library. These encoders are then combined with a standard U-Net decoder to form the segmentation model.
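In code terms, my mental model of this is roughly the sketch below. This is my own reconstruction based on segmentation_models.pytorch's public API, not the repo's actual wiring; `classes=2` and `encoder_depth=5` are values I assumed, not taken from the repo.

```python
import segmentation_models_pytorch as smp

# My rough reconstruction of the setup: a U-Net whose encoder
# (resnet34 or mobilenet_v2) comes pretrained from
# segmentation_models.pytorch. classes=2 and encoder_depth=5 are my
# assumptions, not values taken from the repo.
model = smp.Unet(
    encoder_name="resnet34",      # or "mobilenet_v2"
    encoder_weights="imagenet",   # pretrained encoder weights
    encoder_depth=5,              # my own training run used --depth=3
    in_channels=3,
    classes=2,                    # background + head (assumed)
)

# The pretrained encoder alone can also be pulled out directly:
encoder = smp.encoders.get_encoder(
    "resnet34", in_channels=3, depth=5, weights="imagenet"
)
```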
The currently shipped model uses a fine-tuned resnet34 encoder, and the mobilenet_v2 model weights are lost.
My main focus is on optimizing inference time. To break it down, inference time comprises:
- Preprocessing duration
- Transfer time of the image to the GPU
- Time taken for the model to process the image
- Time to transfer results back to the CPU
- Postprocessing duration
My primary interest lies in the third point, although I've also looked into the others for a comprehensive understanding.
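Side note: I know that the first call on the GPU also pays for CUDA context creation and cuDNN setup, and that CUDA kernels run asynchronously, so single-shot `time()` measurements can be misleading. A warmed-up, synchronized measurement would look roughly like the sketch below (the `timed_predict` helper is purely my own illustration, not part of the repo):

```python
import time

import torch


def timed_predict(pipeline, image, device, n_warmup=3, n_runs=10):
    """Average wall-clock time of pipeline.predict() with warm-up and
    CUDA synchronization (my own helper, not part of head-segmentation)."""
    # Warm-up: the first GPU call includes CUDA/cuDNN initialization.
    for _ in range(n_warmup):
        pipeline.predict(image)

    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipeline.predict(image)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs
```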
------ Let's start with the currently available pipeline ------
Here is my code to check inference time:

```python
from time import time

import cv2
import torch
from prettytable import PrettyTable

import head_segmentation.segmentation_pipeline as seg_pipeline

print("----Loading Test images----")
# Path to one of the original CelebA images (1024x1024)
image_path = "/home/enes/lab/head-segmentation/processed_dataset/test/images/1000.jpg"
image = cv2.imread(str(image_path), cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
print("test_img shape", image.shape)

image_512 = cv2.resize(image, (512, 512), interpolation=cv2.INTER_AREA)
image_256 = cv2.resize(image, (256, 256), interpolation=cv2.INTER_AREA)
print("resized_test_img (512,512) shape", image_512.shape)
print("resized_test_img (256,256) shape", image_256.shape)
print("---- ----")
print(" ")

print("----Check if GPU is available----")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)
print("---- ----")
print(" ")

segmentation_pipeline = seg_pipeline.HumanHeadSegmentationPipeline()
segmentation_pipeline_GPU = seg_pipeline.HumanHeadSegmentationPipeline(device=device)

t0 = time()
predicted_segmap = segmentation_pipeline.predict(image)
t1 = time()
predicted_segmap = segmentation_pipeline.predict(image_512)
t2 = time()
predicted_segmap = segmentation_pipeline.predict(image_256)
t3 = time()
predicted_segmap = segmentation_pipeline_GPU.predict(image)
t4 = time()
predicted_segmap = segmentation_pipeline_GPU.predict(image_512)
t5 = time()
predicted_segmap = segmentation_pipeline_GPU.predict(image_256)
t6 = time()

myTable = PrettyTable(["Image Size", "CPU", "GPU"])
myTable.add_row(["1024", str(round(t1 - t0, 2)) + " sec", str(round(t4 - t3, 2)) + " sec"])
myTable.add_row(["512", str(round(t2 - t1, 2)) + " sec", str(round(t5 - t4, 2)) + " sec"])
myTable.add_row(["256", str(round(t3 - t2, 2)) + " sec", str(round(t6 - t5, 2)) + " sec"])
print(myTable)
```
And here are the outputs:
Question1: Why do you think the GPU performance drops significantly for 1024x1024 images? Could it be because the model was originally trained on 512x512 images, making larger inputs less efficient for the GPU to process?
Question2: Another intriguing observation is the near-stagnant inference time on the CPU, regardless of the considerable reduction in image size. Going from a 1024x1024 image to a 256x256 one is a 16-fold decrease in the number of input pixels, yet the inference time improves by a mere 0.03 seconds.
One of my objectives is to develop a swift CPU-only version for head segmentation. Hence, these results took me by surprise.
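One hypothesis I want to check (just my guess, I have not confirmed it in the code) is that `_preprocess_image` resizes every input to the training resolution before the forward pass, in which case the model would always see the same tensor size and the input resolution would barely matter. A quick check along these lines:

```python
# Sketch: verify whether all inputs end up with the same tensor shape
# after preprocessing (my assumption, not confirmed in the repo).
for label, img in [("1024", image), ("512", image_512), ("256", image_256)]:
    tensor = segmentation_pipeline._preprocess_image(img)
    print(label, "-> preprocessed tensor shape:", tuple(tensor.shape))
```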
As an initial step, I aimed to reproduce the inference times above to make sure I am not overlooking any crucial aspects. For this, I trained the network with the resnet34 architecture for just 3 epochs. The image size, as specified in the config yaml file, remained unchanged at 512x512. After training, I loaded the latest checkpoint and repeated the experiment described earlier. Below is the relevant code:
```python
from time import time

import cv2
import numpy as np
import torch
from prettytable import PrettyTable

import head_segmentation.segmentation_pipeline as seg_pipeline


class CustomHeadSegmentationPipeline(seg_pipeline.HumanHeadSegmentationPipeline):
    """Same pipeline, but predict() prints a per-stage timing breakdown."""

    def predict(self, image: np.ndarray, name) -> np.ndarray:
        t0 = time()
        preprocessed_image = self._preprocess_image(image)
        t1 = time()
        preprocessed_image = preprocessed_image.to(self.device)
        t2 = time()
        mdl_out = self._model(preprocessed_image)
        t3 = time()
        mdl_out = mdl_out.cpu()
        t4 = time()
        pred_segmap = self._postprocess_model_output(mdl_out, original_image=image)
        t5 = time()

        print(" ")
        print("Test details for:", name)
        print(" ")
        print("preprocessing", round(t1 - t0, 3))
        print("to cpu/gpu", round(t2 - t1, 3))
        print("model output", round(t3 - t2, 3))
        print("to cpu", round(t4 - t3, 3))
        print("postprocess", round(t5 - t4, 3))
        print("total", round(t5 - t0, 3))
        print("-------------")
        return pred_segmap


print("----Loading Test images----")
# Path to one of the original CelebA images (1024x1024)
image_path = "/home/enes/lab/head-segmentation/processed_dataset/test/images/1000.jpg"
image = cv2.imread(str(image_path), cv2.IMREAD_COLOR)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
print("test_img shape", image.shape)

image_512 = cv2.resize(image, (512, 512), interpolation=cv2.INTER_AREA)
image_256 = cv2.resize(image, (256, 256), interpolation=cv2.INTER_AREA)
print("resized_test_img (512,512) shape", image_512.shape)
print("resized_test_img (256,256) shape", image_256.shape)
print("---- ----")
print(" ")

print("----Check if GPU is available----")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device:", device)
print("---- ----")
print(" ")

model_path_mobilenet_v2 = "/home/enes/lab/head-segmentation/training_runs/2023-10-22/00-16/models/last.ckpt"
model_path_resnet34 = "/home/enes/lab/head-segmentation/training_runs/2023-10-22/21-22/models/last.ckpt"
model_path = model_path_resnet34

segmentation_pipeline = CustomHeadSegmentationPipeline(model_path=model_path)
segmentation_pipeline_GPU = CustomHeadSegmentationPipeline(device=device, model_path=model_path)

t0 = time()
predicted_segmap = segmentation_pipeline.predict(image, "1024 + CPU")
t1 = time()
predicted_segmap = segmentation_pipeline.predict(image_512, "512 + CPU")
t2 = time()
predicted_segmap = segmentation_pipeline.predict(image_256, "256 + CPU")
t3 = time()
predicted_segmap = segmentation_pipeline_GPU.predict(image, "1024 + GPU")
t4 = time()
predicted_segmap = segmentation_pipeline_GPU.predict(image_512, "512 + GPU")
t5 = time()
predicted_segmap = segmentation_pipeline_GPU.predict(image_256, "256 + GPU")
t6 = time()

print("Inference times for resnet34 --pretrained --depth=3:")
myTable = PrettyTable(["Image Size", "CPU", "GPU"])
myTable.add_row(["1024", str(round(t1 - t0, 2)) + " sec", str(round(t4 - t3, 2)) + " sec"])
myTable.add_row(["512", str(round(t2 - t1, 2)) + " sec", str(round(t5 - t4, 2)) + " sec"])
myTable.add_row(["256", str(round(t3 - t2, 2)) + " sec", str(round(t6 - t5, 2)) + " sec"])
print(myTable)
```
And here you see the results for an n1-standard CPU + NVIDIA T4 VM:
(This image shows that the time bottleneck is indeed the model-output part of the process. The total times are slightly different because I obtained these detailed results while running the test with a better CPU, to check whether there would be a big difference.)
Question3: In terms of CPU-based inference time, even though I am using the same machine, there is a 4x difference between the model I trained myself and the one installed via pip from the current repo. Can you point out what might differ between the current pipeline's model and the one I trained?
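To help narrow this down, this is the kind of comparison I had in mind (the `describe` helper is just my own diagnostic sketch; it only relies on the `_model` attribute that the pipeline already uses internally, and `model_path_resnet34` comes from the snippet above):

```python
import head_segmentation.segmentation_pipeline as seg_pipeline


def describe(pipeline, label):
    """Print a rough size summary of the pipeline's underlying model."""
    model = pipeline._model
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{label}: {type(model).__name__}, {n_params / 1e6:.1f}M parameters")


# Pip-installed default weights vs. my own 3-epoch resnet34 --depth=3 checkpoint.
describe(seg_pipeline.HumanHeadSegmentationPipeline(), "pip-installed model")
describe(
    seg_pipeline.HumanHeadSegmentationPipeline(model_path=model_path_resnet34),
    "my 3-epoch resnet34 checkpoint",
)
```

If the two models differ in size or encoder depth, I suppose that alone could explain part of the gap; if they are identical, I will look elsewhere.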