
latency in set_image function

Open mhyeonsoo opened this issue 1 year ago • 4 comments

Hi,

Thanks for the great work; the results are impressive. When I ran the model, I noticed that it takes a long time in the set_image() function.

I am wondering whether this is caused by the transformation functions in the ImageEncoderViT module. If so, is it expected to take a relatively long time?

For me, the elapsed time for each task was:

DINO model loading: 0.829
ViT encoder process: 17.912 
SAM model prediction: 0.192
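
(A minimal sketch of how such timings can be taken; the checkpoint path and the dummy image below are placeholders, and torch.cuda.synchronize() makes sure the GPU has actually finished before the clock is read.)

import time

import numpy as np
import torch
from segment_anything import SamPredictor, sam_model_registry

# Illustrative setup; the checkpoint path and image are placeholders.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)
image = np.zeros((1080, 1920, 3), dtype=np.uint8)  # stand-in for a real RGB image

torch.cuda.synchronize()    # let model loading / warm-up finish first
start = time.time()
predictor.set_image(image)  # the heavy ViT image encoder runs here
torch.cuda.synchronize()    # wait for the encoder kernels to complete
print(f"set_image: {time.time() - start:.3f} s")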

Thanks again,

mhyeonsoo avatar Apr 12 '23 02:04 mhyeonsoo

What type of GPU are you using?

nikhilaravi avatar Apr 12 '23 07:04 nikhilaravi

Thanks for the excellent project. I have the same problem: the set_image function takes about 17 s.

OS: Ubuntu 20.04
GPU: Tesla T4 16 GB
model type: vit_h
CUDA version: 11.4

Thanks again.

liuzz07 avatar Apr 14 '23 06:04 liuzz07

Thanks for the excellent project. I have a question about the GPU memory usage of the set_image function. When I load the model to the GPU with sam.to(device=device), it occupies 3403 MiB using the vit_h model. But when I execute set_image, the GPU memory increases to 7573 MiB. I am not sure why the image embeddings take up so much GPU memory.
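
One way to narrow this down is to compare PyTorch's own memory counters before and after set_image; a rough sketch, assuming a SamPredictor named predictor wrapping the loaded sam and an RGB image named image. Note that nvidia-smi shows the CUDA context plus everything the caching allocator has reserved, and most of the growth comes from the encoder's intermediate activations staying cached by the allocator, not from the stored embedding itself:

import torch

def report(tag):
    # memory_allocated: bytes held by live tensors;
    # memory_reserved: bytes held by the caching allocator (closer to nvidia-smi)
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**20:.0f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**20:.0f} MiB")

report("after sam.to(device)")
predictor.set_image(image)  # encoder forward pass allocates large activations
report("after set_image")

# The cached embedding itself is small: (1, 256, 64, 64) float32 is about 4 MiB.
feat = predictor.features
print(feat.shape, feat.element_size() * feat.nelement() / 2**20, "MiB")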

Thanks for any help.

obitoquilt avatar Apr 14 '23 08:04 obitoquilt

Same question for me. My setup is Ubuntu 22.04 with an RTX 3090. My code is:

import time

import numpy as np

# input_point and img are defined earlier in the script.
s4 = time.time()
input_label = np.array([1] * len(input_point), dtype=np.float32)  # 1 = foreground click
predictor.set_image(img)  # runs the heavy image encoder
print("set image time:", 1000 * (time.time() - s4))

s5 = time.time()
masks, scores, logits = predictor.predict(  # lightweight prompt decoding
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,
)
print("segment time:", 1000 * (time.time() - s5))

The output is:

set image time: 93.09029579162598
segment time: 15.57469367980957

This is using sam_vit_b_01ec64.pth.
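
(One caveat, hedged: CUDA kernels launch asynchronously, so time.time() can stop the clock before the GPU has finished. predict() synchronizes implicitly when it copies the masks back to the CPU, but set_image() does not, so some encoder time may leak into the "segment time". A stricter variant:)

import time

import torch

torch.cuda.synchronize()
s4 = time.time()
predictor.set_image(img)
torch.cuda.synchronize()  # wait until the encoder has actually finished
print("set image time:", 1000 * (time.time() - s4))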

KAWAKO-in-GAYHUB avatar Apr 17 '23 04:04 KAWAKO-in-GAYHUB

@nikhilaravi, I am using an NVIDIA 3080 Ti.

mhyeonsoo avatar Apr 18 '23 01:04 mhyeonsoo

@nikhilaravi, I am using an NVIDIA 3080 Ti.

How did you achieve a ViT encoder time of 17.912 ms on the 3080 Ti? Inference on my 3090 takes 93.09 ms 😭 My CPU is an i7-9700KF.

KAWAKO-in-GAYHUB avatar Apr 18 '23 01:04 KAWAKO-in-GAYHUB

@KAWAKO-in-GAYHUB Hi, did you run on the GPU? It seems like your code may be running on the CPU rather than the GPU. Check with nvidia-smi while it is running.

mhyeonsoo avatar Apr 18 '23 01:04 mhyeonsoo

predictor.set_image(image) is the step that generates the image embedding.

To use the ONNX model, the image must first be pre-processed using the SAM image encoder

This is the main cost of the model, because the backbone is big. After that, we can predict much faster with our input points, boxes, and masks. P.S.: what is actually converted to ONNX is only the last few layers of sam_vit_h_4b8939.pth (the prompt encoder and mask decoder).
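
In other words, the expensive embedding can be computed once and then reused across many prompts. A minimal sketch (predictor and image are assumed to be set up as in the snippets above; the click points are made up):

import numpy as np

predictor.set_image(image)  # heavy: runs the ViT backbone once

# Each prompt afterwards only runs the lightweight prompt encoder + mask decoder.
for x, y in [(100, 200), (300, 150), (512, 512)]:
    masks, scores, logits = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),  # 1 = foreground click
        multimask_output=True,
    )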

hungtooc avatar Apr 18 '23 01:04 hungtooc

@hungtooc Thanks for your comment. For me, the other steps, including pre-processing and transformation, do not take very long; most of the time is spent on this line:

self.features = self.model.image_encoder(input_image)

I would also like to look over the ONNX conversion part. Thanks!
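
For reference, the repository ships an export script that converts only the lightweight decoder side to ONNX (the image encoder still runs in PyTorch); if I read the README correctly, the invocation is roughly:

python scripts/export_onnx_model.py --checkpoint <path/to/checkpoint> --model-type vit_h --output <path/to/output.onnx>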

mhyeonsoo avatar Apr 18 '23 02:04 mhyeonsoo

@KAWAKO-in-GAYHUB Hi, did you run on the GPU? It seems like your code may be running on the CPU rather than the GPU. Check with nvidia-smi while it is running.

Thank you for your reply!

I'm sure my code is running on the GPU; I wrote a demo in a Jupyter notebook to verify it.

Averaging 1000 runs of predictor.set_image(image) followed by predictor.predict(...) gives 117.42 ms. Running predictor.set_image(image) alone 1000 times gives an average of 106.37 ms.

I don't know what's wrong with my code.

KAWAKO-in-GAYHUB avatar Apr 18 '23 02:04 KAWAKO-in-GAYHUB

@KAWAKO-in-GAYHUB That may be correct. My times above are in seconds, not milliseconds. Sorry, I didn't mention that.

mhyeonsoo avatar Apr 18 '23 02:04 mhyeonsoo

@hungtooc Thanks for your comment. For me, the other steps, including pre-processing and transformation, do not take very long; most of the time is spent on this line:

self.features = self.model.image_encoder(input_image)

That's normal; you can see it mentioned in the paper:

A heavyweight image encoder outputs an image embedding.

hungtooc avatar Apr 18 '23 02:04 hungtooc

@hungtooc Got it. I missed that sentence! I will read it in more detail. Thanks for answering :)

mhyeonsoo avatar Apr 18 '23 02:04 mhyeonsoo

@KAWAKO-in-GAYHUB That may be correct. My times above are in seconds, not milliseconds. Sorry, I didn't mention that.

There shouldn't be such a big gap between a 3080 Ti and a 3090 running the same code (17 s vs. 90 ms). I don't know what your code looks like.

KAWAKO-in-GAYHUB avatar Apr 18 '23 02:04 KAWAKO-in-GAYHUB

@hungtooc Thanks for your comment. For me, the other steps, including pre-processing and transformation, do not take very long; most of the time is spent on this line:

self.features = self.model.image_encoder(input_image)

That's normal; you can see it mentioned in the paper:

A heavyweight image encoder outputs an image embedding.

So my understanding is that the intended scenario is running multiple inferences on a single image, rather than real-time inference on video frames. Right?

KAWAKO-in-GAYHUB avatar Apr 18 '23 02:04 KAWAKO-in-GAYHUB

@KAWAKO-in-GAYHUB I'm not sure; maybe this comment could help you: https://github.com/facebookresearch/segment-anything/issues/107#issuecomment-1500909850

hungtooc avatar Apr 18 '23 02:04 hungtooc

@KAWAKO-in-GAYHUB I'm not sure; maybe this comment could help you: #107 (comment)

Awesome, thank you!

KAWAKO-in-GAYHUB avatar Apr 18 '23 02:04 KAWAKO-in-GAYHUB

Hi, we have proposed a method for rapid 'segment anything', trained using just 2% of the SA-1B dataset. It achieves precision comparable to SAM on edge detection (AP: 0.794 vs. 0.793) and proposal generation (mask AR@1000: 49.7 vs. 51.8 for SAM-H E32). Additionally, our model is 50 times faster than SAM-H E32. The model is very simple, primarily adopting the YOLOv8-seg structure. We welcome everyone to try it out. GitHub: https://github.com/CASIA-IVA-Lab/FastSAM, arXiv: https://arxiv.org/pdf/2306.12156.pdf

berry-ding avatar Jun 22 '23 06:06 berry-ding