latency in set_image function
Hi,
Thanks for the great work; the performance is impressive. When I implemented the code and ran the model, I noticed it takes a long time in the `set_image()` function.
I am wondering whether this is because of the transformation functions in the `ImageEncoderViT` module. If so, is it expected to take this long?
For me, the elapsed time for each step was:
- DINO model loading: 0.829
- ViT encoder process: 17.912
- SAM model prediction: 0.192
Thanks again,
What type of GPU are you using?
Thanks for the excellent project. I have the same problem: the `set_image` function takes about 17 s.
- OS: Ubuntu 20.04
- GPU: Tesla T4 16 GB
- model type: vit_h
- CUDA version: 11.4
Thanks again.
Thanks for the excellent project.
I have a question about the GPU memory used by the `set_image` function.
When I load the model to the GPU with `sam.to(device=device)`, it occupies 3403 MiB using the vit_h model.
But when I execute `set_image`, GPU memory usage increases to 7573 MiB.
I'm not sure why the image embedding takes up so much GPU memory.
Thanks for any help.
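For what it's worth, the stored embedding itself is small. A quick back-of-the-envelope check (assuming ViT-H's 256 × 64 × 64 float32 output embedding for a 1024 × 1024 input) suggests the jump is mostly the encoder's intermediate activations plus blocks the CUDA caching allocator keeps reserved, not the saved features:

```python
# Assumed shape: SAM's image encoder outputs a 256 x 64 x 64 float32
# embedding for a 1024 x 1024 input image.
embedding_bytes = 256 * 64 * 64 * 4          # float32 = 4 bytes/element
embedding_mib = embedding_bytes / 2**20
print(f"image embedding: {embedding_mib:.1f} MiB")

# The observed jump (7573 - 3403 MiB, about 4 GiB) is roughly a thousand
# times larger than the embedding, so it must be dominated by the ViT-H
# encoder's intermediate activations and cached allocator blocks.
observed_jump_mib = 7573 - 3403
print(f"observed jump: {observed_jump_mib} MiB")
```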
Same question for me.
My device: Ubuntu 22, RTX 3090.
My code is:

```python
import time
import numpy as np

# predictor, img, and input_point are defined earlier
# (a SamPredictor, the loaded image, and an (N, 2) array of prompt points)
s4 = time.time()
input_label = np.array([1] * len(input_point), dtype=np.float32)
predictor.set_image(img)
print("set image time:", 1000 * (time.time() - s4))

s5 = time.time()
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,
)
print("segment time:", 1000 * (time.time() - s5))
```

The output is:

```
set image time: 93.09029579162598
segment time: 15.57469367980957
```

using sam_vit_b_01ec64.pth
@nikhilaravi, I am using Nvidia 3080Ti
How did you achieve a ViT encoder time of 17.912 ms on the 3080 Ti? Inference on my 3090 takes 93.09 ms 😭 My CPU is an i7-9700KF.
@KAWAKO-in-GAYHUB Hi, did you run it on the GPU? It seems like your code is running on the CPU rather than the GPU. Check with `nvidia-smi` while it is running.
`predictor.set_image(image)` is the step that generates the image embedding.

> To use the ONNX model, the image must first be pre-processed using the SAM image encoder

This is the main bottleneck of the model, because the image encoder is a big backbone. After that, we can predict much faster with our input points, boxes, and masks.
P.S.: What we actually convert to ONNX is only the last few layers of sam_vit_h_4b8939.pth.
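The pattern described above (pay the heavy encoder once, then decode many prompts cheaply) can be sketched as follows. This is illustrative only; the functions are dummy stand-ins with a made-up delay, not the real `SamPredictor` API:

```python
import time

# Dummy stand-ins for SAM's two stages: set_image() runs the heavy image
# encoder once; predict() runs only the light prompt/mask decoder.

def heavy_image_encoder(image):
    time.sleep(0.05)            # stands in for the seconds-long ViT pass
    return "embedding"

def light_mask_decoder(embedding, point):
    return f"mask@{point}"      # stands in for the millisecond-scale decode

image = "some_image"
embedding = heavy_image_encoder(image)   # paid once per image

# Many prompts reuse the same embedding, so each extra prompt is cheap.
masks = [light_mask_decoder(embedding, p) for p in [(10, 20), (30, 40), (50, 60)]]
print(masks)
```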
@hungtooc Thanks for your comment. For me, the other steps, including pre-processing and transformation, do not take that long; most of the time is spent on this line:
`self.features = self.model.image_encoder(input_image)`
I would also like to look into the ONNX conversion part. Thanks!
> @KAWAKO-in-GAYHUB Hi, did you run it on the GPU? It seems like your code is running on the CPU rather than the GPU. Check with `nvidia-smi` while it is running.
Thank you for your reply!
I'm sure my code is running on the GPU; I wrote a demo in a Jupyter notebook to verify it.

Running `predictor.set_image(image)` followed by `predictor.predict(...)` 1000 times, I averaged 117.42 ms.
In addition, running `predictor.set_image(image)` alone 1000 times averages 106.37 ms.

I don't know what's wrong with my code.
@KAWAKO-in-GAYHUB That may be correct. My times above are in seconds, not ms. Sorry I didn't mention that.
> @hungtooc Thanks for your comment. For me, the other steps, including pre-processing and transformation, do not take that long; most of the time is spent on this line:
> `self.features = self.model.image_encoder(input_image)`

That's normal; you can see it mentioned in the paper:

> A heavyweight image encoder outputs an image embedding
@hungtooc Got it. I missed that sentence! I will read the details. Thanks for answering :)
> @KAWAKO-in-GAYHUB That may be correct. My times above are in seconds, not ms. Sorry I didn't mention that.

There shouldn't be such a big gap between a 3080 Ti and a 3090 running the same code (17 s vs. 90 ms). I don't know what your code looks like.
> @hungtooc Thanks for your comment. For me, the other steps, including pre-processing and transformation, do not take that long; most of the time is spent on this line:
> `self.features = self.model.image_encoder(input_image)`
>
> That's normal; you can see it mentioned in the paper:
>
> > A heavyweight image encoder outputs an image embedding

So my understanding is that its application scenario is better suited to multiple inferences on one picture, rather than real-time inference on video frames. Right?
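To put rough numbers on that intuition, here is a back-of-the-envelope sketch. The timings are assumed from figures quoted earlier in this thread (ViT-B on an RTX 3090), not fresh measurements:

```python
# Assumed per-call costs from the timings above.
encode_s = 0.100   # set_image (heavy image encoder), ~100 ms
decode_s = 0.015   # predict (light mask decoder), ~15 ms

# Video: every frame pays the encoder, so throughput is encoder-bound.
video_fps = 1 / (encode_s + decode_s)

# Interactive use: one image, many prompts, so the encoder cost amortizes.
n_prompts = 100
per_prompt_s = (encode_s + n_prompts * decode_s) / n_prompts

print(f"video: ~{video_fps:.1f} fps")
print(f"100 prompts on one image: ~{per_prompt_s * 1000:.0f} ms per prompt")
```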
@KAWAKO-in-GAYHUB I'm not sure; maybe he could help you: https://github.com/facebookresearch/segment-anything/issues/107#issuecomment-1500909850
Hi, we have proposed a method for rapid 'segment anything', using just 2% of the SA-1B dataset. It achieves precision comparable to SAM in edge detection (AP: 0.794 vs. 0.793) and proposal generation (mask AR@1000: 49.7 vs. 51.8 for E32). Additionally, our model is 50 times faster than SAM-H E32. The model is very simple, primarily adopting the YOLOv8-seg structure. We welcome everyone to try it out. GitHub: https://github.com/CASIA-IVA-Lab/FastSAM, arXiv: https://arxiv.org/pdf/2306.12156.pdf
