ViLT
Time Calculation
Could you please provide some details on how you calculated the time, for example here in this figure?
Hi @menatallh,
For ResNets and region operations, I used this repo to measure the relevant time. Note that the repo batches the images for faster processing; we tweaked the code to use a batch size of 1 to measure the running time for a single image.
Both the backbone and the region operations are slower than conventional 224x224 image classification and COCO object detection because bottom-up attention requires (1) an 800x1333 image resolution and (2) 1600 object classes. As noted in our paper, you can refer to Jiang et al. 2020 (In Defense of Grid Features for Visual Question Answering) for more timing details.
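For intuition, here is a minimal sketch of the same batch-size-1 measurement idea, using a torchvision Faster R-CNN as a stand-in detector; this is not the bottom-up attention repo or its exact model, just an illustration of timing one 800x1333 image at a time (assumes torchvision >= 0.13 for the weights argument and a CUDA device).

import time
import torch
import torchvision

# Stand-in detector, NOT the actual bottom-up attention model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval().cuda()

# One dummy 800x1333 image, the resolution used by bottom-up attention
image = torch.rand(3, 800, 1333, device="cuda")

times = []
with torch.no_grad():
    for _ in range(100):
        torch.cuda.synchronize()
        tic = time.time()
        _ = model([image])  # list with one image -> batch size 1
        torch.cuda.synchronize()
        times.append(time.time() - tic)
print(sum(times) / len(times))  # average per-image detection time in seconds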
For transformers, the architecture itself is almost identical from model to model (BERT-base or ViT-B), so you can check it by adding a few lines to the out-of-the-box demo. You can modify this line as follows:
import time
import torch

times = list()
for i in range(1000):
    torch.cuda.synchronize()           # finish any pending GPU work
    tic = time.time()
    infer = model(batch)               # single forward pass on one example
    torch.cuda.synchronize()           # wait for the forward pass to complete
    times.append(time.time() - tic)
print(sum(times) / len(times))         # average latency in seconds
On my P40 machine, it prints 0.013984010219573975.
Hi @dandelin,
Do I understand correctly that you only measure the "feature extraction" and "model inference" time for one image, and that the dataloader time is not taken into account?
Another question: in infer = model(batch), is the batch size of batch 1 in your calculation?
Hi @junchen14
Yes, the times are calculated with a single image.
The above code snippet is from the demo script, so the batch size of batch is 1.
thanks
Hello @dandelin,
Great work you did here. I also have a question in this direction. Since these 15 ms are for a single image-text pair, does this mean that for text2image retrieval you need to feed every image in the database paired with the textual query? Or did you do some preprocessing, pre-optimization, or pruning to avoid all this overhead?
Thank you very much
Hi @JoanFM
We need to process all pairs to do cross-modal retrieval, and that is the main disadvantage shared by every VLP model. Some papers like this directly address this problem, but the retrieval speed is upper-bounded by that of a dual-encoder structure like CLIP.
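To make the contrast concrete, here is a minimal sketch of the two retrieval regimes. The helpers vilt_score (returns a scalar relevance score for one text-image pair), encode_text, and the precomputed image_embs matrix are hypothetical placeholders, not the actual ViLT or CLIP APIs.

import torch

def retrieve_cross_encoder(query, images, vilt_score):
    # Single-stream VLP (e.g. ViLT): every (text, image) pair goes through
    # the full transformer, so query cost grows linearly with database size.
    scores = torch.tensor([float(vilt_score(query, img)) for img in images])
    return scores.argsort(descending=True)

def retrieve_dual_encoder(query, image_embs, encode_text):
    # Dual encoder (e.g. CLIP): image embeddings are precomputed once;
    # at query time only one text forward pass and a matrix product are needed.
    q = encode_text(query)          # shape (d,)
    scores = image_embs @ q         # (N, d) @ (d,) -> (N,)
    return scores.argsort(descending=True)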
Thanks!