ViLT
Time Calculation
Could you please provide some details on how you calculated the time, for example here in this figure?
Hi @menatallh,
For ResNets and region operations, I used this repo to measure the relevant time. Note that the repo batches the images for faster processing; we tweaked the code to use a batch size of 1 to measure the running time for a single image.
Both the backbone and the region operations are slower than conventional 224x224 image classification and COCO object detection because bottom-up attention requires (1) an 800x1333 image resolution and (2) 1600 object classes. As noted in our paper, you can refer to Jiang et al. 2020 (In Defense of Grid Features for Visual Question Answering) for more timing details.
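For intuition, here is a minimal sketch of the same batch-size-1 measurement idea, using a torchvision Faster R-CNN as a stand-in detector; this is not the bottom-up attention repo or its exact model, just an illustration of timing one 800x1333 image at a time (assumes torchvision >= 0.13 for the weights argument and a CUDA device).

import time
import torch
import torchvision

# Stand-in detector, NOT the actual bottom-up attention model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval().cuda()

# One dummy 800x1333 image, the resolution used by bottom-up attention
image = torch.rand(3, 800, 1333, device="cuda")

times = []
with torch.no_grad():
    for _ in range(100):
        torch.cuda.synchronize()
        tic = time.time()
        _ = model([image])  # list with one image -> batch size 1
        torch.cuda.synchronize()
        times.append(time.time() - tic)
print(sum(times) / len(times))  # average per-image detection time in seconds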
For transformers, the architecture itself is almost identical from model to model (BERT-base or ViT-B), so you can check it by adding a few lines to the out-of-the-box demo. You can modify this line as follows:
import time
import torch

times = list()
for i in range(1000):
    torch.cuda.synchronize()           # finish any pending GPU work
    tic = time.time()
    infer = model(batch)               # single forward pass on one example
    torch.cuda.synchronize()           # wait for the forward pass to complete
    times.append(time.time() - tic)
print(sum(times) / len(times))         # average latency in seconds
On my P40 machine, it prints 0.013984010219573975.
Hi @dandelin,
Do I understand correctly that you only measure the "feature extraction" and "model inference" time for one image, and that the dataloader time is not taken into account?
Another question: in infer = model(batch), is the batch size of batch 1 in your calculation?
Hi @junchen14
Yes, the times are calculated with a single image.
The above code snippet is from the demo script, so the batch size of batch is 1.
thanks
Hello @dandelin,
Great work you did here. I also have a question in this direction. Since these 15 ms are for a single image-text pair, does this mean that for text2image retrieval you need to feed every image in the database paired with the textual query? Or did you do some preprocessing, pre-optimization, or pruning to avoid all this overhead?
Thank you very much
Hi @JoanFM
We need to process all pairs to do cross-modal retrieval, and that is the main disadvantage shared by every VLP model. Some papers like this directly address this problem, but the retrieval speed is upper-bounded by that of a dual-encoder structure like CLIP.
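To make the contrast concrete, here is a minimal sketch of the two retrieval regimes. The helpers vilt_score (returns a scalar relevance score for one text-image pair), encode_text, and the precomputed image_embs matrix are hypothetical placeholders, not the actual ViLT or CLIP APIs.

import torch

def retrieve_cross_encoder(query, images, vilt_score):
    # Single-stream VLP (e.g. ViLT): every (text, image) pair goes through
    # the full transformer, so query cost grows linearly with database size.
    scores = torch.tensor([float(vilt_score(query, img)) for img in images])
    return scores.argsort(descending=True)

def retrieve_dual_encoder(query, image_embs, encode_text):
    # Dual encoder (e.g. CLIP): image embeddings are precomputed once;
    # at query time only one text forward pass and a matrix product are needed.
    q = encode_text(query)          # shape (d,)
    scores = image_embs @ q         # (N, d) @ (d,) -> (N,)
    return scores.argsort(descending=True)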
Thanks!