Oscar
How long did it take to pretrain OSCAR/OSCAR+?
What GPU resources and training time were required?
@pzzhang I would really appreciate it if you could clarify the following:
- GPU type, number of GPUs, memory of each GPU used, training time
The example command you provided in oscarplus-pretraining uses a batch size of 8 -- to reproduce the results, should we change it back to 1024?
python -m torch.distributed.launch --nproc_per_node=8 oscar/run_oscarplus_pretrain.py \
--use_b 1 \
--max_grad_norm 10.0 --gradient_accumulation_steps 1 \
--use_img_layernorm 1 \
--output_dir <your output folder> \
--bert_model bert --model_name_or_path bert-base-uncased \
--do_lower_case --learning_rate 5e-05 \
--warmup_steps 0 --do_train --max_seq_length 35 --on_memory \
--max_img_seq_length 50 --img_feature_dim 2054 \
--drop_out 0.1 --train_batch_size 8 \
--ckpt_period 10000 --max_iters 2000000 --log_period 100 \
--data_dir <The input data dir that contains the .yaml files> --dataset_file coco_flickr30k_googlecc_gqa_sbu_oi_x152c4big2exp168.yaml \
--textb_sample_mode 1 --texta_false_prob 0.25
Hi @FingerRec and @coldmanck , please change the batch size to 1024, which is the batch size reported in the paper.
Sorry, the batch size of 8 here is just for debugging.
@pzzhang Thank you for your reply! How about the training resources (gpu type/memory/time) you used?
I'm working with less capable GPUs, so I am also wondering how I should modify the batch size accordingly (and how this would affect the pretraining performance). Please let me know if you have any advice on this! Thank you a lot :)
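Not an official recommendation, just a sketch of the usual workaround: keep the effective batch size at 1024 by raising --gradient_accumulation_steps, which the script already exposes. The helper below assumes effective_batch = per_gpu_batch × num_gpus × accumulation_steps, which may or may not be exactly how run_oscarplus_pretrain.py interprets its flags.

```python
# Hypothetical helper (not part of the Oscar repo): pick a gradient accumulation
# factor that keeps the effective batch size at 1024 on smaller GPUs.
# Assumes effective_batch = per_gpu_batch * num_gpus * accumulation_steps.

def accumulation_steps_for(target_batch=1024, per_gpu_batch=32, num_gpus=8):
    per_step = per_gpu_batch * num_gpus
    if target_batch % per_step:
        raise ValueError("target batch size must be a multiple of the per-step batch")
    return target_batch // per_step

# e.g. 8 GPUs that only fit 32 samples each -> accumulate gradients over 4 steps
print(accumulation_steps_for(1024, per_gpu_batch=32, num_gpus=8))  # 4
```

Whether pretraining quality is preserved at a smaller physical batch size is a separate question that only experiments (or the authors) can answer.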
Hi @pzzhang I see that I would need ~22 days to pretrain OSCAR+ on 8 V100 GPUs, with about 21 GB of memory occupied on each. As the V100 has 32 GB of memory available, I am considering increasing the batch size by 1.5x (i.e., to 1536), in which case I would presumably only need 2/3 of the steps (a quick check of this arithmetic is included after this post). Do you think this is valid?
Moreover, I'd like you to clarify what you mean by OSCAR+ being trained for "at least 1M steps". I see that max_iter is set to 2M by default, and I couldn't find any early-stopping code, so I am wondering what "at least" means here. How did you choose the final checkpoint to report the performance?
Also, the 2M max_iter affects the speed of the linear learning rate decay. If we want to train for, say, (at least?) 1M steps, shouldn't we modify it to 1M accordingly?
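Just to sanity-check the arithmetic in the question above (nothing here comes from the authors; it only confirms that the total number of training samples seen stays constant, not that optimization would behave identically at the larger batch size):

```python
# Quick arithmetic check: scaling the batch size by 1.5x while keeping the
# total number of training samples constant requires roughly 2/3 of the steps.
# Larger batches often also call for learning-rate adjustments, which this
# check says nothing about.

baseline_batch, baseline_steps = 1024, 2_000_000
scaled_batch = 1536  # 1.5x

total_samples = baseline_batch * baseline_steps  # ~2.05e9 samples
scaled_steps = total_samples / scaled_batch      # ~1.33e6 steps
print(scaled_steps / baseline_steps)             # 0.666..., i.e. 2/3 of the steps
```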
@coldmanck
training resources: 16 V100 GPUs (32 GB each), total batch size 1024; training for 2M iterations takes about 20 weeks. We do not use AMP, and 32 GB of memory is sufficient. If you use AMP, the memory usage and training time can be reduced.
We simply evaluate the intermediate checkpoints every 100k iterations and pick the best one among them.
You do not need to modify it to 1M; the learning-rate decay is linear, and with max_iter set to 1M the learning rate would become too small for the final iterations.
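To illustrate the point about the linear decay, here is a minimal sketch of a linear warmup/decay schedule (an assumption for illustration, not the repo's actual scheduler code): with max_iters left at 2M, the learning rate at step 1M is still half the peak value, whereas with max_iters set to 1M it would have decayed to zero.

```python
# Hypothetical illustration of a linear warmup/decay schedule. The defaults
# mirror the example command (--learning_rate 5e-05, --warmup_steps 0).

def linear_lr(step, base_lr=5e-5, warmup_steps=0, max_iters=2_000_000):
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (max_iters - step) / max(1, max_iters - warmup_steps))

print(linear_lr(1_000_000, max_iters=2_000_000))  # 2.5e-05: still half the peak LR
print(linear_lr(1_000_000, max_iters=1_000_000))  # 0.0: decayed all the way to zero
```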
@pzzhang Thank you for your detailed reply. I really appreciate it! :)
I have two more critical questions:
- Oscar+ consists of two types of training samples, namely <caption-tags-img_features> and <question-answer-img_features>. It seems natural to me that examples of the former type should be used to predict polluted captions, while the latter should be used to predict polluted answers. However, it seems that your code does not distinguish the type of input example when sampling a corrupted sentence, so a QA pair may get its question replaced, or a caption-tags pair may get its tags replaced. Have I missed anything? Quoting from page 7 of "VinVL: Revisiting Visual Representations in Vision-Language Models" (arXiv version):
To classify whether a caption-tags-image triplet contains a polluted caption is a text-image matching task. To classify whether a question-answer-image triplet contains a polluted answer is an answer selection task for VQA.
- It seems that your code performs both pretraining tasks (Masked Token Loss and Contrastive Loss) at the same time. According to the code, after randomly replacing token_1 or token_2 with another arbitrary token (with 50% chance), words in both token_1 and token_2 are randomly masked. However, since token_1 or token_2 might be a randomly sampled token, it doesn't make sense to, say, predict a masked word in token_1 conditioned on a randomly sampled token_2, and vice versa. This also doesn't match the loss in your paper (minimizing the negative log-likelihood -log p(h_i | h_{\i}, v)), since the context is no longer h_{\i}. Other BERT-style works such as UNITER seem to train each pretext task separately, which looks more reasonable to me. A rough sketch of what I mean is included below.
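One possible arrangement, sketched purely for illustration (this is not the Oscar/VinVL code; the sample type handling and the random_caption / random_answer / mask_tokens helpers are hypothetical, and the contrastive labels are kept binary for simplicity): pollute the caption for caption-tags samples and the answer for question-answer samples, and skip Masked Token Loss masking whenever the pair is polluted.

```python
import random

# Hypothetical sketch, NOT the Oscar/VinVL code: pollute the caption for
# caption-tags samples and the answer for question-answer samples, and apply
# Masked Token Loss masking only when the pair was NOT polluted.

def build_example(sample, random_caption, random_answer, mask_tokens,
                  pollute_prob=0.5):
    text_a, text_b = sample["text_a"], sample["text_b"]  # caption/question, tags/answer
    polluted = random.random() < pollute_prob

    if polluted:
        if sample["type"] == "caption_tags":
            text_a = random_caption()   # polluted caption -> text-image matching
        else:                           # "question_answer"
            text_b = random_answer()    # polluted answer -> answer selection
        contrastive_label = 0           # "polluted"
        mtl_targets = None              # skip MTL: the context is no longer h_{\i}
    else:
        contrastive_label = 1           # "matched"
        text_a, text_b, mtl_targets = mask_tokens(text_a, text_b)

    return text_a, text_b, contrastive_label, mtl_targets


if __name__ == "__main__":
    sample = {"type": "question_answer",
              "text_a": "what color is the cat", "text_b": "black"}
    out = build_example(sample,
                        random_caption=lambda: "a dog on a sofa",
                        random_answer=lambda: "seven",
                        mask_tokens=lambda a, b: (a, b, []))
    print(out)
```

An alternative, closer to how the question above describes UNITER, is to sample a single pretext task per example so that masking and contrastive corruption are never applied to the same instance.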
@coldmanck Hello, when you did the pre-training, did a small corpus and a large corpus take the same amount of time? When I tried, pre-training with a large corpus was much slower than with a small corpus -- why is that?
20 weeks? 140 days? Seriously? Or did you really mean 20 days?
Not sure it'd be 20 days. 20 days would mean 20 × 86,400 / 2,000,000 = 0.864 seconds per iteration, but the time for end-to-end inference alone is over half a second (last page of the paper), and I think that is for a single sample. Add to that loading data for a whole batch, computing gradients, updating parameters, logging, etc., and it's hard to believe an iteration would take only 0.864 seconds. 20 weeks seems more reasonable to me, which works out to 0.864 × 7 ≈ 6 seconds per batch. Either way, an insanely long time to pre-train a model; it blew my mind!
PS. I'm an amateur, so I might be wrong in my interpretation :')
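For reference, the seconds-per-iteration figures quoted above can be checked directly (using only the numbers mentioned in this thread):

```python
# Arithmetic check for the post above: seconds per iteration implied by
# 2M iterations finishing in 20 days vs. 20 weeks.

iters = 2_000_000
seconds_per_day = 86_400

for label, days in [("20 days", 20), ("20 weeks", 20 * 7)]:
    sec_per_iter = days * seconds_per_day / iters
    print(f"{label}: {sec_per_iter:.3f} s/iteration")

# 20 days:  0.864 s/iteration (hard to square with >0.5 s per-sample inference)
# 20 weeks: 6.048 s/iteration (more plausible for batch 1024 on 16 V100s)
```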