Oscar
How long did it take to pretrain OSCAR/OSCAR+?
What GPU resources and training time were required?
@pzzhang I would really appreciate it if you could clarify the following:
- GPU type, number of GPUs, memory of each GPU used, training time
The example command you provided in oscarplus-pretraining uses a batch size of 8 -- to reproduce the results, should we change it back to 1024?
python -m torch.distributed.launch --nproc_per_node=8 oscar/run_oscarplus_pretrain.py \
--use_b 1 \
--max_grad_norm 10.0 --gradient_accumulation_steps 1 \
--use_img_layernorm 1 \
--output_dir <your output folder> \
--bert_model bert --model_name_or_path bert-base-uncased \
--do_lower_case --learning_rate 5e-05 \
--warmup_steps 0 --do_train --max_seq_length 35 --on_memory \
--max_img_seq_length 50 --img_feature_dim 2054 \
--drop_out 0.1 --train_batch_size 8 \
--ckpt_period 10000 --max_iters 2000000 --log_period 100 \
--data_dir <The input data dir that contains the .yaml files> --dataset_file coco_flickr30k_googlecc_gqa_sbu_oi_x152c4big2exp168.yaml \
--textb_sample_mode 1 --texta_false_prob 0.25
Hi @FingerRec and @coldmanck , please change the batch size to 1024, which is the batch size reported in the paper.
Sorry, the batch size of 8 here is just for debugging.
@pzzhang Thank you for your reply! How about the training resources (gpu type/memory/time) you used?
I'm working with less capable GPUs, so I am also wondering how I should modify the batch size accordingly (and how this would affect the pretraining performance). Please let me know if you have any advice on this! Thank you a lot :)
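Not an official recommendation, just a sketch of the usual workaround: keep the effective batch size at 1024 by raising --gradient_accumulation_steps, which the script already exposes. The helper below assumes effective_batch = per_gpu_batch × num_gpus × accumulation_steps, which may or may not be exactly how run_oscarplus_pretrain.py interprets its flags.

```python
# Hypothetical helper (not part of the Oscar repo): pick a gradient accumulation
# factor that keeps the effective batch size at 1024 on smaller GPUs.
# Assumes effective_batch = per_gpu_batch * num_gpus * accumulation_steps.

def accumulation_steps_for(target_batch=1024, per_gpu_batch=32, num_gpus=8):
    per_step = per_gpu_batch * num_gpus
    if target_batch % per_step:
        raise ValueError("target batch size must be a multiple of the per-step batch")
    return target_batch // per_step

# e.g. 8 GPUs that only fit 32 samples each -> accumulate gradients over 4 steps
print(accumulation_steps_for(1024, per_gpu_batch=32, num_gpus=8))  # 4
```

Whether pretraining quality is preserved at a smaller physical batch size is a separate question that only experiments (or the authors) can answer.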
Hi @pzzhang I see that I would need ~22 days to pretrain OSCAR+ on 8 V100 GPUs, with about 21 GB of memory occupied on each. As the V100 has 32 GB of memory available, I am considering increasing the batch size by 1.5x (i.e., to 1536), in which case I would presumably only need 2/3 of the steps (a quick check of this arithmetic is included after this post). Do you think this is valid?
Moreover, I'd like you to clarify what you mean by OSCAR+ being trained for "at least 1M steps". I see that max_iter is set to 2M by default, and I couldn't find any early-stopping code, so I am wondering what "at least" means here. How did you choose the final checkpoint to report the performance?
Also, the 2M max_iter affects the speed of the linear learning rate decay. If we want to train for, say, (at least?) 1M steps, shouldn't we modify it to 1M accordingly?
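Just to sanity-check the arithmetic in the question above (nothing here comes from the authors; it only confirms that the total number of training samples seen stays constant, not that optimization would behave identically at the larger batch size):

```python
# Quick arithmetic check: scaling the batch size by 1.5x while keeping the
# total number of training samples constant requires roughly 2/3 of the steps.
# Larger batches often also call for learning-rate adjustments, which this
# check says nothing about.

baseline_batch, baseline_steps = 1024, 2_000_000
scaled_batch = 1536  # 1.5x

total_samples = baseline_batch * baseline_steps  # ~2.05e9 samples
scaled_steps = total_samples / scaled_batch      # ~1.33e6 steps
print(scaled_steps / baseline_steps)             # 0.666..., i.e. 2/3 of the steps
```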
@coldmanck
training resources: 16 V100 GPUs (32 GB each), total batch size 1024; training for 2M iterations takes about 20 weeks. We do not use AMP, and 32 GB of memory is sufficient. If you use AMP, the memory usage and training time can be reduced.
We simply evaluate the intermediate checkpoints every 100k iterations and pick the best one among them.
You do not need to modify it to 1M; the learning-rate decay is linear, and with max_iter set to 1M the learning rate would become too small for the final iterations.
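To illustrate the point about the linear decay, here is a minimal sketch of a linear warmup/decay schedule (an assumption for illustration, not the repo's actual scheduler code): with max_iters left at 2M, the learning rate at step 1M is still half the peak value, whereas with max_iters set to 1M it would have decayed to zero.

```python
# Hypothetical illustration of a linear warmup/decay schedule. The defaults
# mirror the example command (--learning_rate 5e-05, --warmup_steps 0).

def linear_lr(step, base_lr=5e-5, warmup_steps=0, max_iters=2_000_000):
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (max_iters - step) / max(1, max_iters - warmup_steps))

print(linear_lr(1_000_000, max_iters=2_000_000))  # 2.5e-05: still half the peak LR
print(linear_lr(1_000_000, max_iters=1_000_000))  # 0.0: decayed all the way to zero
```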
@pzzhang Thank you for your detailed reply. I really appreciate it! :)
I have two more critical questions:
- Oscar+ consists of two types of training samples, namely <caption-tags-img_features> and <question-answer-img_features>. It seems natural to me that examples of the former type should be used to predict polluted captions, while the latter should be used to predict polluted answers. However, it seems that your code does not distinguish the type of input example when sampling a corrupted sentence, so a QA pair may get its question replaced, or a caption-tags pair may get its tags replaced. Have I missed anything? Quoting from page 7 of "VinVL: Revisiting Visual Representations in Vision-Language Models" (arXiv version):
To classify whether a caption-tags-image triplet contains a polluted caption is a text-image matching task. To classify whether a question-answer-image triplet contains a polluted answer is an answer selection task for VQA.
- It seems that your code performs both pretraining tasks (Masked Token Loss and Contrastive Loss) at the same time. According to the code, after randomly replacing token_1 or token_2 with another arbitrary token (with 50% chance), words in both token_1 and token_2 are randomly masked. However, since token_1 or token_2 might be a randomly sampled token, it doesn't make sense to, say, predict a masked word in token_1 conditioned on a randomly sampled token_2, and vice versa. This also doesn't match the loss in your paper (minimizing the negative log-likelihood -log p(h_i | h_{\i}, v)), since the context is no longer h_{\i}. Other BERT-style works such as UNITER seem to train each pretext task separately, which looks more reasonable to me. A rough sketch of what I mean is included below.
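One possible arrangement, sketched purely for illustration (this is not the Oscar/VinVL code; the sample type handling and the random_caption / random_answer / mask_tokens helpers are hypothetical, and the contrastive labels are kept binary for simplicity): pollute the caption for caption-tags samples and the answer for question-answer samples, and skip Masked Token Loss masking whenever the pair is polluted.

```python
import random

# Hypothetical sketch, NOT the Oscar/VinVL code: pollute the caption for
# caption-tags samples and the answer for question-answer samples, and apply
# Masked Token Loss masking only when the pair was NOT polluted.

def build_example(sample, random_caption, random_answer, mask_tokens,
                  pollute_prob=0.5):
    text_a, text_b = sample["text_a"], sample["text_b"]  # caption/question, tags/answer
    polluted = random.random() < pollute_prob

    if polluted:
        if sample["type"] == "caption_tags":
            text_a = random_caption()   # polluted caption -> text-image matching
        else:                           # "question_answer"
            text_b = random_answer()    # polluted answer -> answer selection
        contrastive_label = 0           # "polluted"
        mtl_targets = None              # skip MTL: the context is no longer h_{\i}
    else:
        contrastive_label = 1           # "matched"
        text_a, text_b, mtl_targets = mask_tokens(text_a, text_b)

    return text_a, text_b, contrastive_label, mtl_targets


if __name__ == "__main__":
    sample = {"type": "question_answer",
              "text_a": "what color is the cat", "text_b": "black"}
    out = build_example(sample,
                        random_caption=lambda: "a dog on a sofa",
                        random_answer=lambda: "seven",
                        mask_tokens=lambda a, b: (a, b, []))
    print(out)
```

An alternative, closer to how the question above describes UNITER, is to sample a single pretext task per example so that masking and contrastive corruption are never applied to the same instance.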
@coldmanck Hello, when you did the pre-training, did a small corpus and a large corpus take the same amount of time? When I tried, pre-training with a large corpus was much slower than with a small corpus -- why is that?
20 weeks? 140 days? Seriously? Or did you really mean 20 days?
Not sure it'd be 20 days. 20 days would mean 20 × 86,400 / 2,000,000 = 0.864 seconds per iteration, but the time for end-to-end inference alone is over half a second (last page of the paper), and I think that is for a single sample. Add to that loading data for a whole batch, computing gradients, updating parameters, logging, etc., and it's hard to believe an iteration would take only 0.864 seconds. 20 weeks seems more reasonable to me, which works out to 0.864 × 7 ≈ 6 seconds per batch. Either way, an insanely long time to pre-train a model; it blew my mind!
PS. I'm an amateur, so I might be wrong in my interpretation :')
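For reference, the seconds-per-iteration figures quoted above can be checked directly (using only the numbers mentioned in this thread):

```python
# Arithmetic check for the post above: seconds per iteration implied by
# 2M iterations finishing in 20 days vs. 20 weeks.

iters = 2_000_000
seconds_per_day = 86_400

for label, days in [("20 days", 20), ("20 weeks", 20 * 7)]:
    sec_per_iter = days * seconds_per_day / iters
    print(f"{label}: {sec_per_iter:.3f} s/iteration")

# 20 days:  0.864 s/iteration (hard to square with >0.5 s per-sample inference)
# 20 weeks: 6.048 s/iteration (more plausible for batch 1024 on 16 V100s)
```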