Problem with contrastive loss in pretrain stage
Thanks for your great work. I ran into a problem when reusing the hyperparameters from the NQ example for second-stage pre-training, as in coCondenser (we call this the uptrain stage, trained with a contrastive loss). Each training example contains 1 query, 1 positive passage, and 10 negative passages, fed through a custom dataloader backed by a streaming-mode dataset (the dataset covers two languages and contains 25M triplets). Our model is based on bert-base-multilingual-cased and has already been further pre-trained with the MLM objective (that loss curve converged). However, the contrastive pre-training does not seem to converge.
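For context, the dataloader streams the triplets instead of materializing them in memory. A minimal sketch of the idea, assuming a JSONL source with query, positive, and negatives fields (placeholder names, not our exact schema):

from datasets import load_dataset
from transformers import AutoTokenizer

def build_streaming_train_set(data_files, tokenizer, q_max_len=128, p_max_len=384):
    # Streaming mode iterates over the 25M triplets without loading them into memory.
    ds = load_dataset("json", data_files=data_files, split="train", streaming=True)

    def tokenize(example):
        # One query, its positive, and the mined negatives per training example.
        query = tokenizer(example["query"], truncation=True, max_length=q_max_len)
        passages = [example["positive"]] + example["negatives"]
        passages = tokenizer(passages, truncation=True, max_length=p_max_len)
        return {"query": query["input_ids"], "passages": passages["input_ids"]}

    return ds.map(tokenize)

# e.g. build_streaming_train_set("data/train/*.jsonl",
#                                AutoTokenizer.from_pretrained("bert-base-multilingual-cased"))

For completeness, here is the full training script: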
python -m torch.distributed.launch --nproc_per_node=8 -m asymmetric.train \
--model_name_or_path 'asymmetric/checkpoint-10000' \
--streaming \
--output $saved_path \
--do_train \
--train_dir 'data/train' \
--max_steps 10000 \
--per_device_train_batch_size 32 \
--dataset_num_proc 2 \
--train_n_passages 8 \
--gc_q_chunk_size 8 \
--gc_p_chunk_size 64 \
--untie_encoder \
--negatives_x_device \
--learning_rate 5e-4 \
--weight_decay 1e-2 \
--warmup_ratio 0.1 \
--save_steps 1000 \
--save_total_limit 20 \
--logging_steps 50 \
--q_max_len 128 \
--p_max_len 384 \
--fp16 \
--report_to 'wandb' \
--overwrite_output_dir
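For reference, this is the kind of in-batch contrastive objective such an uptrain stage optimizes (a generic sketch, not the exact implementation in this repo): each query is scored against its own positive and hard negatives plus every passage belonging to the other queries in the batch, and --negatives_x_device additionally gathers passage embeddings across the 8 GPUs before the softmax.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_reps, p_reps, n_passages, temperature=1.0):
    """InfoNCE-style loss over in-batch passages.

    q_reps: [batch, dim] query embeddings.
    p_reps: [batch * n_passages, dim] passage embeddings; for each query the
        first of its n_passages entries is the positive, the rest are hard
        negatives, and every other query's passages act as in-batch negatives.
    """
    scores = torch.matmul(q_reps, p_reps.transpose(0, 1)) / temperature  # [batch, batch * n_passages]
    # The positive for query i sits at column i * n_passages.
    target = torch.arange(q_reps.size(0), device=q_reps.device) * n_passages
    return F.cross_entropy(scores, target)

# Toy shapes matching --train_n_passages 8: 4 queries, 4 * 8 = 32 passages.
q = F.normalize(torch.randn(4, 768), dim=-1)
p = F.normalize(torch.randn(32, 768), dim=-1)
print(in_batch_contrastive_loss(q, p, n_passages=8))

With --per_device_train_batch_size 32, --train_n_passages 8, and cross-device negatives over 8 GPUs, each query's softmax then covers roughly 32 * 8 * 8 = 2048 candidate passages.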
To understand this better, could you elaborate on what hardware you are using?