Train llama 3.1 with GRIT
I'm now trying to train llama3.1 with the GRIT pipeline.
At first, I directly changed `--model_name_or_path` and ran the training code (the training script I used is as follows):
```bash
#!/bin/bash
#SBATCH --time=6:00:00
#SBATCH --job-name=grit_train
#SBATCH --gres=gpu:h100-96:2
#SBATCH --mem=60G
#SBATCH --output=/home/e/e1347696/unified_encoder_decoder/logs/grit_train_out.log
#SBATCH --error=/home/e/e1347696/unified_encoder_decoder/logs/grit_train_err.log
source ~/.bashrc
conda activate grit_eval
export CUDA_HOME='/usr/local/cuda-12.1'
# CUDA_VISIBLE_DEVICES=$(python train/gritlm/mig_uuid_setup.py)
export CUDA_VISIBLE_DEVICES=0,1
cd /home/e/e1347696/unified_encoder_decoder
# nvidia-smi
deepspeed \
--num_gpus=2 \
--module train.gritlm.training.run \
--output_dir results/GritLM-7B-training \
--model_name_or_path model/Llama-3.1-8B \
--train_data data/grit_training_data \
--max_example_num_per_dataset 1000 \
--learning_rate 2e-5 \
--lr_scheduler_type linear \
--warmup_ratio 0.03 \
--max_steps 1253 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 256 \
--per_device_generative_bs 32 \
--dataloader_drop_last \
--normalized \
--temperature 0.02 \
--train_group_size 2 \
--negatives_cross_device \
--query_max_len 256 \
--passage_max_len 1024 \
--mode unified \
--logging_steps 1 \
--bf16 \
--pooling_method mean \
--use_unique_indices \
--loss_gen_type mixed \
--attn bbcc \
--attn_implementation sdpa \
--no_gen_gas \
--gradient_checkpointing \
--save_steps 1000 \
--split_emb \
--deepspeed scripts/configs/config_8gpusds_m7.json
```
But I get an error: `TypeError: LlamaModel.forward() got an unexpected keyword argument 'is_causal'`. I looked into it and found several issues regarding this: #34, #32 and #19.
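A quick check (assuming the stock transformers Llama implementation) shows the parameter simply isn't there:

```python
import inspect
from transformers.models.llama.modeling_llama import LlamaModel

# The vanilla Llama forward has no `is_causal` parameter, which is why the
# GRIT pipeline's call into it raises the TypeError above.
print("is_causal" in inspect.signature(LlamaModel.forward).parameters)  # False
```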
Just to confirm, if I want to train llama 3.1 with GRIT, can I just

- reuse the provided modeling file directly by putting `modeling_gritlm7b.py` into the llama3.1 model folder, or do I need to
- change the modeling file for llama3.1 so that it could accept the `is_causal` arg and thus influence attention behavior?
Thank you so much!
> change the modeling file for llama3.1 so that it could accept the `is_causal` arg and thus influence attention behavior?
I thought `is_causal` is an argument controlling whether we are using bidirectional attention in the model or not. Since the original modeling file does not accept such an argument, do we need to implement it ourselves?
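To illustrate my understanding, here is a rough sketch of how I picture `is_causal` gating the attention mask (names are illustrative, not the actual gritlm code):

```python
import torch

def build_attention_mask(seq_len: int, is_causal: bool) -> torch.Tensor:
    """Boolean mask where True marks a position a token may attend to."""
    if is_causal:
        # Causal LM behavior: each token sees only itself and earlier tokens.
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Bidirectional (embedding) behavior: every token sees the full sequence.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)
```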
I’m still learning, so please kindly correct me if I’m mistaken. Thank you so much!
yes you're right; i meant to say that that is the option you have to go with; sorry i should have removed the ?
Thank you so much! Nah, you don't need to remove the "?", it's just that I don't have much confidence in it :)
I've checked the code for `modeling_mistral_gritlm.py` and had a few other questions:
- Do I need to modify other settings besides implementing the `is_causal` arg?
- I noticed that flash attention is used in `modeling_mistral_gritlm.py` but not in the llama3 modeling file. So if I'm going to implement my own version of the `is_causal` arg, can I simply add the bidirectional component in the `Attention` class?
Really appreciate your prompt guidance, even on weekends! You are indeed one of the most helpful and fastest-responding authors I have contacted.
- only `is_causal`
- yeah any attention mechanism should be fine as long as you implement the masking for it; with sdpa, for instance, it can look like the sketch below
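A minimal sketch of that masking switch, assuming an sdpa-style attention and toy shapes (not the actual llama attention class):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, is_causal: bool) -> torch.Tensor:
    """q, k, v: (batch, num_heads, seq_len, head_dim).

    is_causal=True  -> standard causal LM attention.
    is_causal=False -> bidirectional attention, as the GRIT embedding pass needs.
    """
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)

# Toy usage: identical inputs, two attention behaviors.
q = k = v = torch.randn(1, 4, 8, 16)
causal_out = attention(q, k, v, is_causal=True)
bidir_out = attention(q, k, v, is_causal=False)
```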