LLaVA-NeXT
When will the training code be open-sourced?
Thanks for your great work. When will the training code be open-sourced?
We definitely plan to open-source everything (including the previous LLaVA-NeXT training code and data) to benefit the community.
However, there is still a lot of work to do, and more releases are coming~
Can you please tell me if there is any obvious difference between the training code and the one in https://github.com/haotian-liu/LLaVA? I'm trying to finetune it on my dataset.
There's no major difference if you wish to finetune our 8b/72b/110b models. The only difference is that you need to apply the new conversation templates.
Indeed, I found that the special tokens of llama3-8B are different from those of vicuna-7B. Could you please tell me which conversation template you use to finetune llama3-8B?
You can find it in llava/conversations.py; we use the llama-3 template.
But that is for inference only; for training, you need to implement the masking logic for llama-3 on your side.
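For anyone else landing here, a minimal sketch of that masking logic (illustrative only, not the repo's actual preprocessing; the helper name, the turn format, and the hard-coded llama-3 chat markers are my assumptions) could look like this: token ids are built segment by segment so the label mask stays aligned with the inputs by construction, and only assistant responses keep real labels.

```python
import torch

IGNORE_INDEX = -100  # same convention as LLaVA's llava/constants.py; HF loss ignores -100 by default


def preprocess_llama3(turns, tokenizer):
    """turns: list of (role, text) pairs, with roles 'user' or 'assistant' (assumed format)."""
    input_ids, labels = [], []

    def append(text, supervised):
        ids = tokenizer(text, add_special_tokens=False).input_ids
        input_ids.extend(ids)
        labels.extend(ids if supervised else [IGNORE_INDEX] * len(ids))

    append("<|begin_of_text|>", supervised=False)
    for role, text in turns:
        # Role headers are never supervised.
        append(f"<|start_header_id|>{role}<|end_header_id|>\n\n", supervised=False)
        # Only assistant responses (plus their <|eot_id|>) contribute to the loss.
        append(text + "<|eot_id|>", supervised=(role == "assistant"))

    return dict(input_ids=torch.tensor(input_ids), labels=torch.tensor(labels))


# Example usage (hypothetical):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# sample = preprocess_llama3([("user", "Describe the image."), ("assistant", "A cat.")], tokenizer)
```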
OK thanks, I'll try it.
Hi,
From your provided checkpoint (https://huggingface.co/lmms-lab/llama3-llava-next-8b), I found that the pre-training config is:
```bash
PROMPT_VERSION=plain
PRETRAIN_DATA_VERSION="blip558k"
```
So I referred to https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/pretrain.sh to pre-train with the LLaMA3-8B backbone, using the following script:
```bash
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path ckpts/Meta-Llama-3-8B \
    --version plain \
    --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
    --image_folder ./playground/data/LLaVA-Pretrain/images \
    --vision_tower ckpts/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-llama-8b-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to tensorboard
```
However, the training loss is much higher than when pre-training with vicuna-7B (2.40 vs. 2.06).
To adapt LLaMA3-8B to the training code in https://github.com/haotian-liu/LLaVA, I manually added the pad_token and unk_token like this:
```python
if tokenizer.pad_token is None:
    print("\n add unk_token \n\n")
    smart_tokenizer_and_embedding_resize(
        special_tokens_dict=dict(unk_token="<unk>"),
        tokenizer=tokenizer,
        model=model,
    )
    tokenizer.pad_token = tokenizer.unk_token
```
I wonder whether this will hurt pre-training performance. If this is not correct, how should I set padding_value in the line input_ids = torch.nn.utils.rnn.pad_sequence()?
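For what it's worth, one common workaround (an illustration, not necessarily what the authors did) is to reuse the eos token as the pad token instead of resizing the embeddings, pad the labels with IGNORE_INDEX, and derive the attention mask from the pad id. The collate function below is a sketch under the assumption that each sample already carries 1-D input_ids and labels tensors.

```python
import torch

IGNORE_INDEX = -100


def collate(batch, tokenizer):
    """batch: list of dicts holding 1-D 'input_ids' and 'labels' tensors (assumed layout)."""
    if tokenizer.pad_token is None:
        # Reuse eos as pad so no embedding resize is needed. Caveat: any real eos
        # tokens in input_ids are then also zeroed in the attention mask below;
        # adding a dedicated pad token and resizing embeddings avoids that.
        tokenizer.pad_token = tokenizer.eos_token

    input_ids = torch.nn.utils.rnn.pad_sequence(
        [b["input_ids"] for b in batch],
        batch_first=True,
        padding_value=tokenizer.pad_token_id,
    )
    labels = torch.nn.utils.rnn.pad_sequence(
        [b["labels"] for b in batch],
        batch_first=True,
        padding_value=IGNORE_INDEX,  # padded positions never contribute to the loss
    )
    attention_mask = input_ids.ne(tokenizer.pad_token_id)
    return dict(input_ids=input_ids, labels=labels, attention_mask=attention_mask)
```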
@jimchenhub @Luodian I found that the training scripts were published in the HF model README: https://huggingface.co/lmms-lab/llama3-llava-next-8b
Is there any way to actually run this training script?
Hi @jimchenhub, have you figured it out?