
How do we start training from your checkpoints?

Open HongLouyemeng opened this issue 10 months ago • 16 comments

I noticed that the tutorial only covers fine-tuning from the open-source LLM weights. If I want to start from your checkpoint instead, what is the recommended approach?

HongLouyemeng avatar Apr 14 '24 05:04 HongLouyemeng

Hi, I'm not sure what's the meaning of "start from checkpoint". Did you mean continue training from the pre-trained checkpoint?

yanwei-li avatar Apr 15 '24 03:04 yanwei-li

Hi, I'm not sure what's the meaning of "start from checkpoint". Did you mean continue training from the pre-trained checkpoint?

Hi, that's right. I may not be phrasing this very precisely: starting from the same base model, you fine-tuned through two stages. What I want is to fine-tune the base model that you have already pre-trained on your data, for the same model size. Do I simply use the same bash script? That is how I understand it. For example, given LLaMA and Mini-Gemini-LLaMA, what I want to fine-tune is Mini-Gemini-LLaMA.

HongLouyemeng avatar Apr 15 '24 03:04 HongLouyemeng

Same question.

GoldenFishes avatar Apr 15 '24 07:04 GoldenFishes

Same question.

I figure that's how it works, but let's wait for the author's answer.

HongLouyemeng avatar Apr 15 '24 08:04 HongLouyemeng

I have the same question. Could the author write a document explaining it?

zhiting-wang avatar Apr 15 '24 11:04 zhiting-wang

from minigemini.model.builder import load_pretrained_model

# load Mini-Gemini-2B
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="work_dirs/Mini-Gemini-2B", model_base=None,
    model_name='YanweiLi/Mini-Gemini-2B', load_8bit=False, load_4bit=False,
    device_map="auto", device="cuda", use_flash_attn=False)

GoldenFishes avatar Apr 16 '24 01:04 GoldenFishes

(Quoting the load_pretrained_model snippet above.)

https://huggingface.co/YanweiLi/Mini-Gemini-8x7B/blob/main/config.json

elif "mixtral" in model_args.model_name_or_path:

HongLouyemeng avatar Apr 16 '24 03:04 HongLouyemeng

Hi, I guess you can try this

FINETUNE_NAME=Mini-Gemini-7B
STAGE3_NAME=Your_prefer_name
AUX_SIZE=768
deepspeed minigemini/train/train_mem.py \
    --deepspeed ./scripts/zero2_offload.json \
    --model_name_or_path ./work_dirs/$FINETUNE_NAME \
    --version v1 \
    --data_path ./data/MiniGemini-Finetune/minigemini_instruction.json \
    --image_folder ./data/MiniGemini-Finetune \
    --vision_tower model_zoo/OpenAI/clip-vit-large-patch14-336 \
    --vision_tower_aux model_zoo/OpenAI/openclip-convnext-large-d-320-laion2B-s29B-b131K-ft-soup \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --image_size_aux $AUX_SIZE \
    --bf16 True \
    --output_dir ./work_dirs/$STAGE3_NAME \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
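
If the released checkpoint is not already under work_dirs, one way to fetch it is via huggingface_hub; a minimal sketch, assuming the repo id YanweiLi/Mini-Gemini-7B on the Hub and the local layout used above:

from huggingface_hub import snapshot_download

# Download the released Mini-Gemini-7B checkpoint so that
# --model_name_or_path ./work_dirs/Mini-Gemini-7B points at real weights.
snapshot_download(repo_id="YanweiLi/Mini-Gemini-7B",
                  local_dir="work_dirs/Mini-Gemini-7B")

Compared with the usual stage-2 script, the main change here appears to be that --model_name_or_path points at the released fine-tuned checkpoint in work_dirs instead of the base LLM, so training continues from those weights.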

yanwei-li avatar Apr 16 '24 14:04 yanwei-li

(Quoting the fine-tuning command suggested above.)

Following this operation, the warning below is reported. Is this normal? I see that the code does not load the weights of vision_tower and vision_tower_aux when calling MiniGeminiLlamaForCausalLM.from_pretrained; those weights are re-loaded later from the clip/openclip checkpoints instead. Are they actually the same as the vision_tower and vision_tower_aux weights inside your pre-trained MiniGeminiLlamaForCausalLM, or could the two be inconsistent? (Of course, if vision_tower and vision_tower_aux were never fine-tuned it should be fine, but it doesn't feel very clean. Why not load those weights together with the rest of the model in MiniGeminiLlamaForCausalLM.from_pretrained?)

Some weights of the model checkpoint at model_zoo/YanweiLi/Mini-Gemini-34B-HD were not used when initializing MiniGeminiLlamaForCausalLM: ['model.vision_tower.vision_tower.vision_model.embeddings.class_embedding', 'model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.weight
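
One rough way to check whether the two copies of the tower weights agree is to compare a tensor from the tower inside the loaded model with the freshly loaded CLIP weights; a minimal sketch, assuming the paths used earlier in this thread and a LLaVA-style get_vision_tower() accessor on the model:

import torch
from transformers import CLIPVisionModel
from minigemini.model.builder import load_pretrained_model

# Load the full Mini-Gemini checkpoint; the vision towers are re-built from the CLIP paths.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="work_dirs/Mini-Gemini-2B", model_base=None,
    model_name='YanweiLi/Mini-Gemini-2B')

# Load the original CLIP vision encoder the main tower is initialized from.
clip = CLIPVisionModel.from_pretrained("model_zoo/OpenAI/clip-vit-large-patch14-336")

# If the tower was kept frozen during training, the tensors should match and the
# "weights were not used" warning is harmless.
tower = model.get_vision_tower().vision_tower
name = "vision_model.embeddings.patch_embedding.weight"
print(torch.allclose(tower.state_dict()[name].float().cpu(),
                     clip.state_dict()[name].float().cpu()))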

xylcbd avatar Apr 17 '24 02:04 xylcbd

(Quoting xylcbd's comment and the warning above.)

This is a good question.

HongLouyemeng avatar Apr 17 '24 03:04 HongLouyemeng

Has anyone completed the full fine-tuning pipeline?

Erickrus avatar Apr 18 '24 01:04 Erickrus

Same question. Has anyone run the entire fine-tuning process end to end?

a2382625920 avatar Apr 19 '24 07:04 a2382625920

On my side, six 4090 24G cards plus 300G of RAM can't handle it.

HongLouyemeng avatar Apr 19 '24 07:04 HongLouyemeng

Is fine-tuning really that demanding? For inference a single A100 is enough, though my RAM is on the small side, only 60G.

a2382625920 avatar Apr 19 '24 07:04 a2382625920

Also, on my side I'm using the dataset I previously used to fine-tune LLaVA, and it errors out before even reaching the model-loading step:

[2024-04-19 15:17:04,563] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/opt/conda/envs/minigemini/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 437, in main
    subprocess.check_call(ssh_check_cmd, stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL, shell=True)
  File "/opt/conda/envs/minigemini/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o PasswordAuthentication=no your_ip_0 hostname' returned non-zero exit status 255.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/minigemini/bin/deepspeed", line 6, in <module>
    main()
  File "/opt/conda/envs/minigemini/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 439, in main
    raise RuntimeError(
RuntimeError: Using hostfile at hostfile_4 but host=your_ip_0 was not reachable via ssh. If you are running with a single node please remove hostfile_4 or setup passwordless ssh.

a2382625920 avatar Apr 19 '24 07:04 a2382625920

For a single node, delete the hostfile. I suspect ZeRO-2 isn't working well here, so the heterogeneous (CPU-offload) memory isn't being utilized properly.

HongLouyemeng avatar Apr 19 '24 07:04 HongLouyemeng

The problem has been solved.

HongLouyemeng avatar Apr 26 '24 12:04 HongLouyemeng