
Finetuning llava-v1.6-34b model

Open adabadaramola opened this issue 1 year ago • 26 comments

Describe the issue

Issue: I am fine-tuning llava-v1.6-34b on some data, about 75,866 images with a resolution of 750x750 per image. I have tried fine-tuning with A100 80GB GPUs (6 devices, 8 devices) and H100 80GB GPUs (2 devices, 6 devices, and 8 devices). When training, it just exits and runs into an out-of-memory error, or it gives another error which I will provide.

What are the hardware specifications needed to successfully train the llava-v1.6-34b model, or is there another reason for this issue?

Command:

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path liuhaotian/llava-v1.6-34b \
    --version v1 \
    --data_path ./training001/metadata.json \
    --image_folder ./ \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.6-34b-task \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

Log:

1.
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Parameter Offload: Total persistent parameters: 1213440 in 369 params
Traceback (most recent call last):
  File "/workspace/LLaVA/llava/train/train_mem.py", line 4, in <module>
    train(attn_implementation="flash_attention_2")
  File "/workspace/LLaVA/llava/train/train.py", line 969, in train
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1687, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1198, in prepare
    result = self._prepare_deepspeed(*args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1234, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1563, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 362, in __init__
    self._setup_for_real_optimizer()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 465, in _setup_for_real_optimizer
    self._create_fp32_partitions()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 854, in _create_fp32_partitions
    self.device).clone().float().detach())
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.16 GiB. GPU 0 has a total capacty of 79.15 GiB of which 1.16 GiB is free. Process 3400995 has 77.98 GiB memory in use. Of the allocated memory 74.87 GiB is allocated by PyTorch, and 2.48 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-03-21 10:19:08,590] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1464
[2024-03-21 10:19:08,591] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=0', '--deepspeed', './scripts/zero3.json', '--model_name_or_path', 'liuhaotian/llava-v1.6-34b', '--version', 'v1', '--data_path', './training001/metadata.json', '--image_folder', './training001', '--vision_tower', 'openai/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'True', '--bf16', 'True', '--output_dir', './checkpoints/llava-v1.6-34b-task', '--num_train_epochs', '1', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = 1

2.
[2024-03-25 14:13:32,240] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3772
[2024-03-25 14:13:32,979] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3773
[2024-03-25 14:13:32,981] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3774
[2024-03-25 14:13:33,154] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3775
[2024-03-25 14:13:33,156] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3776
[2024-03-25 14:13:33,156] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3777
[2024-03-25 14:13:33,157] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=5', '--deepspeed', './scripts/zero3.json', '--model_name_or_path', 'liuhaotian/llava-v1.6-34b', '--version', 'v1', '--data_path', './training001/metadata.json', '--image_folder', './', '--vision_tower', 'openai/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'True', '--bf16', 'True', '--output_dir', './checkpoints/llava-v1.6-34b-task', '--num_train_epochs', '1', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = 1
root@04b9ed35b384:/workspace/LLaVA#


adabadaramola avatar Mar 25 '24 17:03 adabadaramola

@adabadaramola I have a slightly different issue. Can you please help me with it? I followed the same fine-tune script, and I am getting an error about not being able to import the llava module while executing the train_mem.py file.

[screenshot of the error]

If you can share the code/script you used to start fine-tuning, it would be helpful to me.

Thanks

Vish2427 avatar Mar 26 '24 04:03 Vish2427

@adabadaramola I didn't realize that v1.6 was fine-tunable yet. Were you able to fine-tune a smaller 7b model instead?

spillai avatar Mar 26 '24 18:03 spillai

@adabadaramola I have a slightly different issue. Can you please help me with it? I followed the same fine-tune script, and I am getting an error about not being able to import the llava module while executing the train_mem.py file.

[screenshot of the error]

If you can share the code/script you used to start fine-tuning, it would be helpful to me.

Thanks

You need to run pip install from the LLaVA repo folder using the command pip install -e .
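
For reference, a minimal install sequence (paraphrasing the upstream haotian-liu/LLaVA README; double-check the current README, as the extras may change):

# clone the repo and install it in editable mode so "import llava" resolves
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install --upgrade pip
pip install -e .
# training extras (deepspeed etc.) plus flash-attn, which train_mem.py expects
pip install -e ".[train]"
pip install flash-attn --no-build-isolation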

anidh avatar Mar 26 '24 18:03 anidh

I used 3 A100 80GB gpus for 1.6-34b and 1 A100 80GB for 1.6-mistral-7b. note: I've only tried this for low rank fine-tuning, not full! https://github.com/arielnlee/LLaVA-1.6-ft
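
If it helps, a hypothetical way to get started with that repo (the script name below is an assumption modeled on upstream LLaVA's LoRA scripts; check the repo's README for the actual entry point and arguments):

# clone the 1.6 fine-tuning fork and install it
git clone https://github.com/arielnlee/LLaVA-1.6-ft.git
cd LLaVA-1.6-ft
pip install -e ".[train]"
# hypothetical launch; the real script name and flags live in the repo's README
bash scripts/finetune_lora.sh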

arielnlee avatar Mar 27 '24 20:03 arielnlee

@adabadaramola I didn't realize that v1.6 was fine-tunable yet. Were you able to fine-tune a smaller 7b model instead?

Yeah, I saw someone fine-tune the smaller 7B of v1.6 on YouTube. No, I did not try that; I felt the 34B was more suitable for my use case.

adabadaramola avatar Mar 27 '24 23:03 adabadaramola

I used 3 A100 80GB gpus for 1.6-34b and 1 A100 80GB for 1.6-mistral-7b. note: I've only tried this for low rank fine-tuning, not full! https://github.com/arielnlee/LLaVA-1.6-ft

Thanks for this, I will try it out

adabadaramola avatar Mar 27 '24 23:03 adabadaramola

I used 3 A100 80GB gpus for 1.6-34b and 1 A100 80GB for 1.6-mistral-7b. note: I've only tried this for low rank fine-tuning, not full! https://github.com/arielnlee/LLaVA-1.6-ft

Thanks, the script works. I have been trying to evaluate after training but keep getting errors. Can you please help or tell me how you go about it? I already have the checkpoints in the output dir.

adabadaramola avatar Mar 31 '24 05:03 adabadaramola

Hey, @arielnlee Do you have a notebook for fine-tuning 1.6-34b?

Iven2132 avatar Mar 31 '24 14:03 Iven2132

I used 3 A100 80GB gpus for 1.6-34b and 1 A100 80GB for 1.6-mistral-7b. note: I've only tried this for low rank fine-tuning, not full! https://github.com/arielnlee/LLaVA-1.6-ft

Thanks, the script works. I have been trying to evaluate after training but keep getting errors. Can you please help or tell me how you go about it? I already have the checkpoints in the output dir.

Do you mean the model is outputting errors or when you try to run you get errors? Before evaluating, I merge the LoRA weights back onto the base model. Then I eval on the “merged” fine-tune. There’s a python file in scripts that you can use to merge.
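
For reference, a sketch of that merge step, assuming the merge_lora_weights.py helper from the upstream LLaVA scripts folder (paths below are placeholders, and flag names may differ in forks):

# merge the LoRA adapter back onto the base model before evaluating
python scripts/merge_lora_weights.py \
    --model-path ./checkpoints/llava-v1.6-34b-task-lora \
    --model-base liuhaotian/llava-v1.6-34b \
    --save-model-path ./checkpoints/llava-v1.6-34b-task-merged
# then point the evaluation scripts at the merged checkpoint directory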

arielnlee avatar Mar 31 '24 14:03 arielnlee

Hey, @arielnlee Do you have a notebook for fine-tuning 1.6-34b?

I don’t, but I can throw one together this week!

arielnlee avatar Mar 31 '24 14:03 arielnlee

Hey, @arielnlee Do you have a notebook for fine-tuning 1.6-34b?

I don’t, but I can throw one together this week!

It would be great, I'd love to chat with you. What's your email? Btw I contacted you on your website :)

Iven2132 avatar Mar 31 '24 14:03 Iven2132

Hi, based on my experience fine-tuning Yi-34B, it seems you need to use a batch size of 1 and zero3_offload.
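
Concretely, that amounts to two changes to the command in the issue (a sketch, assuming the zero3_offload.json config included in the LLaVA scripts folder):

# use the CPU-offload ZeRO-3 config and drop the per-device batch size to 1;
# raise gradient_accumulation_steps to keep a similar effective batch size
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    ...  # remaining flags unchanged from the original command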

Linziyang1999 avatar Apr 03 '24 00:04 Linziyang1999

I used 3 A100 80GB gpus for 1.6-34b and 1 A100 80GB for 1.6-mistral-7b. note: I've only tried this for low rank fine-tuning, not full! https://github.com/arielnlee/LLaVA-1.6-ft

Hi! LLaVA trains the ViT parameters in its second training stage, but they didn't release their trained ViT parameters. Could you tell me how you handled this problem, or did you just use the raw CLIP parameters? Thanks!

Linziyang1999 avatar Apr 03 '24 00:04 Linziyang1999

Hey, @arielnlee Do you have a notebook for fine-tuning 1.6-34b?

I don’t, but I can throw one together this week!

Hey, @arielnlee Let me know if you got something :)

Iven2132 avatar Apr 05 '24 15:04 Iven2132

Hey, @arielnlee Do you have a notebook for fine-tuning 1.6-34b?

I don’t, but I can throw one together this week!

@arielnlee How did you fine-tune LLaVA 1.6 34b? Do you have any resources for this?

jsm69 avatar Apr 07 '24 10:04 jsm69

Hey, @arielnlee Do you have a notebook for fine-tuning 1.6-34b?

I don’t, but I can throw one together this week!

Hey, @arielnlee Let me know if you got something :)

Apologies, the week got away from me, but it's still on my list. In the meantime it should work by using the repo!

arielnlee avatar Apr 08 '24 01:04 arielnlee

Hey, @arielnlee Do you have a notebook for fine-tuning 1.6-34b?

I don’t, but I can throw one together this week!

@arielnlee How did you fine-tune LLaVA 1.6 34b? Do you have any resources for this?

I have a question. I've been trying to fine-tune LLaVA 7B; everything works, but the results did not change at all. I just wanted it to label one image according to what I fine-tuned it on, but it still recognizes the image as something else.

How big is your fine-tuning dataset? And what's the task? For my specific use-case, the scripts work well. The size of my dataset is ~20k.

arielnlee avatar Apr 08 '24 01:04 arielnlee

Hey, @arielnlee Do you have a notebook for fine-tuning 1.6-34b?

I don’t, but I can throw one together this week!

Hey, @arielnlee Let me know if you got something :)

Apologies, the week got away from me, but it's still on my list. In the meantime it should work by using the repo!

@arielnlee I'm curious: about how many training examples are needed to fine-tune LLaVA and get results? Also, can you make a notebook this week?

Iven2132 avatar Apr 08 '24 13:04 Iven2132

Hey, @arielnlee Do you have a notebook for fine-tuning 1.6-34b?

I don’t, but I can throw one together this week!

Hey, @arielnlee Let me know if you got something :)

Apologies, the week got away from me, but it's still on my list. In the meantime it should work by using the repo!

@arielnlee I'm curious: about how many training examples are needed to fine-tune LLaVA and get results? Also, can you make a notebook this week?

I am trying to feed it the same image that I used for fine-tuning, but it still predicts it wrong. Am I getting anything wrong?

moaldeen avatar Apr 10 '24 22:04 moaldeen

Hi, how do you know the training was effective? Did you use the default training settings? I ran LoRA with the default parameters and saw basically no improvement.

fisher75 avatar Apr 25 '24 13:04 fisher75

Hi, how do you know the training was effective? Did you use the default training settings? I ran LoRA with the default parameters and saw basically no improvement.

I don't think the training script for 1.5 works for 1.6 at the moment. I looked into llava/train/train.py, llava/model/builder.py, and llava/model/language_model, and noticed that they are not compatible with training 1.6. For example, even though I tried to fine-tune LLaVA 1.6 Mistral, the training file instantiated a LLaVA-LLaMA model for me, because in train.py only the LLaMA and MPT model classes are set up to be instantiated. I think if you want to fine-tune 1.6, you need to change many of the files manually.
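
A quick way to check this in your own checkout (a simple grep; class names may differ across versions):

# list which model classes train.py and builder.py can instantiate
grep -n "ForCausalLM" llava/train/train.py llava/model/builder.py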

songchx24 avatar Apr 25 '24 16:04 songchx24

Hi, how do you know the training was effective? Did you use the default training settings? I ran LoRA with the default parameters and saw basically no improvement.

I don't think the training script for 1.5 works for 1.6 at the moment. I looked into llava/train/train.py, llava/model/builder.py, and llava/model/language_model, and noticed that they are not compatible with training 1.6. For example, even though I tried to fine-tune LLaVA 1.6 Mistral, the training file instantiated a LLaVA-LLaMA model for me, because in train.py only the LLaMA and MPT model classes are set up to be instantiated. I think if you want to fine-tune 1.6, you need to change many of the files manually.

@songchx24 Check here: https://github.com/arielnlee/LLaVA-1.6-ft

arielnlee avatar Apr 25 '24 17:04 arielnlee

Hi, how do you know the training was effective? Did you use the default training settings? I ran LoRA with the default parameters and saw basically no improvement.

I don't think the training script for 1.5 works for 1.6 at the moment. I looked into llava/train/train.py, llava/model/builder.py, and llava/model/language_model, and noticed that they are not compatible with training 1.6. For example, even though I tried to fine-tune LLaVA 1.6 Mistral, the training file instantiated a LLaVA-LLaMA model for me, because in train.py only the LLaMA and MPT model classes are set up to be instantiated. I think if you want to fine-tune 1.6, you need to change many of the files manually.

@songchx24 Check here: https://github.com/arielnlee/LLaVA-1.6-ft

So you have successfully changed the code to fine-tune 1.6? Nice! Thanks for the info!

Btw, may I ask whether it supports LoRA for all 1.6 variants or just Mistral?

fisher75 avatar Apr 25 '24 17:04 fisher75

Hi, how do you know the training was effective? Did you use the default training settings? I ran LoRA with the default parameters and saw basically no improvement.

I don't think the training script for 1.5 works for 1.6 at the moment. I looked into llava/train/train.py, llava/model/builder.py, and llava/model/language_model, and noticed that they are not compatible with training 1.6. For example, even though I tried to fine-tune LLaVA 1.6 Mistral, the training file instantiated a LLaVA-LLaMA model for me, because in train.py only the LLaMA and MPT model classes are set up to be instantiated. I think if you want to fine-tune 1.6, you need to change many of the files manually.

OK, that explains a lot, because when I tried 1.6 it showed basically no improvement after LoRA with the main repo. Btw, @arielnlee has shared a repo; I think someone there has already changed the code to make it suitable for 1.6 fine-tuning.

fisher75 avatar Apr 25 '24 17:04 fisher75

If someone has fine-tuned the 1.6 models (7B and 13B), can you mention the minimum hardware requirements?

babuus avatar Jul 17 '24 13:07 babuus

If someone has fine-tuned the 1.6 models (7B and 13B), can you mention the minimum hardware requirements?

@babuus Not sure of the minimum requirements, but it seems 1 A100 80GB works, referring to https://github.com/haotian-liu/LLaVA/issues/1335#issuecomment-2023922331

Yuanyuan-Shen avatar Jul 24 '24 21:07 Yuanyuan-Shen