
[FEATURE: ADD LISA ALGORITHM]

Open qibaoyuan opened this pull request 10 months ago • 10 comments

What does this PR do?

NEW FEATURE: ADD LISA ALGORITHM, SEE: https://arxiv.org/abs/2403.17919
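
In short, LISA keeps the embeddings and LM head trainable, freezes all intermediate decoder layers, and re-activates a small random subset of them every fixed number of steps, so only a few layers carry gradients and optimizer state at any one time. A minimal sketch of the layer-switching step (the attribute names assume a LLaMA-style HF model and are illustrative, not this PR's code):

import random
import torch.nn as nn

def switch_active_layers(model: nn.Module, n_active: int = 2) -> None:
    # Freeze every decoder layer, then unfreeze a random subset (LISA-style).
    layers = model.model.layers
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad_(False)
    # Embeddings and LM head stay trainable throughout training.
    for p in model.model.embed_tokens.parameters():
        p.requires_grad_(True)
    for p in model.lm_head.parameters():
        p.requires_grad_(True)
    # Sample which decoder layers participate until the next switch.
    for idx in random.sample(range(len(layers)), k=n_active):
        for p in layers[idx].parameters():
            p.requires_grad_(True)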

Before submitting

qibaoyuan avatar Apr 02 '24 10:04 qibaoyuan

fixes: https://github.com/hiyouga/LLaMA-Factory/issues/3087

hiyouga avatar Apr 02 '24 14:04 hiyouga

Takes https://github.com/OptimalScale/LMFlow/issues/726

hiyouga avatar Apr 03 '24 06:04 hiyouga

When combining LISA with multiple GPUs, ZeRO-3, and gradient checkpointing, the following error occurs:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cpu! (when checking argument for argument tensors in method wrapper_CUDA_cat)

return torch._C._nn.flatten_dense_tensors(tensors)
single_grad_partition = self.flatten(self.averaged_gradients[sub_group_id]).to...

yetionyo avatar Apr 05 '24 14:04 yetionyo

I came up with the code below. The id of the optimizer changes when on_train_epoch_start is called. Downsides: it still requires the lightning package to be installed and can only be run in a separate project/Python file, similar to this example: https://lightning.ai/lightning-ai/studios/code-lora-from-scratch . Further updates will be reported.


import torch
import lightning as L
import lightning.pytorch as pl

# Every `epoch_interval` epochs: re-sample the active layers, then rebuild the optimizer.
def on_train_epoch_start(self, trainer: "L.Trainer", pl_module: "pl.LightningModule"):
    if trainer.current_epoch % self.epoch_interval == 0:
        self.switch_active_layers()
        pl_module.optimizer_fn = torch.optim.Adam
        trainer.strategy.setup_optimizers(trainer)
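
Since LLaMA-Factory builds on the HuggingFace Trainer rather than Lightning, roughly the same idea can be written as a TrainerCallback. The sketch below is only illustrative (switch_active_layers and interval_steps are placeholders, not the code in this PR):

from transformers import TrainerCallback

class LisaCallback(TrainerCallback):
    def __init__(self, interval_steps, switch_active_layers):
        self.interval_steps = interval_steps
        self.switch_active_layers = switch_active_layers  # re-samples the trainable layers

    def on_step_begin(self, args, state, control, **kwargs):
        # Re-sample the active layers every `interval_steps` optimizer steps.
        if state.global_step % self.interval_steps == 0:
            self.switch_active_layers()

Unlike the Lightning version above, this does not rebuild the optimizer, so stale or missing optimizer state still has to be handled separately, which is part of what makes the ZeRO-3 case above tricky.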

qibaoyuan avatar Apr 07 '24 06:04 qibaoyuan

[Screenshot 2024-04-11 15:44:07: training loss curves]

I have conducted experiments on llama2-7b using the full, lisa_2, and lisa_32 methods. From the screenshot above, you can see that the training loss curve decreases and that full is the same as lisa_32.

The latest code borrows some of the implementation from LMFlow and Axolotl. Some implementation details have been cleaned up, and a debug option is provided.
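
For the debug option, a helper along these lines (purely hypothetical, not necessarily the PR's actual logging) makes it easy to verify which layers are active after each switch, again assuming a LLaMA-style model.model.layers layout:

def log_trainable_layers(model, logger):
    # Hypothetical helper: report which decoder layers currently require gradients.
    active = [
        idx for idx, layer in enumerate(model.model.layers)
        if any(p.requires_grad for p in layer.parameters())
    ]
    logger.info(f"LISA active decoder layers: {active}")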

Hope this will be merged.

qibaoyuan avatar Apr 11 '24 09:04 qibaoyuan

I tried this and noticed that fine-tuning Qwen/Qwen1.5-0.5B consumes more than 18 GB of VRAM with the following config. Is this expected? (A quick diagnostic sketch follows the system info below.)

Config
#!/bin/bash

CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path Qwen/Qwen1.5-0.5B \
    --dataset mhqg_1k \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type full \
    --use_lisa \
    --lisa_activated_layers 2 \
    --lisa_interval_steps 5 \
    --output_dir ../../saves/Qwen1.5-0.5B/lisa/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 3192 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
System Info
$ uname -a
Linux 6bf7eb606868 5.4.0-152-generic #169-Ubuntu SMP Tue Jun 6 22:23:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               Off | 00000000:81:00.0 Off |                  Off |
| 30%   32C    P0              56W / 230W |      1MiB / 24564MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
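
One quick way to check whether the high VRAM usage comes from more parameters being trainable than intended is to count them right after LISA switches layers. This is a generic PyTorch diagnostic, not part of the PR:

def report_trainable(model):
    # With lisa_activated_layers=2, only a small fraction of the 0.5B parameters
    # (plus embeddings and the LM head) should require gradients.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")

If that fraction looks right, the remaining memory is more likely activations (cutoff_len is 3192 here) and gradient/optimizer buffers than extra trainable weights.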

neteroster avatar Apr 12 '24 05:04 neteroster

When I used LISA to fine-tune Llama-2-7b on alpaca_gpt4_en with one A100 80G, the memory usage increased sharply and exceeded 80 GB. I want to know how to solve this problem... (A rough memory estimate is sketched after the screenshots below.)

Config:

CUDA_VISIBLE_DEVICES=2 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --dataset alpaca_gpt4_en \
    --dataset_dir data \
    --template default \
    --finetuning_type full \
    --use_lisa 1 \
    --lisa_verbose 1 \
    --lisa_activated_layers 2 \
    --lisa_interval_steps 3 \
    --output_dir saves/Llama-2-7b-chat-lisa-2-3 \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 5 \
    --warmup_steps 0 \
    --save_steps 30000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

Error: [screenshot]

GPU info when running: [screenshot]
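
For scale, here is a rough back-of-the-envelope estimate of why full-parameter training can spike past 80 GB if AdamW state ends up allocated for every parameter instead of only the activated layers. This is an assumption about the failure mode, not a confirmed diagnosis; the byte counts are the usual fp16/fp32 conventions:

# Back-of-the-envelope memory for Llama-2-7b (~7e9 params), under the assumption
# that gradients and AdamW state exist for *all* parameters, not just 2 layers.
params = 7e9
gib = 1024 ** 3
weights = params * 2 / gib            # ~13 GiB, fp16 weights
grads = params * 2 / gib              # ~13 GiB, fp16 gradients
adam_state = params * (4 + 4) / gib   # ~52 GiB, fp32 exp_avg + exp_avg_sq
print(f"~{weights + grads + adam_state:.0f} GiB before activations")  # ~78 GiB

If the AdamW state were limited to the two activated layers plus embeddings and the LM head, that ~52 GiB term would shrink to a few GiB, which is the saving LISA is supposed to provide.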

lovekdl avatar Apr 19 '24 16:04 lovekdl

@neteroster Hello. I have the same problem as you. Have you solved it?

lovekdl avatar Apr 23 '24 09:04 lovekdl

@lovekdl Not yet.

neteroster avatar Apr 23 '24 15:04 neteroster