
LISA

Open timothelaborie opened this issue 1 year ago • 7 comments

Feature request

Suggestion: Add a "use_lisa" parameter to use this

Motivation

According to the LISA paper, it performs better than LoRA.

Your contribution

There is an implementation here

timothelaborie avatar Mar 28 '24 13:03 timothelaborie

Thanks for the pointer. Indeed, I had a look at this earlier today. IIUC, this works by freezing all the attention layer weights and only unfreezing a random subset of them for each step. The results as shown in the paper look promising.

This approach has a couple of disadvantages compared to existing PEFT techniques like LoRA, e.g.:

  • All the base model parameters are being trained, not just a small set of parameters, so the whole model must be saved. With LoRA et al., only a relatively small set of weights needs to be saved.
  • There is no possibility to dynamically deactivate the fine-tuned weights (no with model.disable_adapter() context).
  • There is no possibility to load multiple adapters onto the same base model (see the sketch after this list for the PEFT features in question).
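
For reference, here is a minimal sketch of the PEFT features in question; the model name and adapter paths are just placeholders:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Hypothetical base model and adapter checkpoints, for illustration only
base = AutoModelForCausalLM.from_pretrained("base-model")
model = PeftModel.from_pretrained(base, "path/to/lora-adapter-A", adapter_name="A")

# Dynamically deactivate the fine-tuned weights
with model.disable_adapter():
    ...  # forward passes here use only the base model weights

# Load a second adapter onto the same base model and switch between them
model.load_adapter("path/to/lora-adapter-B", adapter_name="B")
model.set_adapter("B")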

Despite these disadvantages, LISA could be useful for some users if they don't need the mentioned features.

When it comes to training code, it gets a bit tricky. PEFT intentionally does not provide any training code, as this is out of scope and there are enough training frameworks out there already. However, we need to run the freezing/unfreezing function at each step. This cannot be achieved without adjusting the training code, so it is not really something PEFT can do on its own.

What I imagine PEFT could do is to provide a callback similar to the linked code that can be used in conjunction with transformers Trainer and its subclasses. Then it would be up to the user to ensure that this callback is correctly used.

If you're interested in trying your hand at this, please let me know.

BenjaminBossan avatar Mar 28 '24 13:03 BenjaminBossan

What I imagine PEFT could do is to provide a callback similar to the linked code that can be used in conjunction with transformers Trainer and its subclasses. Then it would be up to the user to ensure that this callback is correctly used.

You can almost use the code in their repo as is. This works for me:

lisa_activated_layers = 1
lisa_interval_steps = 20

from transformers import TrainerCallback
import numpy as np

# source: https://github.com/OptimalScale/LMFlow/blob/main/src/lmflow/pipeline/finetuner.py
class DynamicLayerActivationCallback(TrainerCallback):
    def __init__(self, n_layers, interval_steps, model):
        super().__init__()
        self.n_layers = n_layers
        self.interval_steps = interval_steps
        self.model = model
        # Determine the way to access layers based on the model type
        if self.model.__class__.__name__ == 'LlamaForCausalLM':
            self.layers_attribute = 'model.model.layers'  # Layer access path for LlamaForCausalLM
        else:
            self.layers_attribute = 'model.transformer.h'  # Fallback access path for GPT-style models
        self.total_layers = len(eval('self.' + self.layers_attribute))  # Dynamically execute to get the number of layers

        # Freeze all layers upon initialization
        self.freeze_all_layers()
        self.active_layers_indices = []

    def freeze_all_layers(self):
        layers = eval('self.' + self.layers_attribute)  # Dynamically execute to get layers
        for layer in layers:
            for param in layer.parameters():
                param.requires_grad = False

    def on_step_begin(self, args, state, control, **kwargs):
        # Check if it's time to switch active layers, including at step 0
        if state.global_step % self.interval_steps == 0 or state.global_step == 1:
            self.switch_active_layers()

    def switch_active_layers(self):
        # First, disable gradients for all layers
        self.freeze_all_layers()

        # Randomly select n_layers to activate
        layers = eval('self.' + self.layers_attribute)  # Re-fetch layer references
        self.active_layers_indices = np.random.choice(range(self.total_layers), self.n_layers, replace=False)
        print(f"Activating layers at indices: {self.active_layers_indices} for the next steps.")

        # Enable gradients only for the selected layers
        for idx in self.active_layers_indices:
            for param in layers[idx].parameters():
                param.requires_grad = True

# Instantiate the callback
dynamic_layer_activation_callback = DynamicLayerActivationCallback(
    n_layers=lisa_activated_layers,                     # Number of layers to activate
    interval_steps = lisa_interval_steps,               # Step interval to update active layers
    model = model
)

trainer.add_callback(dynamic_layer_activation_callback)

geronimi73 avatar Mar 30 '24 11:03 geronimi73

You can almost use the code in their repo as is

Yes, I expected as much. We could think about adding this to PEFT (with proper reference to the original) to give LISA more exposure. We could also just add a mention to the docs with the cited pros and cons of LISA and a link to the snippet; I don't have a strong opinion.

BenjaminBossan avatar Apr 02 '24 10:04 BenjaminBossan

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar Apr 27 '24 15:04 github-actions[bot]

notice

Trangle avatar Apr 30 '24 05:04 Trangle

So the implementation from LMFlow doesn't actually work: https://github.com/OptimalScale/LMFlow/issues/726. Perhaps the optimizer states of the frozen layers could be stored on the CPU and then moved back to the GPU when needed?
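
A rough, untested sketch of what moving the per-parameter optimizer state between devices could look like (this assumes Adam-style state tensors and that the optimizer is available in the callback kwargs, which the transformers TrainerCallback hooks such as on_step_begin should receive):

import torch

def move_optimizer_state(optimizer, params, device):
    # Move the per-parameter state tensors (e.g. Adam's exp_avg / exp_avg_sq)
    # of the given parameters to `device`, freeing or reclaiming GPU memory.
    for p in params:
        state = optimizer.state.get(p, {})
        for key, value in state.items():
            if torch.is_tensor(value):
                state[key] = value.to(device, non_blocking=True)

# Inside switch_active_layers, one could then do something like:
#   move_optimizer_state(optimizer, newly_frozen_params, "cpu")
#   move_optimizer_state(optimizer, newly_active_params, "cuda")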

timothelaborie avatar Apr 30 '24 09:04 timothelaborie

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar May 24 '24 15:05 github-actions[bot]