Open-Assistant

Add LoRA and Prefix-Tuning as Modeling Options for Improved Memory Efficiency and (Potentially) Performance

Open jordiclive opened this issue 1 year ago • 8 comments

This PR adds LoRA and prefix-tuning as modelling options (training and sampling code).

Both have shown strong performance and can outperform full fine-tuning. They also help protect against catastrophic forgetting, which is important for chatbots. Because they keep the whole base language model frozen, the trained adapter weights can be distributed freely, independently of the base model.

They also allow much more memory-efficient training, since no optimizer states need to be kept for the frozen base-model parameters.
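For context, wrapping the base model with a LoRA adapter via the PEFT library looks roughly like the sketch below; the model name and hyperparameters are illustrative assumptions, not the exact values used in this PR.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative base model; the PR targets LLaMA-family models up to 30B/65B.
base_model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16
)

# Hypothetical LoRA hyperparameters: rank, scaling, dropout, and which
# attention projections to adapt.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

Because only the adapter matrices require gradients, the optimizer keeps states for a few million parameters instead of tens of billions, which is where most of the memory savings come from.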

Benefits of LoRA

  • Can train 30B with only DeepSpeed ZeRO stage 2.
  • Can run 65B without heavy CPU usage.
  • Less overfitting; performance on the existing datasets is maintained.
  • Only the adapter component needs to be shared/pushed to the Hub (a small file of a few MB).

— See Andrej Karpathy's (OpenAI) comment — See the purported Google leak

Implementation Details:

  • Explicitly set input_ids as a keyword argument in the sampling code so that it works correctly with PEFT's generate (a sketch covering these points follows the list).
  • Manually enable gradients on the inputs so that gradient checkpointing works; with frozen embeddings, there would otherwise be nothing for gradients to attach to.
  • Include code to save the special-token embeddings (normally part of pytorch_model.bin). Although these tokens are randomly initialized, they must be stored and saved as an external module because the PEFT parameters learn to make use of them. Making them trainable is an option, but it is unlikely to make a significant difference.
  • Integrate prefix-tuning, a powerful technique that stacks learned keys and values onto each attention layer, by incorporating custom LLaMA modeling code. (Training speed is lower than LoRA in my initial tests, and performance is similar or worse.)
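A minimal sketch of the first three points, assuming the `base_model`/`model` from the LoRA sketch above; the prompt, the number of added tokens, and the save helper are hypothetical, not the PR's actual code.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # matches the base model above

# 1) Gradient checkpointing with frozen embeddings: the checkpointed activations
#    would otherwise have requires_grad=False, so enable input gradients explicitly.
base_model.gradient_checkpointing_enable()
base_model.enable_input_require_grads()
# (Equivalent manual version: register a forward hook on get_input_embeddings()
#  that calls output.requires_grad_(True).)

# 2) Sampling: pass input_ids explicitly as a keyword argument so PEFT's
#    generate() routes it to the underlying model correctly.
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(base_model.device)
output_ids = model.generate(input_ids=inputs["input_ids"], max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# 3) Hypothetical helper: store only the embedding rows of newly added special
#    tokens so they can be re-attached to the frozen base model at load time.
def save_special_token_embeddings(model, tokenizer, num_new_tokens, path):
    embeddings = model.get_input_embeddings().weight.detach()
    torch.save(
        {
            "added_tokens": tokenizer.get_added_vocab(),
            "new_token_embeddings": embeddings[-num_new_tokens:].clone().cpu(),
        },
        path,
    )
```

The PR implements prefix-tuning with custom LLaMA modeling code; as a point of reference only, PEFT's built-in prefix-tuning config (which likewise prepends learned key/value vectors to every attention layer) looks like this, with a hypothetical prefix length:

```python
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

# Load a fresh base model (not one already wrapped with LoRA).
prefix_base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=30,  # hypothetical prefix length
)
prefix_model = get_peft_model(prefix_base, prefix_config)
prefix_model.print_trainable_parameters()
```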

jordiclive avatar Apr 22 '23 12:04 jordiclive

:x: pre-commit failed. Please run `pre-commit run --all-files` locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar Apr 22 '23 12:04 github-actions[bot]

:x: pre-commit failed. Please run `pre-commit run --all-files` locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar Apr 22 '23 12:04 github-actions[bot]

:x: pre-commit failed. Please run `pre-commit run --all-files` locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar Apr 22 '23 13:04 github-actions[bot]

@jordiclive would this PR make it possible to load a PEFT model for inference in the chat?

smeyerhot avatar Apr 25 '23 19:04 smeyerhot

@smeyerhot This code is currently just for model training and evaluation, but it should be trivial to load it for inference since it uses the same HF generate method.
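For reference, loading a trained LoRA adapter for inference with PEFT would look roughly like this; the repo IDs and prompt below are placeholders, not artifacts produced by this PR.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_repo = "huggyllama/llama-30b"      # placeholder base model
adapter_repo = "your-org/oa-lora-30b"   # placeholder adapter pushed to the Hub

tokenizer = AutoTokenizer.from_pretrained(base_repo)
base = AutoModelForCausalLM.from_pretrained(
    base_repo, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_repo)
model.eval()

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(base.device)
with torch.no_grad():
    output_ids = model.generate(input_ids=inputs["input_ids"], max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The only extra step compared with plain HF inference is the PeftModel.from_pretrained wrapping; generation itself goes through the same generate API.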

jordiclive avatar Apr 26 '23 22:04 jordiclive

@andreaskoepf I am going to run a 30B LoRA model just on the SFT datasets and will post the sampling report.

jordiclive avatar Apr 29 '23 18:04 jordiclive

:x: pre-commit failed. Please run `pre-commit run --all-files` locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar May 06 '23 10:05 github-actions[bot]

:x: pre-commit failed. Please run `pre-commit run --all-files` locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar May 06 '23 10:05 github-actions[bot]