
Supervised finetuning: check parameter-efficient finetuning for large models

justheuristic opened this issue 2 years ago • 10 comments

[Based on a Discord conversation with @yk, Huu Nguyen, @christophschuhmann; please edit if I got something wrong.]

Hypothesis: training adapters on top of a much larger model might result in a better model than fully fine-tuning a small (e.g. 6B) model.

How to check the hypothesis:

We already have an issue for supervised training (#48), so we can borrow setup from there.

We can test with one of the two public model families available in both 6-11B and 100B+ sizes: BLOOM or OPT (please tell me if I missed something).

  • H0: take the 6-11B model and fine-tune all parameters
  • H1: take a larger model and train with one of the following methods:
    • [1] Prefix Tuning https://arxiv.org/abs/2101.00190
    • [2] LoRA https://arxiv.org/abs/2106.09685
    • [3] IA3 / T-few https://arxiv.org/abs/2205.05638

... and then check how it affects the quality of the trained model.
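For concreteness, here is a minimal sketch of how option [2] (LoRA) keeps the base model frozen while training only a small low-rank update. This is an illustrative PyTorch toy, not the exact setup from #48; the rank/alpha values and the choice of which layers to wrap are placeholder assumptions.

```python
# Minimal LoRA sketch (PyTorch). Only lora_a / lora_b receive gradients;
# the wrapped nn.Linear stays frozen. Rank and alpha are placeholder values.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update (B @ A)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze original weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank trainable path
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```

In practice one would wrap e.g. the attention projection layers of the frozen 100B+ model and pass only the LoRA parameters to the optimizer, so the optimizer state stays tiny compared to full finetuning.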

If the large model is better by a sufficient margin, @yk mentioned that it might be better to distill it to a smaller size, rather than training 6-11B in full. I've also added several alternatives in the next section.

Related concerns

How do we fit a larger model onto a user's GPU? @TimDettmers is working on extreme quantization methods that could help run a large model on a small GPU, based on this earlier work
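For reference, the existing LLM.int8() integration (transformers + bitsandbytes) already allows loading a model in 8-bit; the "extreme" quantization mentioned above would push below that. A rough sketch, with the model name chosen purely as an example:

```python
# Sketch: load a causal LM in 8-bit via bitsandbytes (LLM.int8()).
# Requires `pip install transformers accelerate bitsandbytes`; the model name is an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # placeholder; any causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # spread layers across available GPUs / CPU
    load_in_8bit=True,   # int8 weights, roughly half the memory of fp16
)

inputs = tokenizer("Hello, Open-Assistant!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```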

Third alternative for hosting large models: if we are unable to distill large models, another way to avoid paying for a supercomputer is to ask contributors to run the model in a distributed fashion, as in Petals or AI Horde

Licensing: @yk mentioned that BLOOM, BLOOMZ, and OPT all have licenses that might rule them out for Open-Assistant. Huu Nguyen mentioned that larger models without these licensing issues are coming.

justheuristic avatar Dec 31 '22 15:12 justheuristic

Thank you very much for the elaboration, I agree the plan is very suitable. We absolutely need to know how large the difference between small and large models is!

From my limited view, OPT seems the more suitable of the two, both because of the license and because I believe it is a bit better performance-wise. Correct me if I'm wrong.

yk avatar Dec 31 '22 15:12 yk

We can hopefully compare the performance by testing both. Suggestions for [good alternative 100B+ models that also have 6-11B versions] or [parameter-efficient finetuning methods that might be of interest here] are very welcome.

justheuristic avatar Dec 31 '22 15:12 justheuristic

I think this is a promising direction; however, looking at the numbers mentioned in the conversation, I wonder: is there a PEFT method that would allow us to swap a 6B model for a 100B model on the same hardware setup? For instance, the LoRA paper abstract states that:

LoRA can reduce the GPU memory requirement by 3 times.

As far as I'm aware, the bottleneck is that PEFT methods still need to have a full copy of the model in memory for the forward pass.
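To put very rough numbers on that bottleneck (back-of-envelope only, assuming fp16 weights and mixed-precision Adam for the trained parameters; these are assumptions, not measurements):

```python
# Back-of-envelope GPU memory estimates in GB (weights + grads + Adam states,
# ignoring activations). All figures are rough assumptions, not measurements.
def full_finetune_gb(params_billions: float) -> float:
    # fp16 weights (2 B) + fp16 grads (2 B) + fp32 master copy and Adam m/v (12 B)
    return params_billions * (2 + 2 + 12)

def frozen_forward_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    # frozen model held only for the forward pass (fp16 = 2 B, int8 = 1 B)
    return params_billions * bytes_per_param

print(full_finetune_gb(6))         # ~96 GB: fully finetuning a 6B model
print(frozen_forward_gb(100))      # ~200 GB: frozen 100B model in fp16
print(frozen_forward_gb(100, 1))   # ~100 GB: frozen 100B model in int8
```

So even with a PEFT method, holding the frozen 100B model for the forward pass dominates the memory budget unless quantization or offloading brings it down.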

mrcabbage972 avatar Dec 31 '22 22:12 mrcabbage972

Just in addition: I believe the simplest PEFT setup is to freeze most of the model and let the optimizer update only some parameters - for example, BitFit https://arxiv.org/abs/2106.10199 does this by updating only the bias params.

I find this approach simpler than using Adapters, etc., because the training needs minimal adjustments compared to standard training/finetuning, and the underlying model - architecture / HF class / model config - is exactly the same. Not sure if anyone has done a systematic study of the best selection of parameters to update, though.
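A rough sketch of that setup in PyTorch (the model here is a placeholder; with an HF model you would filter on the parameter names of the actual architecture):

```python
# BitFit-style sketch: freeze everything except bias parameters and give
# only the trainable subset to the optimizer. The model is a placeholder.
import torch
from torch import nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6
)

for name, param in model.named_parameters():
    param.requires_grad = "bias" in name   # update only bias terms, as in BitFit

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(f"trainable params: {sum(p.numel() for p in trainable):,}")
```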

prompteus avatar Jan 01 '23 11:01 prompteus

I have a related project and have reproduced LoRA. It did reduce the GPU memory cost, since it greatly reduces the number of optimized parameters on the GPU and allows offloading all the other parameters. Empirically, it is better to add the extra low-rank matrices to the dense layers with the largest parameter counts.

Desein-Yang avatar Jan 03 '23 10:01 Desein-Yang

The Parallel Adapter and MAM Adapter proposed here (https://arxiv.org/abs/2110.04366) may work too, and they report better performance for parameter-efficient tuning.

Desein-Yang avatar Jan 03 '23 10:01 Desein-Yang

Is anyone working on this now?

ekurtulus avatar Jan 11 '23 18:01 ekurtulus

[I am, will share status in a few days; if you wanna contribute, I'd be happy to chat, just ping me]

justheuristic avatar Jan 13 '23 15:01 justheuristic

[I am, will share status in a few days; if you wanna contribute, I'd be happy to chat, just ping me]

@justheuristic, are you on the Discord server? If so, may I ask your username?

ekurtulus avatar Jan 29 '23 20:01 ekurtulus

Yep, I'm Yozh there, happy to chat

justheuristic avatar Feb 01 '23 09:02 justheuristic