
Comparison of Different Fine-Tuning Techniques for Conversational AI

Open ImamaDev opened this issue 1 year ago • 42 comments

Feature request

It would be incredibly helpful to have a clear comparison or support for various fine-tuning techniques specifically for conversational AI. This feature could include insights into their strengths, limitations, and ideal use cases, helping practitioners choose the right approach for their needs.

Here’s a list of techniques to consider:

  - LoRA
  - AdaLoRA
  - Bone
  - VeRA
  - X-LoRA
  - LN Tuning
  - VB-LoRA
  - HRA (Householder Reflection Adaptation)
  - IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations)
  - Llama-Adapter
  - CPT (Context-aware Prompt Tuning), etc.

Motivation

With the growing number of fine-tuning techniques for conversational AI, it can be challenging to identify the most suitable approach for specific use cases. A comprehensive comparison of these techniques—highlighting their strengths, limitations, and ideal scenarios—would save time, reduce trial-and-error, and empower users to make informed decisions. This feature would bridge the gap between research and practical application, enabling more effective model customization and deployment.

Your contribution

I’d be happy to collaborate on this! While I might not have a complete solution right now, I’m willing to contribute by gathering resources, reviewing papers, or helping organize comparisons. If others are interested in teaming up, we could work together on a PR to make this feature happen. Let’s connect and brainstorm how we can tackle this effectively!

ImamaDev avatar Jan 07 '25 07:01 ImamaDev

Thanks for coming up with this proposal. Indeed, this is something we have had on our backlog for a long time. As you can imagine, providing objective and useful information on this is a huge undertaking, since relying on paper results alone can often be problematic.

As a long-term project, we plan to provide some kind of benchmark that compares all these methods in terms of runtime, memory usage, performance, etc., but I can't give a concrete date yet.

In the meantime, we have started to be more rigorous when new methods are added, requiring a clear description of their best use cases. There is still a lot of room for improvement, especially when it comes to methods that were added some time ago.

If you (and others) want to contribute, I think a good place to start would be to go through the individual methods in the PEFT docs and help improve the descriptions. If we can make them more uniform, with more details on the best use cases, pros, and cons, this would already be a nice improvement.

There are other places that could benefit from such a clean up, e.g. the description of all the LoRA initialization methods.

BenjaminBossan avatar Jan 07 '25 10:01 BenjaminBossan

I would be interested in contributing as well

sparsh2 avatar Jan 07 '25 22:01 sparsh2

I would be interested in contributing as well

Thanks for the offer. As mentioned, as a first step, we could use some help with updating the "blurbs" of the PEFT methods. For this, it's often sufficient to read a couple of sections from the paper. If anyone wants to work on one such method, please announce it here so that there is no duplicate work.

BenjaminBossan avatar Jan 08 '25 09:01 BenjaminBossan

How about having a sample fine-tuning script for each method and comparing different approaches for different tasks?

imcoza avatar Jan 15 '25 15:01 imcoza

How about having a sample fine-tuning script for each method and comparing different approaches for different tasks?

I'm not 100% sure what you mean, but let's start with a single task and then we can expand from there. We haven't come up with such a task yet, but we have some criteria:

  1. It should be a task that is supported by all methods (most likely language model fine-tuning)
  2. The task should be kinda realistic and practical
  3. The task should not take too long to run and should not require expensive hardware
  4. Training code should be easy to adapt for real training (it should serve as a usable example; see the sketch below)

Maybe we can find something from the trl examples that we can adopt.
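
For illustration, here is a rough sketch of what such a training script could look like, based on trl's SFTTrainer. The model, dataset, and hyperparameters are placeholders, and the exact argument names depend on the trl version, so treat this as an assumption rather than the final setup:

```python
# Minimal sketch of a PEFT fine-tuning script built on trl's SFTTrainer.
# Model name, dataset, and hyperparameters are placeholders, not the benchmark setup.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # example dataset only

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # typical attention projections for Llama-style models
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-1B",  # placeholder model
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="outputs", max_steps=500),
)
trainer.train()
```

The idea would be that swapping peft_config for another method's config is the main change needed to switch techniques.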

BenjaminBossan avatar Jan 16 '25 11:01 BenjaminBossan

Hello @BenjaminBossan, @ImamaDev.

I hope this message finds you well. I am interested in contributing to this PR and would like your insights on the following:

  1. "It should be a task that is supported by all methods (most likely language model fine-tuning)" - Could you please suggest suitable tasks where this comparison would be meaningful? A few options are: a) question-answering task (short input and output tokens) b) summarization task (long input and short output) c) code-generation (short input and long output). It would be great to know any preferred task / dataset to start with.
  2. Could you please let me know if there are any other metrics (besides runtime, memory usage, and performance) that would be of interest to the community?

Thank you.

PS: I found a paper addressing the same issue: https://arxiv.org/pdf/2312.12148

sirish-gambhira avatar Feb 15 '25 01:02 sirish-gambhira

Sure @sirish-gambhira, let's connect over Discord to discuss this further. My username is imcoza.

imcoza avatar Feb 17 '25 05:02 imcoza

Thanks for the interest. We plan to do a mini sprint next week to make progress on this task. We will share our findings and point out where the community could help contribute.

PS: I found a paper addressing the same issue: https://arxiv.org/pdf/2312.12148

Note that our goal is a bit different. We don't want to have a comprehensive literature overview of various methods on numerous datasets -- we wouldn't have the resources for that. Our goal is less academic and more practical: giving PEFT users a simple starting guide to choose the method that is most likely to suit their needs.

BenjaminBossan avatar Feb 17 '25 10:02 BenjaminBossan

That would be great. I conducted a comparative study evaluating AdaLoRA, IA3, and LoRA for text generation tasks using a multi-step conversational dataset. The results indicate that IA3 outperforms the other methods, demonstrating superior efficiency by achieving convergence in fewer fine-tuning steps. Looking forward to seeing your findings.

imcoza avatar Feb 17 '25 10:02 imcoza

That would be great. I conducted a comparative study evaluating AdaLoRA, IA3, and LoRA for text generation tasks using a multi-step conversational dataset. The results indicate that IA3 outperforms the other methods, demonstrating superior efficiency by achieving convergence in fewer fine-tuning steps. Looking forward to seeing your findings.

Nice to hear that. I personally believe that users should also give other methods a chance instead of focusing only on LoRA. If you have anything to share, feel free to do so.

BenjaminBossan avatar Feb 17 '25 12:02 BenjaminBossan

That would be great. I conducted a comparative study evaluating AdaLoRA, IA3, and LoRA for text generation tasks using a multi-step conversational dataset. The results indicate that IA3 outperforms the other methods, demonstrating superior efficiency by achieving convergence in fewer fine-tuning steps. Looking forward to seeing your findings.

Nice to hear that. I personally believe that users should also give other methods a chance instead of focusing only on LoRA. If you have anything to share, feel free to do so.

Based on my comparative study, I fine-tuned Llama 3.1 8B Instruct on Hindi multi-turn dialogue data using LoRA, IA3, and AdaLoRA to evaluate their performance. IA3 demonstrated smoother and more stable convergence, making it well suited for resource-constrained environments. It maintained a smaller generalization gap, effectively reducing the risk of overfitting. In contrast, AdaLoRA exhibited an aggressive initial optimization phase, causing a sharp drop in loss, but ultimately achieved a lower validation loss, indicating superior generalization. However, this rapid convergence introduces higher variance, necessitating careful learning rate tuning. If stability and robustness are the primary concerns, IA3 is the better choice. But if the objective is to maximize generalization and long-term adaptability, AdaLoRA emerges as the superior method, provided there is sufficient training time and room for fine-tuning adjustments.
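
For anyone who wants to set up a similar comparison, here is a minimal sketch of the three adapter configurations in PEFT. The module names assume a Llama-style architecture and the hyperparameters are illustrative, not the exact values from the study above:

```python
# Sketch of the three adapter configurations compared above, for a Llama-style model.
# Hyperparameters are illustrative, not the exact values used in the study.
from peft import AdaLoraConfig, IA3Config, LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

ia3_config = IA3Config(
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],  # feedforward_modules must be a subset of target_modules
    task_type="CAUSAL_LM",
)

adalora_config = AdaLoraConfig(
    init_r=12,
    target_r=8,
    tinit=200,        # steps before rank pruning starts
    tfinal=500,       # steps of final fine-tuning after pruning ends
    total_step=2000,  # total training steps, needed for the rank schedule
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```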

imcoza avatar Mar 03 '25 05:03 imcoza

Thanks for sharing your findings @imcoza. Are the code and the results publicly available?

We're still working on the method comparison framework, and progress is looking good so far. You can check, for instance, this PR.

BenjaminBossan avatar Mar 03 '25 10:03 BenjaminBossan

@BenjaminBossan Hi! Is this issue still open? I would love to contribute, especially to improving the documentation for PEFT. You had mentioned adding details on the best use cases, pros, and cons. Is it possible for me to work on this?

laurenf3395 avatar Mar 08 '25 02:03 laurenf3395

I would love to contribute, especially to improving the documentation for PEFT. You had mentioned adding details on the best use cases, pros, and cons. Is it possible for me to work on this?

Yes, for sure, thanks for offering. If you want to work on this, we're happy to accept contributions.

BenjaminBossan avatar Mar 10 '25 10:03 BenjaminBossan

Good news everyone, we made big progress on this project by merging #2395. This adds a PEFT method comparison suite that should allow us to compare different PEFT methods in an objective and replicable way, log and store the results, and present them so that PEFT users can make informed decisions about which methods they want to use.

The framework for this is set up, starting with a task that involves training on the MetaMathQA dataset and evaluating on GSM8K. We only have preliminary results at the moment, as we need the right hardware setup for final results, but this already looks promising. Here is a sneak peek at the included Gradio app:

[Screenshot of the Gradio app]

To find out more about this specific task, check the task README. If you want to know how you could contribute, we wrote down a contribution guide too.

As mentioned, this is not finished yet. Some steps we have yet to take:

  1. Generate the final results, probably using some cloud setup and task scheduling scripts.
  2. Create an automatic deployment of the Gradio app.
  3. Add more experiment settings for the different PEFT methods.

Let us know what you think about this, if you have any suggestions to improve the framework, or if you're interested in contributing.

PS: The points mentioned above, like the need for better documentation, are still valid.

BenjaminBossan avatar Mar 27 '25 16:03 BenjaminBossan

Dear Sir @BenjaminBossan, may I know where we can find the results shown in the first figure? Which file saves the metrics?

cyaaronk avatar Apr 14 '25 08:04 cyaaronk

@cyaaronk The results are not shared yet. We're working on generating them in a consistent fashion. Once that's done, we plan to deploy a Hugging Face space for everyone to inspect the results.

BenjaminBossan avatar Apr 14 '25 09:04 BenjaminBossan

@BenjaminBossan Thank you for the hard work and effort! I wonder if some initial results could be released first, as I am already testing some methods but am not sure if my results are correct.

cyaaronk avatar Apr 15 '25 03:04 cyaaronk

@cyaaronk We don't want to share the results prematurely in case we notice some errors. But it's safe to say that LoRA is still a good choice in terms of memory efficiency and performance. On top of that, it's the most feature-complete (e.g. supporting many quantization methods). If you're installing PEFT from main and are using AdamW, give LoRA-FA a try (increase the rank r compared to normal LoRA). Apart from LoRA, our tests also show very good results for Bone in terms of memory and performance.
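
To make this concrete, here is a minimal sketch of what such configurations could look like in PEFT. The model, ranks, and target modules are placeholders to tune for your own setup, and LoRA-FA itself is an optimizer-level variant (it freezes the LoRA A matrices), so check the docs for its exact entry point:

```python
# Sketch of LoRA and Bone configurations along the lines suggested above.
# Ranks and target modules are placeholders; adjust them for your model and budget.
from peft import BoneConfig, LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B"  # placeholder model

# Plain LoRA; when pairing with the LoRA-FA optimizer, a higher rank than usual is suggested above.
lora_config = LoraConfig(r=64, lora_alpha=128, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

# Bone with default-style settings.
bone_config = BoneConfig(r=64, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

base_model = AutoModelForCausalLM.from_pretrained(model_id)
peft_model = get_peft_model(base_model, lora_config)  # or bone_config
peft_model.print_trainable_parameters()
```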

BenjaminBossan avatar Apr 15 '25 11:04 BenjaminBossan

Hi everyone, thanks for your patience. We finally ran all PEFT methods through the MetaMathQA experiment 🚀 🎉

The results can be found here. We're still actively working on this, so expect updates soon, but these results should be robust. We also provided a Gradio app to analyze the results; find the instructions to start it locally here. We will create an HF Space for this soon.

Now comes the part where we need the help of the community: Right now, most PEFT methods besides LoRA were run with the default configuration. These may or may not be the best settings. If you have a suggestion for better settings for those PEFT methods, please open a PR and contribute your own experiment. We provided instructions on how to do that.
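
As a rough illustration of what an experiment contribution boils down to (the exact files and directory layout the suite expects are described in the contribution guide, so treat the path below as hypothetical): each experiment is essentially a serialized PEFT config, which can be written out with save_pretrained:

```python
# Hypothetical example: serialize a modified config so it can be added as a new experiment.
# The folder layout expected by the comparison suite is described in its contribution guide.
from peft import LoraConfig

config = LoraConfig(
    r=32,             # changed from the default as the "optimized" setting
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
config.save_pretrained("experiments/lora/my-optimized-setting")  # writes adapter_config.json
```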

For now, we aim to have ~2 experiments per PEFT method, one with the default settings and one with more optimized settings. Let's not go overboard with dozens of settings, which are expensive to run and may result in "overfitting to the test set". I hope to see many contributions!

BenjaminBossan avatar Jun 19 '25 16:06 BenjaminBossan

Small update, we have deployed a Gradio app that allows everyone to easily inspect the results:

https://huggingface.co/spaces/peft-internal-testing/PEFT-method-comparison

BenjaminBossan avatar Jul 01 '25 14:07 BenjaminBossan

Hello @BenjaminBossan,

I’ve been dedicating my recent research to LoRA and its variants, and I'm eager to make my first open-source contribution right here in the PEFT project.

Since your last update three months ago, I wanted to check in: What specific areas of the PEFT method comparison suite or documentation improvement are most urgently in need of community contributions right now?

My background could allow me to focus on:

  1. Optimization Configurations: Providing optimized hyperparameter settings for a non-LoRA method (e.g., IA3, Bone, AdaLoRA) to boost its performance on the MetaMathQA benchmark.
  2. Documentation Enhancement: Deeply reviewing and improving the documentation 'blurbs' for a specific method to clearly outline its pros, cons, and ideal use cases.

Could you please let me know which specific method or task would be the best place for me to start? I look forward to your guidance!

yuhongwang-xd avatar Oct 08 '25 13:10 yuhongwang-xd

Hey @yuhongwang-xd, thanks for the offer. Regarding your proposals:

  1. Right now, for most PEFT methods, we only have a single experimental setting, which most of the time corresponds to the default values. I think for a fair comparison, we can generally aim for a second setting with more optimized hyper-parameters. I wouldn't go overboard with hyper-parameter tuning, as that runs the risk of overfitting on the test set. But if you are willing to assign a certain, capped budget per PEFT method to find better hyper-parameters, it would still be quite fair.
  2. That would also be valuable. In more recent PRs, we try to focus a bit more on making this information salient, but for most methods, there is room for improvement. As a reader of the docs, I would currently still find it quite hard to choose the best method for my use case.

If you want to contribute in either of these two areas, I would suggest starting with a single PEFT method to see how it goes.

BenjaminBossan avatar Oct 08 '25 14:10 BenjaminBossan

Hello @BenjaminBossan,

Thanks for your guidance regarding contributing to PEFT. I’d like to start with the hyperparameter optimization–related experimental supplement task.

My understanding is that the goal is not necessarily to find configurations that outperform the defaults, but to enrich the comparison table by testing additional hyperparameter combinations for a single PEFT method (e.g., AdaLoRA, LoHA). This would make it easier to see how changes in key parameters (like r) impact metrics such as test accuracy and accelerator memory.

I have prior experience experimenting with different parameter settings, which allows me to efficiently explore new configurations without unnecessary full-scale searches. Once this experimental supplement is complete, I’m also happy to help improve the documentation using insights from the experiments.

Could you confirm if this plan aligns with current project needs? Also, would you recommend creating a dedicated issue for this work, or should I proceed directly with a PR once I have results?

yuhongwang-xd avatar Oct 08 '25 15:10 yuhongwang-xd

My understanding is that the goal is not necessarily to find configurations that outperform the defaults, but to enrich the comparison table by testing additional hyperparameter combinations for a single PEFT method (e.g., AdaLoRA, LoHA). This would make it easier to see how changes in key parameters (like r) impact metrics such as test accuracy and accelerator memory.

A big focus should actually be to improve the score (the test accuracy). However, it shouldn't come at the cost of too high memory usage or compute. So just increasing the rank could trivially increase performance, but most users would rather have a nice trade-off. Therefore, I'd try to increase the test accuracy while maintaining reasonable memory and compute/runtime profiles.

Of course, this is all a bit subjective and may depend on the method. Just as an example, one of the selling points of VeRA is that it is especially parameter-efficient. Therefore, when testing this method, I'd focus on keeping a low count of trainable parameters more so than when testing other PEFT methods.
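
As a quick way to sanity-check that trade-off before committing to a full run, one can compare trainable parameter counts across candidate configs. A minimal sketch, where the model id and ranks are placeholders:

```python
# Sketch: compare trainable parameter counts of candidate configs before running full experiments.
# Model id and ranks are placeholders.
from peft import LoHaConfig, LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B"  # placeholder model

candidates = {
    "lora_r16": LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
    "loha_r16": LoHaConfig(r=16, alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
    "loha_r32": LoHaConfig(r=32, alpha=64, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
}

for name, config in candidates.items():
    model = AutoModelForCausalLM.from_pretrained(model_id)  # fresh copy per config
    peft_model = get_peft_model(model, config)
    trainable, total = peft_model.get_nb_trainable_parameters()
    print(f"{name}: {trainable:,} trainable / {total:,} total parameters")
```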

I have prior experience experimenting with different parameter settings, which allows me to efficiently explore new configurations without unnecessary full-scale searches.

I think it would also be beneficial if you could briefly explain your approach. This way, other users who may want to try something similar can build on your description. For now, you could add the description to the PR once it's ready.

would you recommend creating a dedicated issue for this work, or should I proceed directly with a PR once I have results?

Once you have results, just go ahead and create a PR; I see no need for a separate issue. Remember that we will do the final run on our machine to keep the comparison fair, but you can still attach your experimental logs so that we know what to expect.

BenjaminBossan avatar Oct 08 '25 15:10 BenjaminBossan

Thanks, @BenjaminBossan ! Understood. I’ll start with AdaLoRA and aim to improve accuracy while keeping memory and runtime reasonable. Once I have results and logs ready, I’ll open a PR with a short description of the optimization approach.

yuhongwang-xd avatar Oct 09 '25 07:10 yuhongwang-xd

Great, thanks @yuhongwang-xd. Just a note, AdaLoRA might actually be quite hard to optimize because it is a bit special with its different training phases. Maybe you can start with another PEFT method (say, LoHa or LoKr) and save AdaLoRA for later.

BenjaminBossan avatar Oct 09 '25 09:10 BenjaminBossan

@BenjaminBossan Thanks a lot for the suggestion! I’ll start with LoHa to get things running smoothly, and once I have some stable results, I’ll explore AdaLoRA and other PEFT methods.

yuhongwang-xd avatar Oct 09 '25 13:10 yuhongwang-xd

@BenjaminBossan I would also like to contribute. Which one can I take?

rp440 avatar Oct 23 '25 00:10 rp440

@rp440 Are there any specific topics you're interested in? Maybe you can coordinate with @yuhongwang-xd on this.

As a more general note, following this blog post, I did some experiments (#2845) with LoRA targeting the MLP layers instead of the attention layers (which are the default in PEFT for most models), and this did indeed increase performance, but at the cost of higher memory. Still, it could be worth exploring whether other PEFT methods benefit equally.
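
For anyone who wants to try this with another method, the change essentially comes down to the target_modules. A minimal sketch for Llama-style module names (other architectures use different names, and the exact settings used in #2845 may differ):

```python
# Sketch: targeting attention projections vs. MLP layers, using Llama-style module names.
# Other architectures use different module names, so adjust target_modules accordingly.
from peft import LoraConfig

# Default-style setup: attention projections.
attn_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# The kind of variant discussed above: MLP layers instead.
mlp_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```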

BenjaminBossan avatar Oct 23 '25 09:10 BenjaminBossan