
Adding LLM support with PEFT + quantization (4-bit and 8-bit)

Open rajithkrishnegowda opened this issue 8 months ago • 5 comments

Summary

Add LLM support with PEFT + quantization (4-bit and 8-bit), using several open-source frameworks to improve local training with PEFT, referencing FlowerLLM methods.

Type of Change (Mandatory)

Specify the type of change being made.

  • Feature enhancement

Description (Mandatory)

This PR adds a new Jupyter notebook that demonstrates how to fine-tune Microsoft's Phi-4 model in a federated learning workflow using OpenFL. The notebook implements and compares both 4-bit and 8-bit quantization techniques along with Parameter-Efficient Fine-Tuning (PEFT) methods.

Key features of this notebook:

  • Implements federated fine-tuning of Phi-4 using OpenFL's experimental workflow interface
  • Applies QLoRA (Quantized Low-Rank Adaptation) for memory-efficient training
  • Provides comparative analysis between 4-bit and 8-bit quantization in terms of:
      • Memory usage across training phases
      • Training and evaluation loss metrics
      • Peak memory utilization
  • Includes visualization tools to analyze performance differences
  • Demonstrates gradient checkpointing and other memory optimization techniques
  • Uses the math_10k.json dataset for fine-tuning

This notebook is intended for the experimental workflow section of OpenFL tutorials, helping users understand the trade-offs between different quantization approaches when fine-tuning large language models in federated settings.
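For orientation, here is a minimal sketch of how 4-bit (and, with one flag change, 8-bit) quantized loading plus LoRA adapters is typically set up with transformers, bitsandbytes, and peft; the model checkpoint name and LoRA hyperparameters below are illustrative and may differ from the notebook:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_NAME = "microsoft/phi-4"  # illustrative; use the checkpoint from the notebook

# 4-bit (QLoRA-style) configuration; for the 8-bit run, use
# BitsAndBytesConfig(load_in_8bit=True) instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training and attach LoRA adapters.
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()  # memory optimization mentioned above

lora_config = LoraConfig(
    r=16,            # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    # Adjust to the actual projection module names of the chosen checkpoint.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```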

Testing

The notebook has been tested on a system with CUDA support. All cells execute successfully, and both 4-bit and 8-bit quantization workflows complete the federated training process. The notebook generates visualizations comparing memory usage and performance metrics between the two quantization approaches.

https://github.com/rajithkrishnegowda/openfl/blob/rajith-llm-quanti/openfl-tutorials/experimental/workflow/LLM/phi-4-peft-quantization.ipynb

rajithkrishnegowda avatar May 16 '25 05:05 rajithkrishnegowda

@rajithkrishnegowda Great work. Can you also put metrics and plots for the baseline model? I don't think I see it currently. It would be really good to see the comparison.

rahulga1 avatar May 16 '25 10:05 rahulga1

@rajithkrishnegowda Great work. Can you also put metrics and plots for the baseline model? I don't think I see it currently. It would be really good to see the comparison.

Added in the latest commit.

rajithkrishnegowda avatar May 16 '25 10:05 rajithkrishnegowda

General comment on the training output cells: there is too much detail in these cells, which, as a reader, doesn't tell me what the goal of this notebook is or why it matters. As noted in previous reviews, notebooks are a hook for users to gain familiarity with the objectives and outcomes of the framework without running any code.

Therefore, an ideal notebook:

  • Has to be short in length, although this can be waived when the tutorial itself covers many concepts.
  • Defines objectives early on (this is taken care of in this notebook).
  • Each cell must not exceed one scroll's worth of content, i.e. 15-20 lines, so it fits the reader's context window. If it does, consider breaking up the cell.
  • Output must be brief, so the reader knows what to expect in the output cell (as if they were executing the notebook in their mind).
  • The conclusion must demonstrate that the objectives defined at the beginning are met, through small, easy-to-parse plots.

Specific suggestions:

  1. Please shrink/reformat/reconsider how much output text is presented.
  2. Please run the experiment for at least 10 rounds to demonstrate a trendline. 2 rounds is not sufficient.
Responses:

  1. The 4-bit and 8-bit runs are in separate cells, but the output of those cells can't be restricted via Visual Studio Code settings when we commit. Please let me know if you have any ideas.
  2. As per Kevin's comment above, since each round is a full epoch it may not make sense to run many more rounds, but I still ran 5 rounds.

rajithkrishnegowda avatar May 21 '25 02:05 rajithkrishnegowda

Additionally, as discussed offline, I'd suggest excluding self.model from the state transferred between the FL flow steps and transmitting only the PEFT/LoRA parameters instead. In fact, the rest of the model is supposed to be "frozen" during fine-tuning anyway, which is one more reason not to transmit it in its entirety back and forth between the OpenFL nodes.

This will reduce unnecessary large-object cloning within the LocalRuntime, but more importantly, it will enable running the notebook with FederatedRuntime, and also via the TaskRunner API, where we have a 2 GB limit on the protobuf message size.
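A rough sketch of the adapter-only exchange using peft's state-dict helpers (the function names and the simple averaging step here are illustrative, not the notebook's actual code):

```python
import torch
from peft import get_peft_model_state_dict, set_peft_model_state_dict


def extract_adapter_weights(peft_model) -> dict:
    """Return only the LoRA/PEFT adapter tensors, detached and moved to CPU."""
    return {
        name: tensor.detach().cpu().clone()
        for name, tensor in get_peft_model_state_dict(peft_model).items()
    }


def average_adapter_weights(states: list[dict]) -> dict:
    """Simple FedAvg over the adapter-only state dicts from each collaborator."""
    keys = states[0].keys()
    return {k: torch.stack([s[k].float() for s in states]).mean(dim=0) for k in keys}


def load_adapter_weights(peft_model, adapter_state: dict) -> None:
    """Push aggregated adapter weights back into a local PEFT model."""
    set_peft_model_state_dict(peft_model, adapter_state)
```

Within the flow itself, the collaborator step would then place only this adapter dict on `self` and keep `self.model` out of the transferred state, for instance via the `exclude` argument of `next()` in the workflow interface (assuming attribute filtering is available there).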

teoparvanov avatar May 21 '25 14:05 teoparvanov

Great work on the PR! I have a few suggestions:

SFTTrainer in Cell 18: It seems like the SFTTrainer might not be necessary. Could you check if we can remove it to simplify the code?

Unused Imports: I noticed some unused imports in cells 8 and 23. Cleaning these up could improve the code's readability.

Optimizer Initialization: There are two sections for initializing the optimizer. Is the second one required, or can we consolidate them to avoid redundancy?

Memory Optimization: Since you're using LocalRuntime, you might reduce memory usage by sending the model to the CPU before starting the collaborator steps. Loading it into the GPU for these steps and then sending it back to the CPU for the join step could be more efficient. Without that, you are basically holding three models in memory at the same time (one for the aggregator and two for the collaborators).
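A rough sketch of that pattern (names are illustrative; note that 4-bit/8-bit bitsandbytes models restrict `.to()` device moves, so this mainly applies where a full-precision copy is held and may need adapting):

```python
import torch


def train_on_gpu(model, train_fn):
    """Run local training on the GPU, then park the model back on the CPU.

    Keeping models on the CPU outside of their own training step means only
    one collaborator's copy occupies GPU memory at a time under LocalRuntime.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    try:
        train_fn(model)      # the collaborator's local fine-tuning loop
    finally:
        model.to("cpu")      # hand back to the CPU before the join step
        if device == "cuda":
            torch.cuda.empty_cache()
    return model
```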

Let me know what you think!

porteratzo avatar May 21 '25 19:05 porteratzo