
feat: merge-lora iterate through bins without loading

Open · ved1beta opened this issue 4 months ago • 9 comments

Description

Feature 1: the merge-lora script never loads the full model into memory. It iterates through each of the .bin or .safetensors shards and applies the LoRA to each module as needed, which is far more memory-efficient than the standard approach (a minimal sketch follows below).

• New file `lora_merge_efficient` containing the core implementation
• New parameter `merge_method`: `standard` / `memory_efficient`
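For concreteness, here is a minimal sketch of the shard-wise idea. This is not the PR's implementation; the function name, key scheme, and layout are all illustrative, with `scale` being the usual PEFT factor `lora_alpha / r` read from the adapter config:

```python
# Illustrative sketch only -- names and structure are assumptions, not the PR's code.
import glob
import os

import torch
from safetensors.torch import load_file, save_file


def merge_shards(model_dir: str, lora_weights: dict, scale: float, out_dir: str) -> None:
    """Merge LoRA deltas into each .safetensors shard, one shard in memory at a time."""
    os.makedirs(out_dir, exist_ok=True)
    for shard_path in sorted(glob.glob(os.path.join(model_dir, "*.safetensors"))):
        tensors = load_file(shard_path)  # only this shard is resident in memory
        for name, weight in tensors.items():
            # hypothetical key scheme: base "foo.weight" pairs with the adapter's
            # "foo.lora_A.weight" / "foo.lora_B.weight"
            prefix = name.removesuffix(".weight")
            lora_a = lora_weights.get(f"{prefix}.lora_A.weight")
            lora_b = lora_weights.get(f"{prefix}.lora_B.weight")
            if lora_a is not None and lora_b is not None:
                delta = scale * (lora_b.float() @ lora_a.float())
                tensors[name] = (weight.float() + delta).to(weight.dtype)
        save_file(tensors, os.path.join(out_dir, os.path.basename(shard_path)))
        del tensors  # release this shard before loading the next
```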

Motivation and Context

#1679

References

qlora-pipe/tools/merge_lora.py

Tests

Tested with examples/llama-3/qlora-1b.yml using TinyLlama 1B Instruct and `merge_method: memory_efficient`.
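Based on the schema change in this PR, the option would presumably be set in the YAML config like so (field placement assumed; the qlora-1b.yml example itself is not reproduced here):

```yaml
# hypothetical excerpt -- assumed usage of the new config field added by this PR
merge_method: memory_efficient  # or "legacy" to force the in-memory merge
```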

Summary by CodeRabbit

  • New Features

    • Adds a memory-efficient LoRA merge that processes model shards without loading the full model; includes a legacy in-memory merge fallback when needed.
  • Chores

    • Configurable merge method (default: memory_efficient), improved logging (method choice and per-shard progress), clearer CLI messaging, and safer merged output handling.
  • Documentation

    • Updated config schema and docstrings to describe both merge strategies; public API unchanged.

ved1beta · Aug 21 '25 19:08

[!IMPORTANT]

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Adds a shard-wise, memory-efficient LoRA merging utility and integrates it into the CLI with a dispatch that prefers the memory-efficient method (default) and falls back to the legacy in-memory merge on RuntimeError; also adds a merge_method PEFT config field defaulting to "memory_efficient".
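In rough terms, the dispatch could look like the following sketch (the helper names come from the walkthrough above; the body is otherwise assumed):

```python
# Sketch of the dispatch-with-fallback flow; not the PR's exact code.
import logging

LOG = logging.getLogger(__name__)


def _do_merge_lora_efficient(cfg):
    ...  # shard-wise merge (see src/axolotl/utils/lora_merge_efficient.py)


def _do_merge_lora_legacy(cfg):
    ...  # original in-memory merge


def do_merge_lora(cfg):
    merge_method = getattr(cfg, "merge_method", "memory_efficient")
    LOG.info("Merging LoRA with method: %s", merge_method)
    if merge_method == "memory_efficient":
        try:
            return _do_merge_lora_efficient(cfg)
        except RuntimeError as err:
            LOG.warning("Efficient merge failed (%s); falling back to legacy", err)
    return _do_merge_lora_legacy(cfg)
```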

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| CLI merge dispatch & helpers<br>`src/axolotl/cli/merge_lora.py` | Add `merge_method` handling (default `"memory_efficient"`); log the chosen method; import `merge_lora_sharded_efficient`; implement `_do_merge_lora_legacy` (in-memory) and `_do_merge_lora_efficient` (shard-wise); update `do_merge_lora` to dispatch with a `RuntimeError` fallback to legacy; adjust CLI messages and docstring. |
| Memory-efficient LoRA merge utility<br>`src/axolotl/utils/lora_merge_efficient.py` | New module implementing `get_model_shards`, `find_lora_weights`, `copy_non_model_files`, and `merge_lora_sharded_efficient`. Supports `.safetensors` and `.bin` shards, reads the adapter config to compute the scale, applies per-shard LoRA deltas without loading the full model, preserves safetensors metadata when possible, copies non-model files, and performs per-shard memory cleanup and logging. |
| PEFT schema update<br>`src/axolotl/utils/schemas/peft.py` | Add `merge_method: Literal["legacy", "memory_efficient"]` config field, defaulting to `"memory_efficient"`. |
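The scale that `merge_lora_sharded_efficient` reads from the adapter config is presumably the standard PEFT scaling factor:

```python
# assumed: standard PEFT scaling, delta_W = scale * (B @ A)
import json

with open("adapter_config.json") as f:  # path relative to the LoRA adapter dir
    adapter_cfg = json.load(f)

scale = adapter_cfg["lora_alpha"] / adapter_cfg["r"]
```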

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


coderabbitai[bot] · Aug 21 '25 19:08

Codecov Report

❌ Patch coverage is 14.81481% with 138 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| `src/axolotl/cli/utils/lora_merge.py` | 12.58% | 125 Missing ⚠️ |
| `src/axolotl/cli/merge_lora.py` | 23.52% | 13 Missing ⚠️ |


codecov[bot] · Aug 22 '25 11:08

Curious if you have any numbers on how much peak VRAM is saved?

djsaunde · Aug 22 '25 13:08

Benchmarks coming soon.

ved1beta · Aug 22 '25 15:08

@ved1beta could you also ensure the weights/logits produced by a model which was merged using the legacy vs. memory efficient method are identical?

SalmanMohammadi · Aug 22 '25 15:08

This should be ensured by the test run?

> Tested with examples/llama-3/qlora-1b.yml using TinyLlama 1B Instruct and `merge_method: memory_efficient`

ved1beta · Aug 22 '25 17:08

Were you able to train a lora, and then merge using both the legacy and memory efficient methods to verify identical merged weights from both methods?

winglian · Aug 28 '25 16:08

Yes, tried merging and everything as you mentioned earlier. Here is the [training output](https://ai-axolotl.slack.com/files/U09BE3G7ZED/F09BNKLDDNZ/untitled?origin_team=T05A3MTMVB8&origin_channel=D09BE3HMM7B).

I have a Claude-generated script for testing that the model weights are identical; it passes for the checkpoint generated from that training run.
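For reference, a minimal version of such a check could look like this (a sketch, not the script from the thread; directory names are placeholders):

```python
# Sketch: verify that two merged checkpoints contain identical weights.
import glob
import os

import torch
from safetensors.torch import load_file


def checkpoints_match(dir_a: str, dir_b: str) -> bool:
    for shard in sorted(glob.glob(os.path.join(dir_a, "*.safetensors"))):
        a = load_file(shard)
        b = load_file(os.path.join(dir_b, os.path.basename(shard)))
        if a.keys() != b.keys():
            return False
        for key in a:
            if not torch.equal(a[key], b[key]):  # exact equality, no tolerance
                return False
    return True


print(checkpoints_match("merged-legacy", "merged-memory-efficient"))
```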

ved1beta · Aug 29 '25 08:08

Memory usage for both merge methods, measured with a simple test script:

Memory-Efficient Method
• Peak GPU Memory: 300 MB
• Peak CPU Memory: 14.4 MB
• Execution Time: 12.0 seconds

Legacy Method
• Peak GPU Memory: 2,914 MB
• Peak CPU Memory: 14.4 MB
• Execution Time: 15.9 seconds
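One simple way to capture numbers like these (not necessarily the script used above; `run_merge` is a placeholder for whichever merge is being benchmarked):

```python
# Sketch: measure peak GPU memory, peak CPU RSS, and wall time around a merge call.
import resource
import time

import torch


def run_merge():
    ...  # placeholder: invoke the legacy or memory-efficient merge here


torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
run_merge()
elapsed = time.perf_counter() - start

peak_gpu_mb = torch.cuda.max_memory_allocated() / 1024**2
peak_cpu_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024  # ru_maxrss is KiB on Linux
print(f"peak GPU: {peak_gpu_mb:.0f} MB | peak CPU: {peak_cpu_mb:.1f} MB | {elapsed:.1f} s")
```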

ved1beta · Sep 07 '25 08:09