feat: merge-lora iterates through bins without loading the full model
Description
Feature 1: The merge-lora script no longer loads the full model into memory at all. It iterates through each `.bin` or `.safetensors` shard and applies the LoRA weights to each module as needed, which is far more memory-efficient than the standard approach (see the sketch below).
- New file `lora_merge_efficient` with the core implementation.
- New parameter `merge_method`: standard (legacy) / memory_efficient.
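A minimal sketch of the shard-wise idea, assuming `.safetensors` shards; the helper name `merge_shard` and the exact delta application are illustrative, not the actual implementation in `lora_merge_efficient`:

```python
# Sketch: load one shard at a time, add the LoRA delta (scale * B @ A) to any
# weight that has a matching adapter pair, save the shard, and free memory.
# `merge_shard` and the lora_weights layout are illustrative assumptions.
import gc

import torch
from safetensors.torch import load_file, save_file


def merge_shard(shard_path, out_path, lora_weights, scale):
    """lora_weights maps module name -> (lora_A, lora_B); scale = lora_alpha / r."""
    shard = load_file(shard_path)  # only this shard is resident in memory
    for name, weight in shard.items():
        key = name.removesuffix(".weight")
        if key in lora_weights:
            lora_a, lora_b = lora_weights[key]  # A: (r, in), B: (out, r)
            delta = scale * (lora_b.to(torch.float32) @ lora_a.to(torch.float32))
            shard[name] = (weight.to(torch.float32) + delta).to(weight.dtype)
    save_file(shard, out_path)
    del shard
    gc.collect()  # per-shard cleanup keeps peak memory to roughly one shard
```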
Motivation and Context
#1679
References
qlora-pipe/tools/merge_lora.py
Tests
Tested with examples/llama-3/qlora-1b.yml using TinyLlama 1B Instruct and merge_method: memory_efficient.
Summary by CodeRabbit
- New Features
  - Adds a memory-efficient LoRA merge that processes model shards without loading the full model; includes a legacy in-memory merge fallback when needed.
- Chores
  - Configurable merge method (default: memory_efficient), improved logging (method choice and per-shard progress), clearer CLI messaging, and safer merged output handling.
- Documentation
  - Updated config schema and docstrings to describe both merge strategies; public API unchanged.
📝 Walkthrough
Adds a shard-wise, memory-efficient LoRA merging utility and integrates it into the CLI with a dispatch that prefers the memory-efficient method (default) and falls back to the legacy in-memory merge on RuntimeError; also adds a merge_method PEFT config field defaulting to "memory_efficient".
Changes
| Cohort / File(s) | Change Summary |
|---|---|
| CLI merge dispatch & helpers<br>`src/axolotl/cli/merge_lora.py` | Add `merge_method` handling (default `"memory_efficient"`); log chosen method; import `merge_lora_sharded_efficient`; implement `_do_merge_lora_legacy` (in-memory) and `_do_merge_lora_efficient` (shard-wise); update `do_merge_lora` to dispatch with a RuntimeError fallback to legacy; adjust CLI messages and docstring. |
| Memory-efficient LoRA merge utility<br>`src/axolotl/utils/lora_merge_efficient.py` | New module implementing `get_model_shards`, `find_lora_weights`, `copy_non_model_files`, and `merge_lora_sharded_efficient`. Supports `.safetensors` and `.bin` shards, reads the adapter config to compute the scale, applies per-shard LoRA deltas without loading the full model, preserves safetensors metadata when possible, copies non-model files, and performs per-shard memory cleanup and logging. |
| PEFT schema update<br>`src/axolotl/utils/schemas/peft.py` | Add `merge_method: Literal["legacy","memory_efficient"]` field defaulting to `"memory_efficient"`. |
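A rough sketch of the dispatch and fallback described in the table above; the function names follow the change summary, but the signatures and argument plumbing are simplified assumptions, not copied from the PR:

```python
# Sketch of the merge_method dispatch with a RuntimeError fallback to the
# legacy in-memory merge. _do_merge_lora_efficient / _do_merge_lora_legacy are
# the helpers named in the change summary, assumed to be defined alongside.
import logging

LOG = logging.getLogger(__name__)


def do_merge_lora(cfg, cli_args):
    method = getattr(cfg, "merge_method", "memory_efficient")
    LOG.info("Merging LoRA adapter using the '%s' method", method)
    if method == "memory_efficient":
        try:
            _do_merge_lora_efficient(cfg, cli_args)  # shard-wise merge
            return
        except RuntimeError as err:
            LOG.warning(
                "Shard-wise merge failed (%s); falling back to legacy merge", err
            )
    _do_merge_lora_legacy(cfg, cli_args)  # full in-memory merge
```

In a training config this surfaces as a single `merge_method: memory_efficient` (or `legacy`) entry, matching the schema row above.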
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Codecov Report
:x: Patch coverage is 14.81481% with 138 lines in your changes missing coverage. Please review.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/axolotl/cli/utils/lora_merge.py | 12.58% | 125 Missing :warning: |
| src/axolotl/cli/merge_lora.py | 23.52% | 13 Missing :warning: |
curious if you have any numbers on how much peak VRAM is saved?
benchmarks coming soon
@ved1beta could you also ensure the weights/logits produced by a model which was merged using the legacy vs. memory efficient method are identical?
This should be ensured by the test run?
Tested with examples/llama-3/qlora-1b.yml using TinyLlama 1B Instruct and merge_method: memory_efficient.
Were you able to train a lora, and then merge using both the legacy and memory efficient methods to verify identical merged weights from both methods?
Yes, I tried merging both ways as you mentioned earlier; here is the training output: [slack link]( https://ai-axolotl.slack.com/files/U09BE3G7ZED/F09BNKLDDNZ/untitled?origin_team=T05A3MTMVB8&origin_channel=D09BE3HMM7B )
I have a Claude-generated script that checks the merged model weights are identical; it passes for the checkpoint produced by that training run.
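Roughly, such a check can compare the two merged outputs tensor by tensor; the paths below are placeholders, not the actual script:

```python
# Sketch: verify that the legacy and memory-efficient merges produce identical
# weights. Paths are placeholders for the two merged output directories.
import torch
from safetensors.torch import load_file

legacy = load_file("merged_legacy/model.safetensors")
efficient = load_file("merged_efficient/model.safetensors")

assert legacy.keys() == efficient.keys(), "tensor names differ between merges"
for name in legacy:
    if not torch.equal(legacy[name], efficient[name]):
        diff = (legacy[name].float() - efficient[name].float()).abs().max().item()
        print(f"{name}: max abs diff {diff}")
```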
Memory usage for both merge methods, measured with a simple test script:
| Method | Peak GPU Memory | Peak CPU Memory | Execution Time |
|---|---|---|---|
| Memory-efficient | 300 MB | 14.4 MB | 12.0 s |
| Legacy | 2,914 MB | 14.4 MB | 15.9 s |
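For context, numbers like these can be gathered with a small wrapper around either merge path; the actual test script is not part of this PR, and `run_merge` below is a placeholder:

```python
# Sketch: measure peak GPU memory and wall-clock time for a merge run.
# `run_merge` is a placeholder callable wrapping either merge method;
# CPU peak memory would need e.g. psutil or resource and is omitted here.
import time

import torch


def benchmark(run_merge):
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    run_merge()
    elapsed = time.perf_counter() - start
    peak_gpu_mb = torch.cuda.max_memory_allocated() / 1024**2
    print(f"Peak GPU memory: {peak_gpu_mb:.0f} MB, time: {elapsed:.1f} s")
```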