feat: merge-lora iterates through bins without loading the full model
Description
Feature 1: The merge-lora script no longer loads the full model into memory at all. It iterates through each `.bin` or `.safetensors` shard and applies the LoRA weights to each module as needed, which is far more memory-efficient than the standard approach (see the sketch below).
- New file `lora_merge_efficient` with the core implementation.
- New parameter `merge_method`: standard (legacy) / memory_efficient.
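A minimal sketch of the shard-wise idea, assuming `.safetensors` shards; the helper name `merge_shard` and the exact delta application are illustrative, not the actual implementation in `lora_merge_efficient`:

```python
# Sketch: load one shard at a time, add the LoRA delta (scale * B @ A) to any
# weight that has a matching adapter pair, save the shard, and free memory.
# `merge_shard` and the lora_weights layout are illustrative assumptions.
import gc

import torch
from safetensors.torch import load_file, save_file


def merge_shard(shard_path, out_path, lora_weights, scale):
    """lora_weights maps module name -> (lora_A, lora_B); scale = lora_alpha / r."""
    shard = load_file(shard_path)  # only this shard is resident in memory
    for name, weight in shard.items():
        key = name.removesuffix(".weight")
        if key in lora_weights:
            lora_a, lora_b = lora_weights[key]  # A: (r, in), B: (out, r)
            delta = scale * (lora_b.to(torch.float32) @ lora_a.to(torch.float32))
            shard[name] = (weight.to(torch.float32) + delta).to(weight.dtype)
    save_file(shard, out_path)
    del shard
    gc.collect()  # per-shard cleanup keeps peak memory to roughly one shard
```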
Motivation and Context
#1679
References
qlora-pipe/tools/merge_lora.py
Tests
Tested with examples/llama-3/qlora-1b.yml using TinyLlama 1B Instruct and merge_method: memory_efficient.
Summary by CodeRabbit
- New Features
  - Adds a memory-efficient LoRA merge that processes model shards without loading the full model; includes a legacy in-memory merge fallback when needed.
- Chores
  - Configurable merge method (default: memory_efficient), improved logging (method choice and per-shard progress), clearer CLI messaging, and safer merged output handling.
- Documentation
  - Updated config schema and docstrings to describe both merge strategies; public API unchanged.
📝 Walkthrough
Adds a shard-wise, memory-efficient LoRA merging utility and integrates it into the CLI with a dispatch that prefers the memory-efficient method (default) and falls back to the legacy in-memory merge on RuntimeError; also adds a merge_method PEFT config field defaulting to "memory_efficient".
Changes
| Cohort / File(s) | Change Summary |
|---|---|
| CLI merge dispatch & helpers<br>`src/axolotl/cli/merge_lora.py` | Add `merge_method` handling (default `"memory_efficient"`); log chosen method; import `merge_lora_sharded_efficient`; implement `_do_merge_lora_legacy` (in-memory) and `_do_merge_lora_efficient` (shard-wise); update `do_merge_lora` to dispatch with a RuntimeError fallback to legacy; adjust CLI messages and docstring. |
| Memory-efficient LoRA merge utility<br>`src/axolotl/utils/lora_merge_efficient.py` | New module implementing `get_model_shards`, `find_lora_weights`, `copy_non_model_files`, and `merge_lora_sharded_efficient`. Supports `.safetensors` and `.bin` shards, reads the adapter config to compute the scale, applies per-shard LoRA deltas without loading the full model, preserves safetensors metadata when possible, copies non-model files, and performs per-shard memory cleanup and logging. |
| PEFT schema update<br>`src/axolotl/utils/schemas/peft.py` | Add `merge_method: Literal["legacy","memory_efficient"]` field defaulting to `"memory_efficient"`. |
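A rough sketch of the dispatch and fallback described in the table above; the function names follow the change summary, but the signatures and argument plumbing are simplified assumptions, not copied from the PR:

```python
# Sketch of the merge_method dispatch with a RuntimeError fallback to the
# legacy in-memory merge. _do_merge_lora_efficient / _do_merge_lora_legacy are
# the helpers named in the change summary, assumed to be defined alongside.
import logging

LOG = logging.getLogger(__name__)


def do_merge_lora(cfg, cli_args):
    method = getattr(cfg, "merge_method", "memory_efficient")
    LOG.info("Merging LoRA adapter using the '%s' method", method)
    if method == "memory_efficient":
        try:
            _do_merge_lora_efficient(cfg, cli_args)  # shard-wise merge
            return
        except RuntimeError as err:
            LOG.warning(
                "Shard-wise merge failed (%s); falling back to legacy merge", err
            )
    _do_merge_lora_legacy(cfg, cli_args)  # full in-memory merge
```

In a training config this surfaces as a single `merge_method: memory_efficient` (or `legacy`) entry, matching the schema row above.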
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Codecov Report
:x: Patch coverage is 14.81481% with 138 lines in your changes missing coverage. Please review.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/axolotl/cli/utils/lora_merge.py | 12.58% | 125 Missing :warning: |
| src/axolotl/cli/merge_lora.py | 23.52% | 13 Missing :warning: |
curious if you have any numbers on how much peak VRAM is saved?
benchmarks coming soon
@ved1beta could you also ensure the weights/logits produced by a model which was merged using the legacy vs. memory efficient method are identical?
This should be ensured by the test run?
Tested with examples/llama-3/qlora-1b.yml using TinyLlama 1B Instruct and merge_method: memory_efficient.
Were you able to train a lora, and then merge using both the legacy and memory efficient methods to verify identical merged weights from both methods?
Yes, I tried merging both ways as you mentioned earlier; here is the training output: [slack link]( https://ai-axolotl.slack.com/files/U09BE3G7ZED/F09BNKLDDNZ/untitled?origin_team=T05A3MTMVB8&origin_channel=D09BE3HMM7B )
I have a Claude-generated script that checks the merged model weights are identical; it passes for the checkpoint produced by that training run.
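Roughly, such a check can compare the two merged outputs tensor by tensor; the paths below are placeholders, not the actual script:

```python
# Sketch: verify that the legacy and memory-efficient merges produce identical
# weights. Paths are placeholders for the two merged output directories.
import torch
from safetensors.torch import load_file

legacy = load_file("merged_legacy/model.safetensors")
efficient = load_file("merged_efficient/model.safetensors")

assert legacy.keys() == efficient.keys(), "tensor names differ between merges"
for name in legacy:
    if not torch.equal(legacy[name], efficient[name]):
        diff = (legacy[name].float() - efficient[name].float()).abs().max().item()
        print(f"{name}: max abs diff {diff}")
```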
Memory usage for both merge methods, measured with a simple test script:
| Method | Peak GPU Memory | Peak CPU Memory | Execution Time |
|---|---|---|---|
| Memory-efficient | 300 MB | 14.4 MB | 12.0 s |
| Legacy | 2,914 MB | 14.4 MB | 15.9 s |
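For context, numbers like these can be gathered with a small wrapper around either merge path; the actual test script is not part of this PR, and `run_merge` below is a placeholder:

```python
# Sketch: measure peak GPU memory and wall-clock time for a merge run.
# `run_merge` is a placeholder callable wrapping either merge method;
# CPU peak memory would need e.g. psutil or resource and is omitted here.
import time

import torch


def benchmark(run_merge):
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    run_merge()
    elapsed = time.perf_counter() - start
    peak_gpu_mb = torch.cuda.max_memory_allocated() / 1024**2
    print(f"Peak GPU memory: {peak_gpu_mb:.0f} MB, time: {elapsed:.1f} s")
```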