Very slow training speed with CURLoRA on Llama 3.1 8B Instruct
I am currently fine-tuning the Llama 3.1 8B Instruct model using CURLoRA adapters on a single RTX 4090 GPU.
Problem:
- It takes ~170 seconds per step (batch) during training.
- Estimated time to complete one epoch is over 14 days.
- Estimated full 5-epoch training would take around two months at the current speed.
- The process crashes halfway through.
Question:
- Is this extremely slow training expected when fine-tuning Llama 3.1 8B models with CURLoRA on a 4090?
- Is there anything I can optimize further while still using CURLoRA (e.g., sequence length, optimizer settings)?
Additional Notes:
- GPU utilization is high (close to 100%) during training.
- VRAM usage is around 22.5 GB out of 24 GB (the 4090 is almost fully loaded).
Yes, I understand your issue and it is valid. CURLoRA is neither optimized nor does it support quantization. The slowness comes mainly from the three matrix multiplications among the C, U, and R matrices, and the lack of quantization support can make it slower still. The initial purpose of the research was to mitigate catastrophic forgetting and to show theoretically and mathematically how this is possible via CURLoRA, with the optimization work left for the future. The main obstacle right now is that I don't have time to work on the code optimization, unfortunately, but I will do so once I have time. So for now, please feel free to create a PR with any solution you find helpful. That would be highly appreciated. Thanks a lot for raising this and for reaching out.
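To make the overhead concrete, here is a rough, hypothetical sketch of a CURLoRA-style linear layer. It is not the repository's actual implementation: the column/row selection is simplified (the paper samples them with inverted probabilities), the class name is made up, and I am assuming the LoRA-style additive form with a zero-initialized U. It just shows where the three extra matrix multiplications per forward pass come from.

import torch
import torch.nn as nn

class CURLinearSketch(nn.Module):
    """Hypothetical sketch: frozen base weight plus a C @ U @ R adapter; only U trains."""

    def __init__(self, base_linear: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False                      # base weights stay frozen

        W = base_linear.weight.data                      # (out_features, in_features)
        # Simplified selection; CURLoRA samples columns/rows with inverted probabilities.
        self.register_buffer("C", W[:, :rank].clone())   # (out, r), frozen
        self.register_buffer("R", W[:rank, :].clone())   # (r, in), frozen
        self.U = nn.Parameter(torch.zeros(rank, rank, dtype=W.dtype))  # zero-init, trainable

    def forward(self, x):
        # Adapter path x @ (C U R)^T = ((x @ R^T) @ U^T) @ C^T adds three matmuls
        # on every forward call, which is the main source of the per-step slowdown.
        delta = ((x @ self.R.t()) @ self.U.t()) @ self.C.t()
        return self.base(x) + delta

Since U starts at zero, the adapter contributes nothing at step 0, mirroring LoRA's zero-initialized B matrix; the real code also has to handle the CUR sampling itself and adapter saving/loading.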
Hi, thanks so much for your detailed explanation and for clarifying the current state of optimization and quantization in CURLoRA. I wanted to let you know that adding the following lines really helped me train the model despite the memory constraints:
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
model.config.use_cache = False
These reduced VRAM usage significantly during training, allowing me to run the experiments on my hardware. I also noticed that without model.enable_input_require_grads(), gradient checkpointing did not work properly, though I'm not entirely sure why it's required in this case.
I understand this doesn’t address the underlying optimization issues you mentioned, but it made a practical difference for now. I appreciate all the work you’ve put into this project! Looking forward to future updates!
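For anyone hitting the same memory ceiling, here is a minimal sketch of where those three calls sit in a typical Hugging Face Trainer setup. The model id, TrainingArguments values, and output path are illustrative placeholders, not my actual configuration:

import torch
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # illustrative model id
    torch_dtype=torch.bfloat16,
)
# ... attach CURLoRA adapters and freeze the base weights here ...

model.enable_input_require_grads()        # makes the embedding outputs require grad so the
                                          # checkpointed blocks still build a backward graph
                                          # even though all base parameters are frozen
model.gradient_checkpointing_enable()     # recompute activations in backward to save VRAM
model.config.use_cache = False            # the KV cache is incompatible with checkpointing

training_args = TrainingArguments(
    output_dir="curlora-out",             # placeholder values
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
)
# trainer = Trainer(model=model, args=training_args, train_dataset=...)
# trainer.train()

Note that gradient checkpointing trades compute for memory, so each step gets somewhat slower; it helps with VRAM, not with the 170 s/step issue itself.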
Thank you so much for your comment and feedback, I really appreciate it, and thanks a lot for sharing your fix. I will keep the issue open so that you are updated whenever I work on the optimization. Thanks!
Other than that, can I ask for advice on training a 22k-sample dataset on an 8B model? Since CURLoRA freezes the base model, will it require more epochs or a higher CURLoRA rank to effectively learn the new knowledge?
Sorry, can I ask how to perform inference with the saved model or checkpoint? When I use it directly, it produces random results.
Here is the code I am using:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# ── 1. Load tokenizer & model ───────────────────────────────────────────────────
model_id = "final_curlora_merged_model_rank16"
print("Loading model – please wait …")
tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
print("Model loaded ✔")

# ── 2. The questions you want answered ─────────────────────────────────────────
sample_questions = [
    "What is the second OWASP 2023 Top 10 API vulnerability?",
    "What is MITRE ATT&CK technique T1566.001, and how can organizations defend against it?",
]

# (Optional) system instruction – prepend this if your model benefits from it.
system_msg = {
    "role": "system",
    "content": (
        "You are a helpful and knowledgeable cybersecurity assistant. "
        "You only answer questions related to cybersecurity. If a question is unclear "
        "or outside your expertise, respond with 'I don't know'. Do not hallucinate."
    ),
}

# ── 3. Ask each question and print the answer ──────────────────────────────────
for i, q in enumerate(sample_questions, 1):
    # Build a chat-formatted prompt → tensor
    prompt = tokenizer.apply_chat_template(
        [system_msg, {"role": "user", "content": q}],
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    # Generate the answer
    with torch.no_grad():
        out = model.generate(
            prompt,
            max_new_tokens=256,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode *only* the new text (skip the prompt)
    answer = tokenizer.decode(out[0][prompt.shape[-1]:], skip_special_tokens=True)
    print(f"\n🧠 Response {i}:\n{answer}\n")
Did your training converge?