
Very slow training speed with CURLoRA on Llama 3.1 8B Instruct

Open NEWbie0709 opened this issue 8 months ago • 7 comments

I am currently fine-tuning the Llama 3.1 8B Instruct model using CURLoRA adapters on a single RTX 4090 GPU.

Image

Problem:

  • It takes ~170 seconds per step (batch) during training.

  • Estimated time to complete one epoch is over 14 days.

  • Estimated full 5-epoch training would take around 2+ months at current speed.

  • The process crashes halfway through.

Question:

  • Is this extremely slow training expected when fine-tuning Llama 3.1 8B models with CURLoRA on a 4090?

  • Is there anything I can optimize further while still using CURLoRA? (e.g., sequence length, optimizer settings, etc.)

Additional Notes:

  • GPU utilization is high (close to 100%) during training.

  • VRAM usage is around 22.5 GB out of 24 GB (4090 almost fully loaded).

NEWbie0709 avatar Apr 28 '25 07:04 NEWbie0709

Yes, I understand your issue and it is valid. CURLoRA is currently neither optimized nor does it support quantization. The slowness comes mainly from the three matrix multiplications among the C, U, and R matrices, and the lack of quantization support can make it even slower. The initial purpose of the research was to mitigate catastrophic forgetting and to show, theoretically and mathematically, how that is possible via CURLoRA, with the optimization work to be addressed later. The main obstacle right now is that I unfortunately don't have time to work on the code optimization, but I will do so once I can. For now, please feel free to create a PR with any solution you find helpful; that would be highly appreciated. Thanks a lot for raising this and for reaching out.
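
To illustrate where those multiplications happen, here is a rough sketch of a CURLoRA-style linear layer (simplified for illustration: the class name and the uniform sampling are assumptions, not the exact code in this repo):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CURLoRALinearSketch(nn.Module):
    """Illustrative CURLoRA-style adapter around a frozen nn.Linear.
    C and R are fixed column/row samples of the base weight; only the small
    U matrix (rank x rank, zero-initialized) is trained."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False

        out_features, in_features = base.weight.shape
        W = base.weight.data
        # The paper samples columns/rows with inverted-probability weighting;
        # uniform sampling is used here purely for brevity.
        col_idx = torch.randperm(in_features)[:rank]
        row_idx = torch.randperm(out_features)[:rank]
        self.register_buffer("C", W[:, col_idx].clone())  # (out, rank), frozen
        self.register_buffer("R", W[row_idx, :].clone())  # (rank, in), frozen
        self.U = nn.Parameter(torch.zeros(rank, rank))    # the only trainable part

    def forward(self, x):
        # The extra C @ U @ R products on every forward and backward pass are
        # the main source of the per-step overhead discussed above.
        delta_W = self.C @ self.U @ self.R                # (out, in)
        return self.base(x) + F.linear(x, delta_W)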

MNoorFawi avatar Apr 30 '25 17:04 MNoorFawi

Hi, thanks so much for your detailed explanation and for clarifying the current state of optimization and quantization in CURLoRA. I wanted to let you know that adding the following lines really helped me train the model despite the memory constraints:

model.enable_input_require_grads()
model.gradient_checkpointing_enable()
model.config.use_cache = False

These reduced VRAM usage significantly during training, allowing me to run the experiments on my hardware. I also noticed that without model.enable_input_require_grads(), I wasn’t able to use gradient checkpointing properly—though I’m not entirely sure why it’s required in this case.

I understand this doesn’t address the underlying optimization issues you mentioned, but it made a practical difference for now. I appreciate all the work you’ve put into this project! Looking forward to future updates!
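
In case it helps anyone else hitting the same memory wall, this is roughly how those three lines fit into a standard Transformers training setup (the model path, TrainingArguments values, and train_dataset below are placeholders, not my exact configuration):

import torch
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)

# With the base model frozen, the embedding outputs don't require gradients,
# so checkpointed segments would have nothing to backpropagate through; this
# hook forces the embedding output to require grad so that gradients still
# reach the trainable adapter weights.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with checkpointing

args = TrainingArguments(
    output_dir="curlora_out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # trade extra steps for lower peak memory
    bf16=True,
    logging_steps=10,
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()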

NEWbie0709 avatar May 01 '25 13:05 NEWbie0709

Thank you so much for your comment and feedback, I really appreciate it, and thanks a lot for sharing your fix. I will keep the issue open so that you are updated whenever I work on the optimization. Thanks.

MNoorFawi avatar May 01 '25 19:05 MNoorFawi

Other than that, can I ask for advice on training a 22k-sample dataset on an 8B model? Since CURLoRA freezes the base model, will it require more epochs or a higher CURLoRA rank to effectively learn the new knowledge?

NEWbie0709 avatar May 02 '25 04:05 NEWbie0709

Sorry, can I ask how to perform inference with the saved model or checkpoint? Because when I use it directly, it shows random results

Image

NEWbie0709 avatar May 05 '25 05:05 NEWbie0709

Here is the code I am using:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# ── 1.  Load tokenizer & model ───────────────────────────────────────────────────
model_id = "final_curlora_merged_model_rank16"         

print("Loading model – please wait …")
tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
print("Model loaded ✔")

# ── 2.  The questions you want answered ─────────────────────────────────────────
sample_questions = [
    "What is the second OWASP 2023 Top 10 API vulnerability?",
    "What is MITRE ATT&CK technique T1566.001, and how can organizations defend against it?",
]

# (Optional) system instruction – prepend this if your model benefits from it.
system_msg = {
    "role": "system",
    "content": (
        "You are a helpful and knowledgeable cybersecurity assistant. "
        "You only answer questions related to cybersecurity. If a question is unclear "
        "or outside your expertise, respond with 'I don't know'. Do not hallucinate."
    )
}

# ── 3.  Ask each question and print the answer ──────────────────────────────────
for i, q in enumerate(sample_questions, 1):
    # Build a chat-formatted prompt → tensor
    prompt = tokenizer.apply_chat_template(
        [system_msg, {"role": "user", "content": q}],
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)

    # Generate the answer
    with torch.no_grad():
        out = model.generate(
            prompt,
            max_new_tokens=256,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode *only* the new text (skip the prompt)
    answer = tokenizer.decode(out[0][prompt.shape[-1]:], skip_special_tokens=True)

    print(f"\n🧠 Response {i}:\n{answer}\n")

NEWbie0709 avatar May 05 '25 06:05 NEWbie0709

Sorry, can I ask how to perform inference with the saved model or checkpoint? Because when I use it directly, it shows random results

Image

Did your training converge?
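
Also, one quick sanity check worth trying (a hypothetical diagnostic, not part of this repo) is to confirm that the merged checkpoint actually differs from the base weights by a small delta rather than being untrained or corrupted, which would explain random generations:

import torch
from transformers import AutoModelForCausalLM

# Load both models just to compare one projection weight; the parameter name
# below follows the standard Llama layout in transformers.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)
merged = AutoModelForCausalLM.from_pretrained(
    "final_curlora_merged_model_rank16", torch_dtype=torch.bfloat16
)

name = "model.layers.0.self_attn.q_proj.weight"
w_base = dict(base.named_parameters())[name].float()
w_merged = dict(merged.named_parameters())[name].float()
print("relative diff:", ((w_merged - w_base).norm() / w_base.norm()).item())
# close to 0 -> the adapter delta was never merged (outputs should then match the base model)
# very large -> the saved weights are untrained or corrupted, which would produce random text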

shiwanghua avatar Aug 07 '25 09:08 shiwanghua