Fine-tuning bitnet-b1.58-2B-4T-bf16 on Korean dataset results in high loss (~3.3–3.6)
Hi, and thanks for the great work on BitNet!
I'm trying to fine-tune microsoft/bitnet-b1.58-2B-4T-bf16 using a Korean dataset (nlpai-lab/kullm-v2) with SFTTrainer.
However, during training the loss stays around 3.3–3.6 and doesn't decrease significantly. Is this expected for Korean fine-tuning?
Here’s a summary of my setup:
- Model: microsoft/bitnet-b1.58-2B-4T-bf16
- Dataset: nlpai-lab/kullm-v2 (Korean dataset: https://huggingface.co/datasets/nlpai-lab/kullm-v2)
- Trainer: SFTTrainer
- Loss remains high even after hundreds of steps.
Fine-tuning code:
!pip install trl
!pip install transformers accelerate
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

torch.cuda.empty_cache()
dataset = load_dataset("nlpai-lab/kullm-v2", split="train")
def preprocess(example):
    instruction = example.get("instruction", "")
    input_text = example.get("input", "")
    output = example.get("output", "")
    if input_text:
        prompt = f"<|user|>\n{instruction}\n{input_text}\n<|assistant|>\n{output}"
    else:
        prompt = f"<|user|>\n{instruction}\n<|assistant|>\n{output}"
    return {"text": prompt}
dataset = dataset.map(preprocess)
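# Alternative I'd also consider (my assumption, not from the model card): if the
# tokenizer ships a chat template, let apply_chat_template build the prompt so
# the special tokens exactly match what the model saw in pre-training. This
# needs the tokenizer loaded below; swap it into dataset.map to try it.
def preprocess_with_template(example):
    messages = [
        {"role": "user", "content": f"{example.get('instruction', '')}\n{example.get('input', '')}".strip()},
        {"role": "assistant", "content": example.get("output", "")},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}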
model_name = "microsoft/bitnet-b1.58-2B-4T-bf16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# No new tokens were added above (pad reuses eos), so this resize is a no-op.
model.resize_token_embeddings(len(tokenizer))
training_args = SFTConfig(
    max_seq_length=512,
    output_dir="./bitnet-korean-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=500,
    bf16=True,
    dataset_text_field="text",
    packing=True,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    processing_class=tokenizer,  # pass the tokenizer so the pad_token fix above is used (older TRL versions call this argument `tokenizer`)
)
print("--- BitNet started fine-tuning ---")
trainer.train()
print("--- BitNet finished fine-tuning ---")
Is this learning rate too low? Or does BitNet require additional preprocessing for non-English languages?
Thanks in advance!
I wonder about this too. I think they don't train in 16-bit but directly in 1.58-bit. I wish the training code were shared.
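From the BitNet b1.58 paper, my understanding is that training keeps full-precision master weights and quantizes them to ternary on the fly in the forward pass, with a straight-through estimator for the backward pass. A rough sketch of that idea (my own reconstruction, not the released code; activation quantization omitted for brevity):

import torch

def weight_quant(w: torch.Tensor) -> torch.Tensor:
    # Absmean quantization from the BitNet b1.58 paper: scale by the mean
    # absolute value, round to the ternary set {-1, 0, +1}, then rescale.
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale

class BitLinear(torch.nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Straight-through estimator: ternary values are used in the matmul,
        # but gradients flow to the full-precision master weights.
        w_q = w + (weight_quant(w) - w).detach()
        return torch.nn.functional.linear(x, w_q, self.bias)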
@kwak513 Not sure if it'll help, but I was trying to replicate the onebitllms script on custom text data and ran into the same problem where the loss doesn't decrease. Changing the training arguments resolved it; these are the ones that differ from yours:
training_args = SFTConfig(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=1e-4,  # taken from the official example
    packing=True,
)
And I removed bf16=True.
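For reference, a sketch of what your full config would look like with those changes folded in (adjust the batch size to whatever fits your GPU):

training_args = SFTConfig(
    max_seq_length=512,
    output_dir="./bitnet-korean-finetuned",
    per_device_train_batch_size=16,   # scale down if you hit OOM
    gradient_accumulation_steps=16,
    learning_rate=1e-4,               # from the official example
    num_train_epochs=3,
    logging_steps=10,
    save_steps=500,
    dataset_text_field="text",
    packing=True,
)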