Fine-tuning bitnet-b1.58-2B-4T-bf16 on Korean dataset results in high loss (~3.3–3.6)
Hi, and thanks for the great work on BitNet!
I'm trying to fine-tune microsoft/bitnet-b1.58-2B-4T-bf16 using a Korean dataset (nlpai-lab/kullm-v2) with SFTTrainer.
However, during training the loss stays around 3.3–3.6 and doesn't decrease significantly. Is this expected for Korean fine-tuning?
Here’s a summary of my setup:
- Model: microsoft/bitnet-b1.58-2B-4T-bf16
- Dataset: nlpai-lab/kullm-v2 (Korean dataset: https://huggingface.co/datasets/nlpai-lab/kullm-v2)
- Trainer: SFTTrainer
- Loss remains high even after hundreds of steps.
Fine-tuning code:
!pip install trl
!pip install transformers accelerate
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

torch.cuda.empty_cache()
dataset = load_dataset("nlpai-lab/kullm-v2", split="train")
def preprocess(example):
    instruction = example.get("instruction", "")
    input_text = example.get("input", "")
    output = example.get("output", "")
    if input_text:
        prompt = f"<|user|>\n{instruction}\n{input_text}\n<|assistant|>\n{output}"
    else:
        prompt = f"<|user|>\n{instruction}\n<|assistant|>\n{output}"
    return {"text": prompt}
dataset = dataset.map(preprocess)
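# Alternative I'd also consider (my assumption, not from the model card): if the
# tokenizer ships a chat template, let apply_chat_template build the prompt so
# the special tokens exactly match what the model saw in pre-training. This
# needs the tokenizer loaded below; swap it into dataset.map to try it.
def preprocess_with_template(example):
    messages = [
        {"role": "user", "content": f"{example.get('instruction', '')}\n{example.get('input', '')}".strip()},
        {"role": "assistant", "content": example.get("output", "")},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}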
model_name = "microsoft/bitnet-b1.58-2B-4T-bf16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# No new tokens were added above (pad reuses eos), so this resize is a no-op.
model.resize_token_embeddings(len(tokenizer))
training_args = SFTConfig(
    max_seq_length=512,
    output_dir="./bitnet-korean-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    num_train_epochs=3,
    logging_steps=10,
    save_steps=500,
    bf16=True,
    dataset_text_field="text",
    packing=True,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    processing_class=tokenizer,  # pass the tokenizer so the pad_token fix above is used (older TRL versions call this argument `tokenizer`)
)
print("--- BitNet started fine-tuning ---")
trainer.train()
print("--- BitNet finished fine-tuning ---")
Is this learning rate too low? Or does BitNet require additional preprocessing for non-English languages?
Thanks in advance!
I wonder about this too. I think they don't train in 16-bit but directly in 1.58-bit. I wish the training code were shared.
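From the BitNet b1.58 paper, my understanding is that training keeps full-precision master weights and quantizes them to ternary on the fly in the forward pass, with a straight-through estimator for the backward pass. A rough sketch of that idea (my own reconstruction, not the released code; activation quantization omitted for brevity):

import torch

def weight_quant(w: torch.Tensor) -> torch.Tensor:
    # Absmean quantization from the BitNet b1.58 paper: scale by the mean
    # absolute value, round to the ternary set {-1, 0, +1}, then rescale.
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale

class BitLinear(torch.nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Straight-through estimator: ternary values are used in the matmul,
        # but gradients flow to the full-precision master weights.
        w_q = w + (weight_quant(w) - w).detach()
        return torch.nn.functional.linear(x, w_q, self.bias)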
@kwak513 Not sure if it'll help, but I was trying to replicate the onebitllms script on custom text data and ran into the same problem where the loss doesn't decrease. Changing the training arguments resolved it; these are the ones that differ from yours:
training_args = SFTConfig(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=1e-4,  # taken from the official example
    packing=True,
)
And I removed bf16=True.
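For reference, a sketch of what your full config would look like with those changes folded in (adjust the batch size to whatever fits your GPU):

training_args = SFTConfig(
    max_seq_length=512,
    output_dir="./bitnet-korean-finetuned",
    per_device_train_batch_size=16,   # scale down if you hit OOM
    gradient_accumulation_steps=16,
    learning_rate=1e-4,               # from the official example
    num_train_epochs=3,
    logging_steps=10,
    save_steps=500,
    dataset_text_field="text",
    packing=True,
)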