unsloth
SFTTrainer doesn't work with some datasets due to column key error
Hello, I've been trying to use the SFTTrainer with the vicgalle/alpaca-gpt4 dataset. However, after prepping the dataset in the SFT format, I keep getting this error when I initialize the trainer.
File /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:3025, in Dataset.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
3023 missing_columns = set(remove_columns) - set(self._data.column_names)
3024 if missing_columns:
-> 3025 raise ValueError(
3026 f"Column to remove {list(missing_columns)} not in the dataset. Current columns in the dataset: {self._data.column_names}"
3027 )
3029 load_from_cache_file = load_from_cache_file if load_from_cache_file is not None else is_caching_enabled()
3031 if fn_kwargs is None:
ValueError: Column to remove ['train'] not in the dataset. Current columns in the dataset: ['instruction', 'input', 'output', 'text']
However, the dataset only has the train split when I print it. This only occurs with some datasets, so I suspect this may be a bug.
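For reference, this is roughly what I see when I print the dataset (structure sketch, row count omitted):

from datasets import load_dataset

target_dataset = load_dataset("vicgalle/alpaca-gpt4")
print(target_dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['instruction', 'input', 'output', 'text'],
#         num_rows: ...
#     })
# })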
@JohnnyRacer Oh wait, you need to change train to text I think! It's not the train or test split you want, but rather the column! ['instruction', 'input', 'output', 'text'] are your columns, and text is the column you want.
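Roughly speaking, the split and the column are selected at different levels; something like this (quick sketch):

from datasets import load_dataset

dataset_dict = load_dataset("vicgalle/alpaca-gpt4")  # DatasetDict keyed by split name, e.g. "train"
train_split = dataset_dict["train"]                  # index by split name to get a Dataset
print(train_split.column_names)                      # ['instruction', 'input', 'output', 'text']
# dataset_text_field (and anything passed to remove_columns) should name a column like "text",
# not a split like "train".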
@danielhanchen Sorry, I don't really follow what you mean, since I already specified dataset_text_field="text" in the args when I initialized the SFTTrainer instance. If you don't mind, can you clarify what I need to alter? Here is the snippet I am trying to run, adapted from this example on HF:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import FastLanguageModel
from datasets import load_dataset, ClassLabel
dataset_path = "vicgalle/alpaca-gpt4"
target_dataset = load_dataset(dataset_path)
dataset = target_dataset["train"] # Has the columns : ['instruction', 'input', 'output', 'text']
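# Note: load_cfg, lora_cfg and max_seq_length are defined earlier in my notebook (omitted here for brevity)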
model, tokenizer = FastLanguageModel.from_pretrained(model_name = "unsloth/mistral-7b", **load_cfg)
model = FastLanguageModel.get_peft_model(model, **lora_cfg)
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text", # I already specified the 'text' column here
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        bf16 = True,
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_bnb_8bit"
    ),
)
trainer.train()
@JohnnyRacer I'll check it out! :)
@danielhanchen I think I have solved it: if I add packing=False to the SFTTrainer arguments, it seems to initialize and train fine.
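In other words, the only change to my snippet above is one extra argument to SFTTrainer (sketch; training_args below just stands for the same TrainingArguments as before):

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    packing = False, # explicitly disabling packing avoids the column-removal error for me
    args = training_args,
)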
Regarding "This only occurs with some datasets, I suspect this may be a bug": hey, mind giving me an example of a dataset that works normally with the settings in your first message so I can reproduce?
@OneCodeToRuleThemAll I don't actually remember the exact dataset that worked, since I was just testing a bunch of my own. I think it's this one that worked. It seems like if the training split is generated automatically instead of being explicitly specified, then packing=False is required to make the dataset load correctly. Hope this helps.
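For example, something like this is the kind of case I mean, where the "train" split is generated implicitly by the loader rather than declared by the dataset (hypothetical file name):

from datasets import load_dataset

# No split is declared anywhere, so datasets creates a "train" split automatically
# ("my_alpaca_style_data.json" is just a hypothetical local file)
dataset = load_dataset("json", data_files = "my_alpaca_style_data.json")["train"]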