LLaMA-Factory
DPO format - RuntimeError: Expected a string, got None
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
5e6f808 format DPO Dataset
name_dpo.json:
[
  {
    "instruction": "Last Question",
    "input": "",
    "output": [
      "Last Question, chosen answer",
      "Last Question, rejected answer"
    ],
    "history": [
      [
        "Hello",
        "Hello!"
      ],
      [
        "Describe ... ",
        "Answer2"
      ]
    ]
  },
  ...
]
dataset_info.json:
"name_dpo": {
"file_name": "name_dpo.json",
"columns": {
"prompt": "instruction",
"query": "input",
"response": "output",
"history": "history"
},
"ranking": true
},
Log
Generating train split: 305869 examples [00:14, 20697.01 examples/s]
Converting format of dataset: 100%|██████████████████████████████████| 100000/100000 [00:04<00:00, 23519.41 examples/s]
Running tokenizer on dataset: 7%|██▋ | 7000/100000 [00:29<06:34, 235.63 examples/s]
Traceback (most recent call last):
File "C:\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\LLaMA-Factory\venv\Scripts\llamafactory-cli.exe\__main__.py", line 7, in <module>
sys.exit(main())
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\cli.py", line 33, in main
run_exp()
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\train\tuner.py", line 41, in run_exp
run_orpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\train\orpo\workflow.py", line 29, in run_orpo
dataset = get_dataset(model_args, data_args, training_args, stage="rm", **tokenizer_module)
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\loader.py", line 164, in get_dataset
dataset = dataset.map(preprocess_func, batched=True, remove_columns=column_names, **kwargs)
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 602, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3156, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3547, in _map_single
batch = apply_function_on_filtered_inputs(
File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\preprocess.py", line 242, in preprocess_pairwise_dataset
_, rejected_ids = template.encode_oneturn(
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\template.py", line 45, in encode_oneturn
encoded_pairs = self._encode(tokenizer, messages, system, tools, cutoff_len, reserved_label_len)
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\template.py", line 94, in _encode
elements += self.format_assistant.apply(content=message["content"])
File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\formatter.py", line 116, in apply
raise RuntimeError("Expected a string, got {}".format(value))
RuntimeError: Expected a string, got None
What I tried:
- Removed empty strings and added blanks, converted every value to a string, removed possible `None` values, and double-checked that only strings were left for every key.
- Split the data into two halves, but it still consistently stops at 7% during the "Running tokenizer on dataset" step.

In the end, I created a workaround by modifying the `preprocess_pairwise_dataset` function in `preprocess.py` to handle cases where `message["content"]` is `None`: if it is `None`, it is replaced with an empty string for both `chosen_messages` and `rejected_messages`.
preprocess_pairwise_dataset
def preprocess_pairwise_dataset(
    examples: Dict[str, List[Any]],
    template: "Template",
    tokenizer: "PreTrainedTokenizer",
    processor: Optional["ProcessorMixin"],
    data_args: "DataArguments",
) -> Dict[str, List[List[int]]]:
    # build input pairs with format `<bos> X`, `Y1 <eos>` and `Y2 <eos>`
    model_inputs = {"prompt_ids": [], "chosen_ids": [], "rejected_ids": []}
    if processor is not None:
        model_inputs["pixel_values"] = []
        preprocess_visual_inputs = partial(_preprocess_visual_inputs, processor=processor)
    for i in range(len(examples["prompt"])):
        if len(examples["prompt"][i]) % 2 != 1 or len(examples["response"][i]) < 2:
            continue
        if processor is not None:
            examples["prompt"][i][0]["content"] = "<image>" + examples["prompt"][i][0]["content"]
        chosen_messages = examples["prompt"][i] + [examples["response"][i][0]]
        rejected_messages = examples["prompt"][i] + [examples["response"][i][1]]
        # Workaround: replace None content with an empty string so encoding does not fail
        for message in chosen_messages + rejected_messages:
            if message["content"] is None:
                message["content"] = ""
        prompt_ids, chosen_ids = template.encode_oneturn(
            tokenizer,
            chosen_messages,
            examples["system"][i],
            examples["tools"][i],
            data_args.cutoff_len,
            data_args.reserved_label_len,
        )
        _, rejected_ids = template.encode_oneturn(
            tokenizer,
            rejected_messages,
            examples["system"][i],
            examples["tools"][i],
            data_args.cutoff_len,
            data_args.reserved_label_len,
        )
        if template.efficient_eos:
            chosen_ids += [tokenizer.eos_token_id]
            rejected_ids += [tokenizer.eos_token_id]
        model_inputs["prompt_ids"].append(prompt_ids)
        model_inputs["chosen_ids"].append(chosen_ids)
        model_inputs["rejected_ids"].append(rejected_ids)
        if processor is not None:
            model_inputs["pixel_values"].append(preprocess_visual_inputs(examples["images"][i]))
    return model_inputs
A simple JSON validation script for LLaMA-Factory would be useful, since I cannot identify the offending line or format issue when using the LLaMA-Factory dataset loader.
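In the meantime, such a check can be approximated with a standalone script. This is a sketch, not part of LLaMA-Factory: it assumes the alpaca-style ranking layout shown above (`instruction`/`input`/`output`/`history`) and reports the index of every record whose fields contain `None` or a non-string value, so bad records can be found before the tokenizer crashes.

```python
import json


def find_bad_records(path):
    """Return (index, problem) pairs for records containing None or non-string values."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    problems = []
    for i, rec in enumerate(data):
        # instruction and input must be plain strings
        for key in ("instruction", "input"):
            if not isinstance(rec.get(key), str):
                problems.append((i, "{} is {!r}".format(key, rec.get(key))))
        # output must be a list of at least two strings (chosen, rejected)
        outputs = rec.get("output", [])
        if not (isinstance(outputs, list) and len(outputs) >= 2):
            problems.append((i, "output must be a list of >= 2 strings"))
        else:
            for j, ans in enumerate(outputs):
                if not isinstance(ans, str):
                    problems.append((i, "output[{}] is {!r}".format(j, ans)))
        # each history turn must be a [user, assistant] pair of strings
        for j, turn in enumerate(rec.get("history", [])):
            if not (isinstance(turn, list) and len(turn) == 2
                    and all(isinstance(t, str) for t in turn)):
                problems.append((i, "history[{}] is malformed: {!r}".format(j, turn)))
    return problems


if __name__ == "__main__":
    for idx, why in find_bad_records("name_dpo.json"):
        print("record {}: {}".format(idx, why))
```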
Edit1:
I've printed all of the None values. Only "content" in rejected_messages causes issues; one or two records looked like this:
"output": [
"Answer1 ....",
null
],
The 300+ others are math/code characters (e.g. !is_none) or "content" values that are not even part of my dataset...?!
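Given records like the one above, an alternative to patching preprocess.py is to repair the dataset file itself before training. A minimal sketch, assuming the same name_dpo.json layout: it replaces any null entry in a record's `output` list with an empty string and writes a patched copy.

```python
import json


def patch_null_outputs(src, dst, fill=""):
    """Replace None entries in each record's 'output' list with `fill`.

    Returns the number of entries patched.
    """
    with open(src, encoding="utf-8") as f:
        data = json.load(f)
    patched = 0
    for rec in data:
        outputs = rec.get("output", [])
        for j, ans in enumerate(outputs):
            if ans is None:
                outputs[j] = fill
                patched += 1
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return patched


if __name__ == "__main__":
    n = patch_null_outputs("name_dpo.json", "name_dpo_patched.json")
    print("patched {} null output entries".format(n))
```

Pointing `file_name` in dataset_info.json at the patched copy then avoids touching the installed llmtuner package at all.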