
DPO format - RuntimeError: Expected a string, got None

Katehuuh opened this issue 1 year ago • 0 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

5e6f808 format DPO Dataset

name_dpo.json:

[
  {
    "instruction": "Last Question",
    "input": "",
    "output": [
      "Last Question, chosen answer",
      "Last Question, rejected answer"
    ],
    "history": [
      [
        "Hello",
        "Hello!"
      ],
      [
        "Describe ... ",
        "Answer2"
      ]
    ]
  },
...
]

dataset_info.json:

  "name_dpo": {
    "file_name": "name_dpo.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output",
      "history": "history"
    },
    "ranking": true
  },
Log
Generating train split: 305869 examples [00:14, 20697.01 examples/s]
Converting format of dataset: 100%|██████████████████████████████████| 100000/100000 [00:04<00:00, 23519.41 examples/s]
Running tokenizer on dataset:   7%|██▋                                   | 7000/100000 [00:29<06:34, 235.63 examples/s]
Traceback (most recent call last):
  File "C:\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\LLaMA-Factory\venv\Scripts\llamafactory-cli.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\cli.py", line 33, in main
    run_exp()
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\train\tuner.py", line 41, in run_exp
    run_orpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\train\orpo\workflow.py", line 29, in run_orpo
    dataset = get_dataset(model_args, data_args, training_args, stage="rm", **tokenizer_module)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\loader.py", line 164, in get_dataset
    dataset = dataset.map(preprocess_func, batched=True, remove_columns=column_names, **kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3547, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "C:\LLaMA-Factory\venv\lib\site-packages\datasets\arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\preprocess.py", line 242, in preprocess_pairwise_dataset
    _, rejected_ids = template.encode_oneturn(
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\template.py", line 45, in encode_oneturn
    encoded_pairs = self._encode(tokenizer, messages, system, tools, cutoff_len, reserved_label_len)
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\template.py", line 94, in _encode
    elements += self.format_assistant.apply(content=message["content"])
  File "C:\LLaMA-Factory\venv\lib\site-packages\llmtuner\data\formatter.py", line 116, in apply
    raise RuntimeError("Expected a string, got {}".format(value))
RuntimeError: Expected a string, got None
I did the following:
  • Removed empty strings and added blanks, converted every value to a string, removed possible None values, and double-checked that only strings were left for every key.
  • Split the data into two halves, but it still consistently stops at 7% during the "Running tokenizer" step. In the end, I created a workaround by modifying the preprocess_pairwise_dataset function in preprocess.py to handle cases where message["content"] is None: if it is, it is replaced with a placeholder string for both chosen_messages and rejected_messages.
preprocess_pairwise_dataset
def preprocess_pairwise_dataset(
    examples: Dict[str, List[Any]],
    template: "Template",
    tokenizer: "PreTrainedTokenizer",
    processor: Optional["ProcessorMixin"],
    data_args: "DataArguments",
) -> Dict[str, List[List[int]]]:
    # build input pairs with format `<bos> X`, `Y1 <eos>` and `Y2 <eos>`
    model_inputs = {"prompt_ids": [], "chosen_ids": [], "rejected_ids": []}
    if processor is not None:
        model_inputs["pixel_values"] = []
        preprocess_visual_inputs = partial(_preprocess_visual_inputs, processor=processor)

    for i in range(len(examples["prompt"])):
        if len(examples["prompt"][i]) % 2 != 1 or len(examples["response"][i]) < 2:
            continue

        if processor is not None:
            examples["prompt"][i][0]["content"] = "<image>" + examples["prompt"][i][0]["content"]

        chosen_messages = examples["prompt"][i] + [examples["response"][i][0]]
        rejected_messages = examples["prompt"][i] + [examples["response"][i][1]]

        # Replace None content with a placeholder string so encoding does not fail
        for message in chosen_messages + rejected_messages:
            if message["content"] is None:
                message["content"] = "null"

        prompt_ids, chosen_ids = template.encode_oneturn(
            tokenizer,
            chosen_messages,
            examples["system"][i],
            examples["tools"][i],
            data_args.cutoff_len,
            data_args.reserved_label_len,
        )
        _, rejected_ids = template.encode_oneturn(
            tokenizer,
            rejected_messages,
            examples["system"][i],
            examples["tools"][i],
            data_args.cutoff_len,
            data_args.reserved_label_len,
        )

        if template.efficient_eos:
            chosen_ids += [tokenizer.eos_token_id]
            rejected_ids += [tokenizer.eos_token_id]

        model_inputs["prompt_ids"].append(prompt_ids)
        model_inputs["chosen_ids"].append(chosen_ids)
        model_inputs["rejected_ids"].append(rejected_ids)
        if processor is not None:
            model_inputs["pixel_values"].append(preprocess_visual_inputs(examples["images"][i]))

    return model_inputs
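An alternative to patching the installed preprocess.py would be to clean the dataset file once, up front. A minimal sketch, assuming the file layout shown above (the function name and file paths are placeholders, not part of LLaMA-Factory):

```python
# Sketch: drop (or blank out) rows whose chosen/rejected answers are null,
# before handing the file to LLaMA-Factory. Paths are placeholders.
import json

def clean_dataset(src: str, dst: str, drop: bool = True) -> int:
    """Return the number of rows that contained a None answer."""
    with open(src, encoding="utf-8") as f:
        data = json.load(f)

    cleaned, bad = [], 0
    for row in data:
        if any(v is None for v in row.get("output", [])):
            bad += 1
            if drop:
                continue  # skip the whole preference pair
            # or keep the pair, replacing null answers with empty strings
            row["output"] = [v if v is not None else "" for v in row["output"]]
        cleaned.append(row)

    with open(dst, "w", encoding="utf-8") as f:
        json.dump(cleaned, f, ensure_ascii=False, indent=2)
    return bad
```

Dropping the pair is safer for DPO than blanking the rejected answer, since an empty rejected response still teaches the model a preference.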

A simple JSON check script for LLaMA-Factory would be useful, as I cannot identify the offending line or format issue when using the LLaMA-Factory dataset loader.
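A check along those lines can be scripted outside LLaMA-Factory. A minimal sketch, assuming the column layout from the dataset_info.json above; it reports the indices of rows holding any non-string value:

```python
# Sketch: report which rows of a pairwise dataset contain non-string values
# (e.g. null), which is what triggers "Expected a string, got None".
import json

def find_bad_rows(path: str) -> list:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)  # also fails loudly on malformed JSON

    bad = []
    for idx, row in enumerate(data):
        # Every instruction, input, chosen/rejected answer and history turn
        # must be a string for the formatter to accept it.
        values = [row.get("instruction"), row.get("input")]
        values += list(row.get("output", []))
        for turn in row.get("history", []):
            values += list(turn)
        if any(not isinstance(v, str) for v in values):
            bad.append(idx)
    return bad
```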

Edit 1: I've printed all the None values. Only the rejected_messages "content" causes issues; 1-2 were like:

    "output": [
      "Answer1 ....",
      null
    ],

The 300+ others are math/code characters (e.g. !is_none), or "content" that is not part of my dataset...?!

Katehuuh · May 03 '24 05:05