
Runtime error, invalid examples

berkeleyljj opened this issue 8 months ago · 2 comments

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

I am trying to use the "sharegpt" format to do LoRA fine-tuning with my custom dataset.

Bug:

Traceback (most recent call last):
  File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 267, in _get_preprocessed_dataset
    dataset_processor.print_data_example(next(iter(dataset)))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/llamaf/llamaf/bin/llamafactory-cli", line 10, in <module>
    sys.exit(main())
  File "/home/ubuntu/llamaf/src/llamafactory/cli.py", line 117, in main
    run_exp()
  File "/home/ubuntu/llamaf/src/llamafactory/train/tuner.py", line 107, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "/home/ubuntu/llamaf/src/llamafactory/train/tuner.py", line 69, in _training_function
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/ubuntu/llamaf/src/llamafactory/train/sft/workflow.py", line 51, in run_sft
    dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
  File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 310, in get_dataset
    dataset = _get_preprocessed_dataset(
  File "/home/ubuntu/llamaf/src/llamafactory/data/loader.py", line 272, in _get_preprocessed_dataset
    raise RuntimeError("Cannot find valid samples, check data/README.md for the data format.")
RuntimeError: Cannot find valid samples, check data/README.md for the data format.

Important information (some content omitted for privacy):

  1. It's my custom dataset, following this example format:

     [
       {
         "conversations": [
           {
             "from": "human",
             "value": "You are an expert software engineering manager working on the Expensify repository. You have tasked your team with addressing the following issue:\n\n[HOLD for payment 2023-04-18] ..."
           },
           {
             "from": "function_call",
             "value": "import os\n\n# Write the decision to the required file\ndecision = {\n    \"selected_proposal_id\": 0\n}\nwith open('/app/expensify/manager_decisions.json', 'w') as f:\n    import json\n    json.dump(decision, f)\nprint(\"Decision written successfully.\")"
           },
           {
             "from": "observation",
             "value": "Decision written successfully.\n"
           },
           {
             "from": "gpt",
             "value": "model response ..."
           }
         ]
       }
     ]

  2. I have read similar previous issues. I used a regex-based cleanup script to remove any character my tokenizer does not recognize.

  3. I've also checked data/README.md thoroughly and configured my dataset_info.json according to the sharegpt requirements, like this:

     "traces": {
       "file_name": "traces.json",
       "formatting": "sharegpt",
       "split": "train",
       "columns": {
         "messages": "conversations",
         "system": "system",
         "tools": "tools"
       }
     }

Reproduction

llamafactory-cli train examples/train_lora/{your_file_name}.yaml
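
The YAML is a standard LoRA SFT recipe along these lines (a sketch with placeholder values, not my exact file):

model_name_or_path: Qwen/Qwen2.5-7B-Instruct  # placeholder; any Qwen checkpoint
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: traces  # the entry name registered in dataset_info.json
template: qwen
output_dir: saves/traces-lora  # placeholder output path
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0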

Others

No response

berkeleyljj · Apr 19 '25 21:04

The "Cannot find valid samples" error usually means every conversation was filtered out during preprocessing, and the most likely culprit is the chat template you are using (--template argument). If the template doesn't know how to format the "function_call" or "observation" roles, conversations containing them might be filtered out or cause errors during processing. What value are you passing for the --template argument in your command line? Temporarily use a basic template known to work well with human/gpt roles to see if any data gets processed. If it does, the issue is template compatibility with your custom roles. To isolate it, try a minimal traces.json containing only human/gpt turns:

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "Hello, World!"
      },
      {
        "from": "gpt",
        "value": "42"
      }
    ]
  }
]

Update dataset_info.json to point to this file (or rename it temporarily) and test. And just to be sure the file is valid JSON, run jq . traces.json > /dev/null first.
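
If the minimal file trains, you can also smoke-test your full dataset against a plain human/gpt template with a one-off remap script along these lines (an untested sketch; filenames are placeholders, and it assumes the remapped roles still alternate user/assistant):

import json

# Hypothetical cleanup script: fold tool-call roles into plain human/gpt turns
# so a basic template can process every conversation. Illustrative only.
ROLE_MAP = {"function_call": "gpt", "observation": "human"}

with open("traces.json") as f:  # placeholder filename
    data = json.load(f)

for sample in data:
    for turn in sample["conversations"]:
        # Leave human/gpt turns untouched; remap only the tool-call roles
        turn["from"] = ROLE_MAP.get(turn["from"], turn["from"])

with open("traces_remapped.json", "w") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)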

rzgarespo · Apr 20 '25 03:04

Yes, that was the cause, thanks! However, template is not a command-line argument; it's set in the YAML file. In my case I am using "template: qwen", since that's the model I want to fine-tune. I will try to modify the template script so it covers the two additional sharegpt roles, "function_call" and "observation". Meanwhile, is there any suggested fix? Please let me know!
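
For reference, here is the rough shape of what I plan to try in src/llamafactory/data/template.py, based on my reading of its _register_template / StringFormatter / FunctionFormatter helpers (the slot strings below are my guesses, not the actual built-in qwen definition):

_register_template(
    name="qwen_tools",  # hypothetical name so the built-in "qwen" stays untouched
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}<|im_end|>\n"]),
    # "observation" turns carry tool output back to the model
    format_observation=StringFormatter(slots=["<|im_start|>tool\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    # "function_call" turns are assistant-side tool invocations
    format_function=FunctionFormatter(slots=["{{content}}<|im_end|>\n"], tool_format="qwen"),
    stop_words=["<|im_end|>"],
)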

berkeleyljj · Apr 20 '25 04:04