LLaMA-Factory ValueError: Failed to convert pandas DataFrame to Arrow Table from file

Reminder

[X] I have read the README and searched the existing issues.

System Info

Generating train split: 0 examples [00:00, ? examples/s]
Failed to convert pandas DataFrame to Arrow Table from file '/data/zhaopengfeng/LLaMA-Factory/data/kddcup/openai_track4_0702.json' with error <class 'pyarrow.lib.ArrowInvalid'>: ('cannot mix list and non-list, non-null values', 'Conversion failed for column messages with type object')
Generating train split: 0 examples [00:00, ? examples/s]                                    
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1997, in _prepare_split_single
[rank3]:     for _, table in generator:
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 165, in _generate_tables
[rank3]:     raise ValueError(
[rank3]: ValueError: Failed to convert pandas DataFrame to Arrow Table from file /data/zhaopengfeng/LLaMA-Factory/data/kddcup/openai_track4_0702.json.

[rank3]: The above exception was the direct cause of the following exception:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank3]:     launch()
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank3]:     run_exp()
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank3]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 48, in run_sft
[rank3]:     dataset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/data/loader.py", line 174, in get_dataset
[rank3]:     all_datasets.append(load_single_dataset(dataset_attr, model_args, data_args, training_args))
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/data/loader.py", line 109, in load_single_dataset
[rank3]:     dataset = load_dataset(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/load.py", line 2616, in load_dataset
[rank3]:     builder_instance.download_and_prepare(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1029, in download_and_prepare
[rank3]:     self._download_and_prepare(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1124, in _download_and_prepare
[rank3]:     self._prepare_split(split_generator, **prepare_split_kwargs)
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1884, in _prepare_split
[rank3]:     for job_id, done, content in self._prepare_split_single(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 2040, in _prepare_split_single
[rank3]:     raise DatasetGenerationError("An error occurred while generating the dataset") from e...

Reproduction

Dataset info

  "openai_track4_0702": {
    "file_name": "/data/zhaopengfeng/LLaMA-Factory/data/kddcup/openai_track4_0702.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant",
      "system_tag": "system"
    }
  }

Data example

[
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful online shopping assistant.Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
      },
      {
        "role": "user",
        "content": "Instructions: Explain the product category name\nInput: iPhone\n"
      },
      {
        "role": "assistant",
        "content": "An iPhone is a smartphone designed and manufactured by Apple Inc. that combines mobile phone capabilities with features such as internet access, multimedia playback, and a touch screen interface."
      }
    ]
  },
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful online shopping assistant.Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
      },
      {
        "role": "user",
        "content": "Input: Explain the product type Serving Bowl\n"
      },
      {
        "role": "assistant",
        "content": "A serving bowl is a large bowl used for presenting and serving food, such as salads, vegetables, or side dishes, at the dining table.\n"
      },
      {
        "role": "user",
        "content": "Input: Explain the product type Apron\n"
      },
      {
        "role": "assistant",
        "content": "An apron is a protective garment worn over clothing to shield the wearer from spills, stains, and other kitchen or work-related messes. Aprons are commonly used in cooking, baking, and various crafts or occupations."
      }
    ]
  },...
]
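
The ArrowInvalid message in the log ("cannot mix list and non-list, non-null values" for column messages) suggests that at least one record somewhere in the file has a "messages" value that is not a list. The following is a minimal sketch for locating such records, assuming the file is a top-level JSON array of objects as in the example above. The function name find_non_list_messages is hypothetical and not part of LLaMA-Factory or datasets.

```python
import json


def find_non_list_messages(records, column="messages"):
    """Return (index, type name) for each record whose `column` value
    is present, non-null, and not a list."""
    bad = []
    for i, rec in enumerate(records):
        value = rec.get(column)
        if value is not None and not isinstance(value, list):
            bad.append((i, type(value).__name__))
    return bad


# Small inline sample; for a real file, load it first with:
#   with open(path, encoding="utf-8") as f:
#       records = json.load(f)
sample = json.loads(
    '[{"messages": [{"role": "user", "content": "hi"}]},'
    ' {"messages": {"role": "user", "content": "hi"}}]'
)
for idx, kind in find_non_list_messages(sample):
    print(f"record {idx}: messages is {kind}, expected list")
```

If this reports nothing for the whole file, the mismatch may lie in a nested field instead (for example, a "content" value that is not a string).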

train script

### model
model_name_or_path: /data/zhaopengfeng/models/glm-4-9b-chat

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: openai_track4_0702
template: glm4
cutoff_len: 1024
max_samples: 5000
overwrite_cache: true
preprocessing_num_workers: 16
...

My environment

transformers                      4.42.3
triton                            2.3.0
llamafactory                      0.8.3.dev0   /data/zhaopengfeng/LLaMA-Factory   
...
CUDA Driver 12.5

BTW, the alpaca template works well.

Expected behavior

LoRA SFT

Others

No response

Jul 02 '24 09:07 fzp0424

I also encountered this problem. Have you solved it?

Aug 29 '24 16:08 Winston-Yuan

I am also facing a similar issue. My guess is that it is caused by a dictionary in the labels. Has anyone tried to solve this issue?

Oct 01 '24 03:10 paturi1710

Same issue. Has anyone solved it, please?

Oct 21 '24 03:10 Tanyuxuan008

I have the same problem. I noticed that some of the labels are not of string type. After converting all the labels to strings, it worked for me.
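
A conversion like the one described above could look roughly like this. It is a sketch that assumes sharegpt-style records as in the original post, where the text lives under "content" inside "messages"; for a dataset with a separate label field, the key would need to be adapted. The helper name stringify_contents and the choice to serialize non-string values with json.dumps are assumptions for illustration.

```python
import json


def stringify_contents(records, column="messages", content_key="content"):
    """Return a copy of `records` in which every message's content is a str.

    None becomes an empty string; any other non-string value (number, dict,
    list, bool) is serialized with json.dumps. Both choices are arbitrary
    for this sketch.
    """
    fixed = []
    for rec in records:
        new_rec = dict(rec)
        new_msgs = []
        for msg in rec.get(column, []):
            new_msg = dict(msg)
            value = new_msg.get(content_key)
            if value is None:
                new_msg[content_key] = ""
            elif not isinstance(value, str):
                new_msg[content_key] = json.dumps(value, ensure_ascii=False)
            new_msgs.append(new_msg)
        new_rec[column] = new_msgs
        fixed.append(new_rec)
    return fixed


sample = [
    {
        "messages": [
            {"role": "user", "content": "Input: price?"},
            {"role": "assistant", "content": 42},
            {"role": "assistant", "content": {"a": 1}},
        ]
    }
]
cleaned = stringify_contents(sample)
```

The cleaned records can then be written back with json.dump and the file referenced from the dataset info entry as before.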

Oct 21 '24 04:10 paturi1710

Make sure all your inputs are of string type.

Oct 21 '24 06:10 hiyouga

I have the same issue with a custom dataset. Everything is a string; however, some of the 'content' values are stringified JSON. Could that be the problem?

Dec 03 '24 17:12 majthehero

I have the same issue with a custom dataset. I am sure that everything is of string type, but I still get the same error.

Apr 15 '25 09:04 Flemington7

> I have the same issue with a custom dataset. I am sure that everything is of string type, but I still get the same error.

I ran into the same issue, and it turned out that I had not added question_id to the data.
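
One way to spot this kind of inconsistency is to compare the top-level keys of every record. The sketch below assumes a list of JSON objects; the helper name find_records_missing_keys is hypothetical, and whether a missing key triggers this exact error depends on the loader and the dataset definition.

```python
def find_records_missing_keys(records):
    """Return {index: sorted list of keys} for each record that lacks a
    top-level key present in at least one other record."""
    all_keys = set()
    for rec in records:
        all_keys.update(rec.keys())
    missing = {}
    for i, rec in enumerate(records):
        gap = all_keys - rec.keys()
        if gap:
            missing[i] = sorted(gap)
    return missing


sample = [
    {"question_id": 1, "messages": []},
    {"messages": []},  # question_id was left out of this record
]
for idx, keys in find_records_missing_keys(sample).items():
    print(f"record {idx} is missing: {', '.join(keys)}")
```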

Jul 27 '25 18:07 jasper0314-huang