ValueError: Failed to convert pandas DataFrame to Arrow Table from file
Reminder
- [X] I have read the README and searched the existing issues.
System Info
Generating train split: 0 examples [00:00, ? examples/s]Failed to convert pandas DataFrame to Arrow Table from file '/data/zhaopengfeng/LLaMA-Factory/data/kddcup/openai_track4_0702.json' with error <class 'pyarrow.lib.ArrowInvalid'>: ('cannot mix list and non-list, non-null values', 'Conversion failed for column messages with type object')
Generating train split: 0 examples [00:00, ? examples/s]
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1997, in _prepare_split_single
[rank3]:     for _, table in generator:
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 165, in _generate_tables
[rank3]:     raise ValueError(
[rank3]: ValueError: Failed to convert pandas DataFrame to Arrow Table from file /data/zhaopengfeng/LLaMA-Factory/data/kddcup/openai_track4_0702.json.
[rank3]: The above exception was the direct cause of the following exception:
[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank3]:     launch()
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank3]:     run_exp()
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank3]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 48, in run_sft
[rank3]:     dataset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/data/loader.py", line 174, in get_dataset
[rank3]:     all_datasets.append(load_single_dataset(dataset_attr, model_args, data_args, training_args))
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/data/loader.py", line 109, in load_single_dataset
[rank3]:     dataset = load_dataset(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/load.py", line 2616, in load_dataset
[rank3]:     builder_instance.download_and_prepare(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1029, in download_and_prepare
[rank3]:     self._download_and_prepare(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1124, in _download_and_prepare
[rank3]:     self._prepare_split(split_generator, **prepare_split_kwargs)
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1884, in _prepare_split
[rank3]:     for job_id, done, content in self._prepare_split_single(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 2040, in _prepare_split_single
[rank3]:     raise DatasetGenerationError("An error occurred while generating the dataset") from e...
Reproduction
Dataset info
"openai_track4_0702": {
  "file_name": "/data/zhaopengfeng/LLaMA-Factory/data/kddcup/openai_track4_0702.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
    "system_tag": "system"
  }
}
Data example
[
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful online shopping assistant.Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
      },
      {
        "role": "user",
        "content": "Instructions: Explain the product category name\nInput: iPhone\n"
      },
      {
        "role": "assistant",
        "content": "An iPhone is a smartphone designed and manufactured by Apple Inc. that combines mobile phone capabilities with features such as internet access, multimedia playback, and a touch screen interface."
      }
    ]
  },
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful online shopping assistant.Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
      },
      {
        "role": "user",
        "content": "Input: Explain the product type Serving Bowl\n"
      },
      {
        "role": "assistant",
        "content": "A serving bowl is a large bowl used for presenting and serving food, such as salads, vegetables, or side dishes, at the dining table.\n"
      },
      {
        "role": "user",
        "content": "Input: Explain the product type Apron\n"
      },
      {
        "role": "assistant",
        "content": "An apron is a protective garment worn over clothing to shield the wearer from spills, stains, and other kitchen or work-related messes. Aprons are commonly used in cooking, baking, and various crafts or occupations."
      }
    ]
  },...
]
train script
### model
model_name_or_path: /data/zhaopengfeng/models/glm-4-9b-chat
### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z2_config.json
### dataset
dataset: openai_track4_0702
template: glm4
cutoff_len: 1024
max_samples: 5000
overwrite_cache: true
preprocessing_num_workers: 16
...
My environment
transformers 4.42.3
triton 2.3.0
llamafactory 0.8.3.dev0 /data/zhaopengfeng/LLaMA-Factory
...
CUDA Driver 12.5
BTW, the alpaca template works well.
Expected behavior
lora sft
Others
No response
I also encountered this problem. Have you solved it?
I am also facing a similar issue. My guess is that it is caused by a dictionary in the labels. Has anyone tried to solve this?
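The pyarrow message in the log ('cannot mix list and non-list, non-null values' for column messages) suggests that at least one record holds something other than a list under `messages`. Below is a minimal sketch for locating such records, assuming the file is a JSON array of objects as in the data example above. The helper names `find_non_list_messages` and `check_file` are hypothetical and not part of LLaMA-Factory or datasets.

```python
import json


def find_non_list_messages(records, column="messages"):
    """Return (index, type name) for records whose `column` is neither a list nor None."""
    bad = []
    for i, rec in enumerate(records):
        value = rec.get(column)
        if value is not None and not isinstance(value, list):
            bad.append((i, type(value).__name__))
    return bad


def check_file(path, column="messages"):
    """Load a JSON array from `path` (assumed layout) and report offending records."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return find_non_list_messages(records, column)
```

Calling `check_file` on the dataset path should return an empty list if every `messages` value is a list; any entries it returns point at records worth inspecting by index.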
Same issue. Has anyone solved it, please?
I had the same problem. I noticed that some of the labels were not strings. After converting all the labels to strings, it worked for me.
Make sure all your inputs are strings.
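The string conversion suggested above could be sketched as follows. This assumes the sharegpt layout from the data example (a `messages` list of dicts with a `content` key). The function name `stringify_contents` is hypothetical, and treating `None` as an empty string is an arbitrary choice for illustration.

```python
import json


def stringify_contents(records, column="messages", content_key="content"):
    """Convert non-string `content` values to strings in place; return how many changed."""
    changed = 0
    for rec in records:
        for turn in rec.get(column) or []:
            value = turn.get(content_key)
            if isinstance(value, str):
                continue
            if value is None:
                turn[content_key] = ""  # assumption: a missing content becomes an empty string
            elif isinstance(value, (dict, list)):
                # nested structures are serialized as JSON text
                turn[content_key] = json.dumps(value, ensure_ascii=False)
            else:
                # numbers and booleans fall back to their plain string form
                turn[content_key] = str(value)
            changed += 1
    return changed
```

After running it, the records can be written back with `json.dump` and the dataset reloaded. A return value of 0 means every `content` was already a string, in which case the cause is likely elsewhere.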
I have the same issue with a custom dataset. Everything is a string, but some of the 'content' values are stringified JSON. Could that be the issue?
I have the same issue with a custom dataset. I am sure that everything is a string, but I still get the same error.
I hit the same issue, and it turned out that I had not added question_id to the data.
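If a missing field is the culprit, as in the question_id case above, comparing the key sets of the records may help find the odd ones out. This is only a sketch under the assumption that the file is a JSON array of objects. `find_key_mismatches` is a hypothetical helper, and it treats the most common key set as the expected one.

```python
from collections import Counter


def find_key_mismatches(records):
    """Return (index, missing keys, extra keys) for records that differ from the most common key set."""
    if not records:
        return []
    key_sets = [frozenset(rec) for rec in records]
    expected, _ = Counter(key_sets).most_common(1)[0]
    return [
        (i, sorted(expected - keys), sorted(keys - expected))
        for i, keys in enumerate(key_sets)
        if keys != expected
    ]
```

An empty result means all records share the same top-level keys. Otherwise each tuple names the record index and which keys it lacks or adds.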