llm-foundry
TypeError while converting to streaming format
I am trying to convert the RedPajama GitHub dataset to the streaming format but am getting the error below.
To replicate:
python llm-foundry/scripts/data_prep/convert_dataset_json.py \
    --path github/split1 \
    --out_root github/split1 --split train \
    --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
    --compression zstd > split1.txt
Downloading data files: 100%|██████████████████████| 1/1 [00:00<00:00, 3744.91it/s]
Extracting data files: 100%|█████████████████████████| 1/1 [00:00<00:00, 2.17it/s]
Traceback (most recent call last):
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/builder.py", line 1874, in _prepare_split_single
writer.write_table(table)
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 568, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 2312, in table_cast
return cast_table_to_schema(table, schema)
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 2271, in cast_table_to_schema
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 2271, in <listcomp>
arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 1837, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 1837, in <listcomp>
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 2132, in cast_array_to_feature
raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
struct<content_hash: string, timestamp: string, source: string, line_count: int64, max_line_length: int64, avg_line_length: double, alnum_prop: double, repo_name: string, id: string, size: string, binary: bool, copies: string, ref: string, path: string, mode: string, license: string, language: list<item: struct<name: string, bytes: string>>, symlink_target: string>
to
{'content_hash': Value(dtype='string', id=None), 'timestamp': Value(dtype='string', id=None), 'source': Value(dtype='string', id=None), 'line_count': Value(dtype='int64', id=None), 'max_line_length': Value(dtype='int64', id=None), 'avg_line_length': Value(dtype='float64', id=None), 'alnum_prop': Value(dtype='float64', id=None), 'repo_name': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None), 'size': Value(dtype='string', id=None), 'binary': Value(dtype='bool', id=None), 'copies': Value(dtype='string', id=None), 'ref': Value(dtype='string', id=None), 'path': Value(dtype='string', id=None), 'mode': Value(dtype='string', id=None), 'license': Value(dtype='string', id=None), 'language': [{'name': Value(dtype='string', id=None), 'bytes': Value(dtype='string', id=None)}]}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/scripts/data_prep/convert_dataset_json.py", line 235, in <module>
main(parse_args())
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/scripts/data_prep/convert_dataset_json.py", line 210, in main
dataset = build_hf_dataset(path=args.path,
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/scripts/data_prep/convert_dataset_json.py", line 103, in build_hf_dataset
hf_dataset = hf_datasets.load_dataset('json',
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/load.py", line 1782, in load_dataset
builder_instance.download_and_prepare(
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/builder.py", line 872, in download_and_prepare
self._download_and_prepare(
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/builder.py", line 967, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/builder.py", line 1749, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/builder.py", line 1892, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
Hey @nikhilranjan7, do you have the same issue if you try converting a single .jsonl file from RedPajama? I'm wondering if this is a result of the `> split1.txt` redirection you are doing.
@nikhilranjan7 I think I have figured out the issue here. It appears to be a limitation of Hugging Face datasets. When I tested multiple .jsonl files from RedPajama's arXiv subset, the conversion worked perfectly. The problem is that the JSON objects in the GitHub files don't all share the same set of keys; in the traceback above, for example, one shard's metadata struct has a `symlink_target` field that the inferred schema lacks. When Hugging Face loads multiple JSON files as a single dataset, it raises a TypeError if the inferred schemas don't match exactly. I'm going to see whether I can find a way around this and expose a flag on the script to ignore the differences.
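If it helps to confirm this on your end, here is a minimal sketch (the shard file names are placeholders, not actual RedPajama file names) that loads two GitHub shards separately and prints the schemas Hugging Face infers for each:

# Hypothetical sketch, not part of llm-foundry: compare the features that
# Hugging Face infers for two RedPajama GitHub shards loaded separately.
# The file names below are placeholders.
import datasets as hf_datasets

ds_a = hf_datasets.load_dataset('json', data_files='github/split1/shard_a.jsonl', split='train')
ds_b = hf_datasets.load_dataset('json', data_files='github/split1/shard_b.jsonl', split='train')

# If one shard's metadata struct carries a key the other lacks (e.g. the
# symlink_target field in the traceback above), the two feature dicts will
# differ, which is what makes the combined load fail its schema cast.
print(ds_a.features)
print(ds_b.features)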
I'm going to close this issue for the moment since it is a problem with Hugging Face datasets rather than llm-foundry. The alternative would be to rewrite our JSON script to build a dataset for each JSON file and then merge them. We may do this in the future, but it isn't in our current plans. If you find yourself stuck, that is a path to unblock yourself.
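For anyone who wants to go that route today, here is a hedged sketch of the per-file approach, using placeholder paths and assuming the downstream conversion only needs the 'text' column (that assumption is mine, not something llm-foundry guarantees):

# Hypothetical workaround sketch, not part of llm-foundry: build one dataset
# per .jsonl file, drop the uneven metadata columns, and concatenate.
# The glob pattern and the "only 'text' is needed" assumption are placeholders.
import glob
import datasets as hf_datasets

parts = []
for path in sorted(glob.glob('github/split1/*.jsonl')):
    ds = hf_datasets.load_dataset('json', data_files=path, split='train')
    # Keeping only 'text' sidesteps the schema mismatch entirely, since the
    # per-example metadata structs are what differ between shards.
    ds = ds.remove_columns([c for c in ds.column_names if c != 'text'])
    parts.append(ds)

merged = hf_datasets.concatenate_datasets(parts)
print(merged)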