
TypeError while converting to streaming format

Open nikhilranjan7 opened this issue 1 year ago • 2 comments

I am trying to convert the RedPajama GitHub dataset to streaming format but am getting the error below.

To replicate:

python llm-foundry/scripts/data_prep/convert_dataset_json.py \
  --path github/split1 \
  --out_root github/split1 --split train \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
  --compression zstd > split1.txt

Downloading data files: 100%|██████████████████████| 1/1 [00:00<00:00, 3744.91it/s]
Extracting data files: 100%|█████████████████████████| 1/1 [00:00<00:00,  2.17it/s]
Traceback (most recent call last):                                 
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/builder.py", line 1874, in _prepare_split_single
    writer.write_table(table)
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 568, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 2312, in table_cast
    return cast_table_to_schema(table, schema)
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 2271, in cast_table_to_schema
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 2271, in <listcomp>
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 1837, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 1837, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/table.py", line 2132, in cast_array_to_feature
    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
struct<content_hash: string, timestamp: string, source: string, line_count: int64, max_line_length: int64, avg_line_length: double, alnum_prop: double, repo_name: string, id: string, size: string, binary: bool, copies: string, ref: string, path: string, mode: string, license: string, language: list<item: struct<name: string, bytes: string>>, symlink_target: string>
to
{'content_hash': Value(dtype='string', id=None), 'timestamp': Value(dtype='string', id=None), 'source': Value(dtype='string', id=None), 'line_count': Value(dtype='int64', id=None), 'max_line_length': Value(dtype='int64', id=None), 'avg_line_length': Value(dtype='float64', id=None), 'alnum_prop': Value(dtype='float64', id=None), 'repo_name': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None), 'size': Value(dtype='string', id=None), 'binary': Value(dtype='bool', id=None), 'copies': Value(dtype='string', id=None), 'ref': Value(dtype='string', id=None), 'path': Value(dtype='string', id=None), 'mode': Value(dtype='string', id=None), 'license': Value(dtype='string', id=None), 'language': [{'name': Value(dtype='string', id=None), 'bytes': Value(dtype='string', id=None)}]}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/scripts/data_prep/convert_dataset_json.py", line 235, in <module>
    main(parse_args())
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/scripts/data_prep/convert_dataset_json.py", line 210, in main
    dataset = build_hf_dataset(path=args.path,
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/scripts/data_prep/convert_dataset_json.py", line 103, in build_hf_dataset
    hf_dataset = hf_datasets.load_dataset('json',
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/builder.py", line 967, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/builder.py", line 1749, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/lustre/home/nikhil.ranjan/mbzuai-llm/llm-foundry/llmfoundry-venv/lib/python3.10/site-packages/datasets/builder.py", line 1892, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

nikhilranjan7 avatar Jun 03 '23 05:06 nikhilranjan7

hey @nikhilranjan7 do you have the same issue if you try converting a single .jsonl file from RedPajama? I'm wondering if this is a result of the pipe you are doing.

codestar12 avatar Jun 06 '23 19:06 codestar12

@nikhilranjan7 I think I've figured out the issue. It seems to be a limitation of Hugging Face `datasets`. When I tested multiple .jsonl files using RedPajama's arXiv files, it worked perfectly. The problem is that the JSON objects in the GitHub files have keys and value types that differ between examples. When Hugging Face tries to load multiple JSON files as a single dataset, it throws an error if the keys/values don't match exactly. I'm going to see if I can find a way around this and expose a flag on the script to ignore the differences.

codestar12 avatar Jun 07 '23 23:06 codestar12

I'm going to close this issue for the moment, as this is a problem with Hugging Face `datasets` rather than with llm-foundry. The alternative would be to rewrite our JSON script to build a dataset for each JSON file and then merge them. We may do this in the future, but it isn't in our current plans. If you find yourself stuck, that is a path to unblock yourself.
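If you do need to unblock yourself before such a flag exists, one pre-processing approach (a rough stdlib-only sketch; the function name and the string-coercion choice are mine, not the project's) is to rewrite the files into a single .jsonl with one consistent schema: fill missing keys with null and serialize the values of any type-conflicting key to a string, so `load_dataset('json', ...)` sees matching features everywhere:

```python
import json

def merge_jsonl(paths, out_path):
    """Merge several .jsonl files whose schemas disagree: collect the union
    of keys and the set of value types per key, then rewrite every record
    with missing keys filled as None and values under type-conflicting keys
    JSON-serialized to strings, yielding one consistent schema."""
    key_types = {}
    records_per_file = []
    for p in paths:
        recs = []
        with open(p) as f:
            for line in f:
                if line.strip():
                    rec = json.loads(line)
                    recs.append(rec)
                    for k, v in rec.items():
                        key_types.setdefault(k, set()).add(type(v).__name__)
        records_per_file.append(recs)
    all_keys = sorted(key_types)
    conflicting = {k for k, t in key_types.items() if len(t) > 1}
    with open(out_path, "w") as out:
        for recs in records_per_file:
            for rec in recs:
                norm = {}
                for k in all_keys:
                    v = rec.get(k)
                    if k in conflicting and v is not None:
                        v = json.dumps(v)  # force a single type (string)
                    norm[k] = v
                out.write(json.dumps(norm) + "\n")
```

The coerced fields come back as strings, so any downstream code that reads them (e.g. `size`) would need to parse them again; for the streaming conversion, which only tokenizes the `text` field, that is usually harmless.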

codestar12 avatar Jun 16 '23 14:06 codestar12