Issues tokenizing dataset
Thanks for open-sourcing this!
I am trying to follow the instructions for tokenizing the data, but it fails with the stack trace below. I'm just using two lines of dummy data. Any ideas where this issue is coming from? Thanks!
python -m ochat.data.generate_dataset --model-type "openchat_v3.2_mistral" --model-path "imone/Mistral_7B_with_EOT_token" --in-files data.jsonl --out-prefix pretok.tok
...
...
...
(convert_conversation_batch pid=13365) Chunk finish
(convert_conversation_batch pid=13205) Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [repeated 28x across cluster]
Traceback (most recent call last):
File "/opt/conda/envs/ptca/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/ptca/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/tmp/och/openchat/ochat/data/generate_dataset.py", line 167, in <module>
generate_dataset(**vars(args))
File "/tmp/och/openchat/ochat/data/generate_dataset.py", line 149, in generate_dataset
generate_split(model_type, model_path, train_conversations, "train", out_prefix, per_sequence_loss)
File "/tmp/och/openchat/ochat/data/generate_dataset.py", line 131, in generate_split
parquet.write_table(pyarrow.concat_tables([ray.get(handle) for handle in handles]), f"{out_prefix}.{split_name}.parquet")
File "/tmp/och/openchat/ochat/data/generate_dataset.py", line 131, in <listcomp>
parquet.write_table(pyarrow.concat_tables([ray.get(handle) for handle in handles]), f"{out_prefix}.{split_name}.parquet")
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/ray/_private/worker.py", line 2563, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::convert_conversation_batch() (pid=13368, ip=10.4.66.23)
File "/tmp/och/openchat/ochat/data/generate_dataset.py", line 78, in convert_conversation_batch
tokens_list, weights_list = conv_template.tokenize_conversations(batch, inference=False, seq_level_weight=per_sequence_loss)
File "/tmp/och/openchat/ochat/config/conversation_template.py", line 61, in tokenize_conversations
sys_mappings = dict(zip(sys_mappings, self._tokenize(sys_mappings)))
File "/tmp/och/openchat/ochat/config/conversation_template.py", line 42, in _tokenize
return self.tokenizer(strings, split_special_tokens=ignore_special, return_attention_mask=False, add_special_tokens=False).input_ids
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2798, in __call__
encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2884, in _call_one
return self.batch_encode_plus(
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3075, in batch_encode_plus
return self._batch_encode_plus(
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 807, in _batch_encode_plus
batch_outputs = self._batch_prepare_for_model(
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 879, in _batch_prepare_for_model
batch_outputs = self.pad(
File "/opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3214, in pad
raise ValueError(
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided []
This is likely because the provided dataset contains only a few samples. Setting the number of splits to 1 solved the issue.
https://github.com/imoneoi/openchat/blob/master/ochat/data/generate_dataset.py#L128
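For anyone hitting this with tiny test files, here is a minimal sketch of the failure mode (the variable names below are hypothetical, not copied from generate_dataset.py): the conversations are split into one batch per Ray worker, so with only two samples most batches come out empty, and tokenizing an empty batch means calling the tokenizer with an empty list, which raises the ValueError above.

```python
# Hypothetical reproduction of the failure mode; the real split logic lives
# around generate_dataset.py#L128 and may use different names.
import numpy as np

conversations = ["conv_a", "conv_b"]  # e.g. a two-line dummy data.jsonl
num_splits = 8                        # e.g. one batch per Ray worker / CPU core

batches = np.array_split(conversations, num_splits)
print([len(b) for b in batches])      # [1, 1, 0, 0, 0, 0, 0, 0] -> empty batches

# Possible workaround: never create more splits than there are samples.
num_splits = max(1, min(num_splits, len(conversations)))
batches = np.array_split(conversations, num_splits)
print([len(b) for b in batches])      # [1, 1]
```

Clamping the split count to the number of samples, as sketched above, would be a slightly more general fix than hard-coding it to 1.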
@tokenatlas did you resolve this issue or find a workaround? I am facing the same issue.