Preprocess failure for llama3 instruct prompt
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
python -m axolotl.cli.preprocess test.yaml --debug
should succeed, producing output like the following (from a different dataset, run before #1553 was merged):
...
[2024-04-30 04:24:24,115] [DEBUG] [axolotl.normalize_config:79] [PID:67864] [RANK:0] bf16 support detected, enabling for this configuration.
[2024-04-30 04:24:24,560] [INFO] [axolotl.normalize_config:182] [PID:67864] [RANK:0] GPU memory usage baseline: 0.000GB (+0.609GB misc)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-04-30 04:24:25,611] [DEBUG] [axolotl.load_tokenizer:279] [PID:67864] [RANK:0] EOS: 128001 / <|end_of_text|>
[2024-04-30 04:24:25,611] [DEBUG] [axolotl.load_tokenizer:280] [PID:67864] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-04-30 04:24:25,611] [DEBUG] [axolotl.load_tokenizer:281] [PID:67864] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-04-30 04:24:25,612] [DEBUG] [axolotl.load_tokenizer:282] [PID:67864] [RANK:0] UNK: None / None
[2024-04-30 04:24:25,612] [INFO] [axolotl.load_tokenizer:293] [PID:67864] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-04-30 04:24:25,612] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:67864] [RANK:0] Unable to find prepared dataset in /home/llm/data/last_run_prepared/d453405904283e23b947ef88b2e1e328
[2024-04-30 04:24:25,612] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:67864] [RANK:0] Loading raw datasets...
[2024-04-30 04:24:26,497] [INFO] [axolotl.load_tokenized_prepared_datasets:410] [PID:67864] [RANK:0] merging datasets
[2024-04-30 04:24:26,503] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] min_input_len: 134
[2024-04-30 04:24:26,503] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] max_input_len: 134
Dropping Long Sequences (num_proc=96): 100%|██████████████████████████████████████| 100/100 [00:00<00:00, 182.17 examples/s]
Add position_id column (Sample Packing) (num_proc=96): 100%|█████████████████████████████████| 100/100 [00:00<00:00, 155.16 examples/s]
[2024-04-30 04:24:29,969] [INFO] [axolotl.load_tokenized_prepared_datasets:423] [PID:67864] [RANK:0] Saving merged prepared dataset to disk... /home/llm/data/last_run_prepared/d453405904283e23b947ef88b2e1e328
Saving the dataset (1/1 shards): 100%|███████████████████████████████████████| 100/100 [00:00<00:00, 3510.52 examples/s]
[2024-04-30 04:24:30,026] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] total_num_tokens: 12_730
[2024-04-30 04:24:30,028] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] total_supervised_tokens: 380
...
Current behaviour
It fails with the error messages below:
...
[2024-05-14 05:16:58,815] [WARNING] [axolotl.utils.config.models.input.hint_lora_8bit:924] [PID:40540] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-05-14 05:16:58,815] [DEBUG] [axolotl.normalize_config:79] [PID:40540] [RANK:0] bf16 support detected, enabling for this configuration.
[2024-05-14 05:16:59,230] [INFO] [axolotl.normalize_config:182] [PID:40540] [RANK:0] GPU memory usage baseline: 0.000GB (+0.047GB misc)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:280] [PID:40540] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:281] [PID:40540] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:282] [PID:40540] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:283] [PID:40540] [RANK:0] UNK: None / None
[2024-05-14 05:17:00,142] [INFO] [axolotl.load_tokenizer:294] [PID:40540] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-14 05:17:00,143] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:40540] [RANK:0] Unable to find prepared dataset in /home/work/.test/data/last_run_prepared/d58d05328886fb932ef4b2db9de5724d
[2024-05-14 05:17:00,143] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:40540] [RANK:0] Loading raw datasets...
[2024-05-14 05:17:00,792] [ERROR] [axolotl.get_dataset_wrapper:674] [PID:40540] [RANK:0] unhandled prompt tokenization strategy: sharegpt.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/work/.test/axolotl/src/axolotl/cli/preprocess.py", line 82, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/work/.test/axolotl/src/axolotl/cli/preprocess.py", line 72, in do_cli
    load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
  File "/home/work/.test/axolotl/src/axolotl/cli/__init__.py", line 403, in load_datasets
    train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 66, in prepare_dataset
    train_dataset, eval_dataset, prompters = load_prepare_datasets(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 460, in load_prepare_datasets
    dataset, prompters = load_tokenized_prepared_datasets(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 399, in load_tokenized_prepared_datasets
    dataset_wrapper, dataset_prompter = get_dataset_wrapper(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 677, in get_dataset_wrapper
    raise ValueError(
ValueError: unhandled prompt tokenization strategy: sharegpt
Interestingly, the training command below runs fine without any errors.
accelerate launch -m axolotl.cli.train configs/test.yaml
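Since the preprocess entry point hits this error while the train entry point does not, one thing worth ruling out is a stale axolotl install shadowing the repo checkout. A minimal check (plain importlib, no axolotl-specific API beyond the module names already visible in the traceback):

```python
# Print which files Python resolves for the axolotl entry points, to rule out a
# stale site-packages install shadowing the checkout at /home/work/.test/axolotl/src.
import importlib.util

for name in ("axolotl", "axolotl.cli.preprocess", "axolotl.cli.train"):
    spec = importlib.util.find_spec(name)
    print(name, "->", spec.origin if spec else "not found")
```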
Steps to reproduce
I don't think the data itself matters, but I've attached example data below (one line from val.jsonl):
{"conversations": [{"from": "system", "value": "You are a helpful AI assistant."}, {"from": "user", "value": "Question: A junior orthopaedic surgery resident is completing a carpal tunnel repair with the department chairman as the attending physician. During the case, the resident inadvertently cuts a flexor tendon. The tendon is repaired without complication. The attending tells the resident that the patient will do fine, and there is no need to report this minor complication that will not harm the patient, as he does not want to make the patient worry unnecessarily. He tells the resident to leave this complication out of the operative report. Which of the following is the correct next action for the resident to take?\nA. Disclose the error to the patient and put it in the operative report\nB. Tell the attending that he cannot fail to disclose this mistake\nC. Report the physician to the ethics committee\nD. Refuse to dictate the operative report\n"}, {"from": "assistant", "value": "Answer: B. Tell the attending that he cannot fail to disclose this mistake"}]}
Run `python -m axolotl.cli.preprocess test.yaml --debug`.
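For reference, a minimal sketch (standard-library json only; the path is the one from the config below) that checks every line of the JSONL is a ShareGPT-style conversation, i.e. a "conversations" list of turns with "from" and "value" keys, matching the example line above:

```python
# Sanity-check the JSONL: each line should be a JSON object with a "conversations"
# list whose turns carry "from" and "value" keys.
import json

path = "/home/work/.test/data/pubmed/val.jsonl"  # path taken from the config below

with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        turns = json.loads(line)["conversations"]
        assert isinstance(turns, list) and turns, f"line {lineno}: empty conversation"
        for turn in turns:
            assert "from" in turn and "value" in turn, f"line {lineno}: missing keys"

print("all lines look like ShareGPT-style conversations")
```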
Config yaml
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
data_seed: 49
seed: 49
datasets:
  - path: /home/work/.test/data/pubmed/train.jsonl
    type: sharegpt
    conversation: llama3
    train_on_split: train
  - path: /home/work/.test/data/pubmed/val.jsonl
    type: sharegpt
    conversation: llama3
    train_on_split: validation
dataset_prepared_path: /home/work/.test/data/last_run_prepared
output_dir: ./out
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
gradient_accumulation_steps: 8
micro_batch_size: 4
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-5
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
logging_steps:
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>
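As a quick sanity check that the YAML above parses the way it reads (a minimal sketch assuming PyYAML is available):

```python
# Load the config used in the repro command and print each dataset entry, to confirm
# the datasets block parses into dicts with "path", "type", and "conversation" keys.
import yaml

with open("test.yaml", encoding="utf-8") as f:  # config file from the repro command
    cfg = yaml.safe_load(f)

for entry in cfg["datasets"]:
    print(entry["path"], entry["type"], entry.get("conversation"))
```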
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.10.12
axolotl branch-commit
main/2147cf68 Llama3 dpo (#1610)
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.