
Preprocess failure for llama3 instruct prompt

Open · ryj0902 opened this issue on May 14, 2024 · 0 comments

Please check that this issue hasn't been reported before.

  • [X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Running python -m axolotl.cli.preprocess test.yaml --debug should succeed, as in the log below (different dataset, executed before #1553 was merged):

...
[2024-04-30 04:24:24,115] [DEBUG] [axolotl.normalize_config:79] [PID:67864] [RANK:0] bf16 support detected, enabling for this configuration.                                                                            
[2024-04-30 04:24:24,560] [INFO] [axolotl.normalize_config:182] [PID:67864] [RANK:0] GPU memory usage baseline: 0.000GB (+0.609GB misc)                                                                                 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.                                                                                                   
[2024-04-30 04:24:25,611] [DEBUG] [axolotl.load_tokenizer:279] [PID:67864] [RANK:0] EOS: 128001 / <|end_of_text|>                                                                                                       
[2024-04-30 04:24:25,611] [DEBUG] [axolotl.load_tokenizer:280] [PID:67864] [RANK:0] BOS: 128000 / <|begin_of_text|>                                                                                                     
[2024-04-30 04:24:25,611] [DEBUG] [axolotl.load_tokenizer:281] [PID:67864] [RANK:0] PAD: 128001 / <|end_of_text|>                                                                                                       
[2024-04-30 04:24:25,612] [DEBUG] [axolotl.load_tokenizer:282] [PID:67864] [RANK:0] UNK: None / None                                                                                                                    
[2024-04-30 04:24:25,612] [INFO] [axolotl.load_tokenizer:293] [PID:67864] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.                                                     
[2024-04-30 04:24:25,612] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:67864] [RANK:0] Unable to find prepared dataset in /home/llm/data/last_run_prepared/d453405904283e23b947ef88b2e1e328               
[2024-04-30 04:24:25,612] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:67864] [RANK:0] Loading raw datasets...                                                                                            
[2024-04-30 04:24:26,497] [INFO] [axolotl.load_tokenized_prepared_datasets:410] [PID:67864] [RANK:0] merging datasets                                                                                                   
[2024-04-30 04:24:26,503] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] min_input_len: 134                                                                                                                              
[2024-04-30 04:24:26,503] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] max_input_len: 134                                                                                                                              
Dropping Long Sequences (num_proc=96): 100%|██████████████████████████████████████| 100/100 [00:00<00:00, 182.17 examples/s]
Add position_id column (Sample Packing) (num_proc=96): 100%|█████████████████████████████████| 100/100 [00:00<00:00, 155.16 examples/s]
[2024-04-30 04:24:29,969] [INFO] [axolotl.load_tokenized_prepared_datasets:423] [PID:67864] [RANK:0] Saving merged prepared dataset to disk... /home/llm/data/last_run_prepared/d453405904283e23b947ef88b2e1e328        
Saving the dataset (1/1 shards): 100%|███████████████████████████████████████| 100/100 [00:00<00:00, 3510.52 examples/s]
[2024-04-30 04:24:30,026] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] total_num_tokens: 12_730                                                                                                                        
[2024-04-30 04:24:30,028] [DEBUG] [axolotl.log:61] [PID:67864] [RANK:0] total_supervised_tokens: 380
...

Current behaviour

The same command now fails with the error messages below:

...
[2024-05-14 05:16:58,815] [WARNING] [axolotl.utils.config.models.input.hint_lora_8bit:924] [PID:40540] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-05-14 05:16:58,815] [DEBUG] [axolotl.normalize_config:79] [PID:40540] [RANK:0] bf16 support detected, enabling for this configuration.
[2024-05-14 05:16:59,230] [INFO] [axolotl.normalize_config:182] [PID:40540] [RANK:0] GPU memory usage baseline: 0.000GB (+0.047GB misc)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:280] [PID:40540] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:281] [PID:40540] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:282] [PID:40540] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-05-14 05:17:00,142] [DEBUG] [axolotl.load_tokenizer:283] [PID:40540] [RANK:0] UNK: None / None
[2024-05-14 05:17:00,142] [INFO] [axolotl.load_tokenizer:294] [PID:40540] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-14 05:17:00,143] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:40540] [RANK:0] Unable to find prepared dataset in /home/work/.test/data/last_run_prepared/d58d05328886fb932ef4b2db9de5724d
[2024-05-14 05:17:00,143] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:40540] [RANK:0] Loading raw datasets...
[2024-05-14 05:17:00,792] [ERROR] [axolotl.get_dataset_wrapper:674] [PID:40540] [RANK:0] unhandled prompt tokenization strategy: sharegpt. 
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/work/.test/axolotl/src/axolotl/cli/preprocess.py", line 82, in <module>
    fire.Fire(do_cli)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/work/.test/axolotl/src/axolotl/cli/preprocess.py", line 72, in do_cli
    load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
  File "/home/work/.test/axolotl/src/axolotl/cli/__init__.py", line 403, in load_datasets
    train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 66, in prepare_dataset
    train_dataset, eval_dataset, prompters = load_prepare_datasets(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 460, in load_prepare_datasets
    dataset, prompters = load_tokenized_prepared_datasets(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 399, in load_tokenized_prepared_datasets
    dataset_wrapper, dataset_prompter = get_dataset_wrapper(
  File "/home/work/.test/axolotl/src/axolotl/utils/data/sft.py", line 677, in get_dataset_wrapper
    raise ValueError(
ValueError: unhandled prompt tokenization strategy: sharegpt

Interestingly, the training command accelerate launch -m axolotl.cli.train configs/test.yaml runs fine without any errors.

Steps to reproduce

I don't think the data itself matters, but I've attached example data below (an excerpt of val.jsonl):

{"conversations": [{"from": "system", "value": "You are a helpful AI assistant."}, {"from": "user", "value": "Question: A junior orthopaedic surgery resident is completing a carpal tunnel repair with the department chairman as the attending physician. During the case, the resident inadvertently cuts a flexor tendon. The tendon is repaired without complication. The attending tells the resident that the patient will do fine, and there is no need to report this minor complication that will not harm the patient, as he does not want to make the patient worry unnecessarily. He tells the resident to leave this complication out of the operative report. Which of the following is the correct next action for the resident to take?\nA. Disclose the error to the patient and put it in the operative report\nB. Tell the attending that he cannot fail to disclose this mistake\nC. Report the physician to the ethics committee\nD. Refuse to dictate the operative report\n"}, {"from": "assistant", "value": "Answer: B. Tell the attending that he cannot fail to disclose this mistake"}]}

Run python -m axolotl.cli.preprocess test.yaml --debug
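
For convenience, a minimal sketch that materializes the example record into the train.jsonl/val.jsonl files referenced by the config below (the script name and row count are arbitrary; the paths come from the config and should be adjusted to your environment):

# write_example_dataset.py - hypothetical helper, not part of axolotl.
# Writes the ShareGPT-style example record above into the train/val JSONL
# files referenced by the config.
import json
from pathlib import Path

record = {
    "conversations": [
        {"from": "system", "value": "You are a helpful AI assistant."},
        # Question/answer text shortened here; use the sample above verbatim.
        {"from": "user", "value": "Question: ..."},
        {"from": "assistant", "value": "Answer: B. ..."},
    ]
}

data_dir = Path("/home/work/.test/data/pubmed")
data_dir.mkdir(parents=True, exist_ok=True)

# Duplicate the record so each split has a handful of rows to tokenize.
for name in ("train.jsonl", "val.jsonl"):
    with open(data_dir / name, "w") as f:
        for _ in range(100):
            f.write(json.dumps(record) + "\n")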

Config yaml

base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

data_seed: 49
seed: 49

datasets:
  - path: /home/work/.test/data/pubmed/train.jsonl
    type: sharegpt
    conversation: llama3
    train_on_split: train

  - path: /home/work/.test/data/pubmed/val.jsonl
    type: sharegpt
    conversation: llama3
    train_on_split: validation
dataset_prepared_path: /home/work/.test/data/last_run_prepared
output_dir: ./out

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

gradient_accumulation_steps: 8
micro_batch_size: 4
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
logging_steps:
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>

Possible solution

No response

Which Operating Systems are you using?

  • [X] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.10.12

axolotl branch-commit

main/2147cf68 Llama3 dpo (#1610)

Acknowledgements

  • [X] My issue title is concise, descriptive, and in title casing.
  • [X] I have searched the existing issues to make sure this bug has not been reported yet.
  • [X] I am using the latest version of axolotl.
  • [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
