Finetune MPT models with local dataset
Hello Team,
Can you please provide guidance on how to finetune on local datasets? The instructions given in scripts/train are not very clear. The YAML below was given as a sample example:
train_loader:
  name: finetuning
  dataset:
    hf_name: my-local-dataset
    hf_kwargs:
      data_files:
        train: /path/to/train.jsonl
    preprocessing_fn: my.import.path:my_preprocessing_fn
    split: train
So do we need to create a loading script to load the local dataset using HuggingFace datasets, or is there a way to use JSONL file paths directly instead of converting them into a HuggingFace dataset? And in that case, what changes do I need to make in the YAML file?
I tried doing the same, and I do agree the instructions are not clear at all.
Sorry for the lack of clarity here, we'll update the docs.
Going off of @arpitkk 's example, you would want to use train_loader.dataset.hf_name: json and keep the rest the same.
Just to clarify, that config will eventually influence the behavior of this dataset-building method here: https://github.com/mosaicml/llm-foundry/blob/6c16a6e9b31abe179dda7d71e9601f866721d6fa/llmfoundry/data/finetuning/tasks.py#L223
So, by setting your config to
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_files:
        train: /path/to/train.jsonl
    preprocessing_fn: my.import.path:my_preprocessing_fn
    split: train
The build_from_hf method will effectively execute the following code:
from typing import Dict

import datasets
from llmfoundry.data.finetuning.tasks import _tokenize_formatted_example
from my.import.path import my_preprocessing_fn

# `tokenizer` here is the tokenizer built from the `tokenizer` section of the YAML
dataset = datasets.load_dataset('json', split='train', data_files={'train': '/path/to/train.jsonl'})

def dataset_mapper(example: Dict):
    # preprocess the raw example into prompt/response format, then tokenize it
    example = my_preprocessing_fn(example)
    return _tokenize_formatted_example(example, tokenizer)

columns_to_remove = list(dataset[0].keys())
tokenized_dataset = dataset.map(
    dataset_mapper,
    batched=False,
    remove_columns=columns_to_remove,
)
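For completeness, here is a minimal sketch of what a preprocessing function could look like. The instruction/output field names are just placeholders for whatever keys your JSONL actually uses; the important part is returning prompt and response keys (check _tokenize_formatted_example in tasks.py for the exact contract):

from typing import Dict

def my_preprocessing_fn(example: Dict) -> Dict[str, str]:
    # Map one raw JSONL row into the prompt/response format the finetuning
    # collator expects. 'instruction' and 'output' are placeholder field names;
    # swap in whatever your local dataset actually uses.
    return {
        'prompt': example['instruction'],
        'response': example['output'],
    }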
Does that help to clear things up? If so, I can work a similar explanation into the README. If not, please let me know what still feels unclear!
There might be an issue then. Here is the config I am using:
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_files:
        train: /mnt/training/mylocaldataset/train.jsonl
    preprocessing_fn: mylocaldataset.utils:prep_fn
    split: train
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
I am running with: composer llm-foundry/scripts/train/train.py mpt.yml save_folder=mpt-tuned at the path /mnt/training.
wc -l /mnt/training/mylocaldataset/train.jsonl: 1212821
I will debug this today and let you know if I make some progress :)
Issue found!
It seems there are a couple of bugs in the HuggingFace datasets library. First, due to a regex problem, the loader mixes up the file train.jsonl and the split train.
The dataset file should not be named train.jsonl; name it, for instance, prompts.jsonl.
Also, I now use data_dir instead and it works fine this way.
dataset:
  hf_name: json
  hf_kwargs:
    keep_in_memory: true
    data_dir: /mnt/training/mylocaldataset
  preprocessing_fn: mylocaldataset.utils:prep_fn
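For intuition, that dataset config roughly corresponds to a load_dataset call like the following (a sketch; the exact split inference from file names can vary across datasets versions):

import datasets

# With data_dir, the json builder scans the directory and infers splits from
# file-name patterns, which is why a file literally named train.jsonl can get
# tangled up with the 'train' split; a neutral name like prompts.jsonl avoids that.
dataset = datasets.load_dataset(
    'json',
    split='train',
    data_dir='/mnt/training/mylocaldataset',
    keep_in_memory=True,
)
print(dataset)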
@baptistejamin, do you still have to specify the split? So data_dir is just the directory? How would it find prompts.jsonl then?
I do not know why, but I keep getting the error "FileNotFoundError: Unable to find '/workspace/scripts/train/train' at /workspace/scripts/train"
at the line dataset = datasets.load_dataset(dataset_name, split=split, **kwargs)
My YAML is:
# Dataloaders
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_files:
        train: data/train.json
    preprocessing_fn: preprocess_investopedia:preprocess_investopedia
The data folder is inside the current working directory.
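If it helps, one way to check whether relative pathing is the culprit is to try the same load_dataset call by hand from the directory you launch composer from (a sketch using the paths from the config above):

import os

import datasets

# load_dataset resolves relative data_files against the current working directory,
# so run this from the same directory you launch composer from.
print('cwd:', os.getcwd())
print('train file exists:', os.path.exists('data/train.json'))

dataset = datasets.load_dataset(
    'json',
    split='train',
    data_files={'train': 'data/train.json'},
)
print(dataset)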
Yes, keep the split. I strongly recommend using data_dir rather than data_files. Keep the same config, but replace data_files with data_dir.
The global batch size is too high. Try a batch size of 1, and then increase it until you hit OOM.
Thanks for helping to surface these issues!!
@baptistejamin @arpitkk Did the explanation I posted above provide a useful intuition for how to set up the YAML? I want to make sure our README instructions are clear. I'll use your feedback to update them.
I'll also aim to include some of the gotchas that you have caught, e.g., data_dir vs data_files, relative pathing.
You should make a recipe that is as easy as possible, something that can be repeated by newbies: for instance, a config for a specific type of GPU with the right batch size, the right preprocessing function, and an example JSONL dataset. I am familiar with DeepSpeed, for instance, and have already fine-tuned a dozen models with it, but it still took me some time to fine-tune with llm-foundry.
Hi, after running a few batches the code is failing with the below error:
IndexError: Caught IndexError in DataLoader worker process 6.
Original Traceback (most recent call last):
  File "/home/anaconda3/envs/mpt-train/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/anaconda3/envs/mpt-train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/home/llm-foundry/llmfoundry/data/finetuning/collator.py", line 116, in __call__
    batch = self._process_and_batch_decoder_only(examples)
  File "/home/llm-foundry/llmfoundry/data/finetuning/collator.py", line 222, in _process_and_batch_decoder_only
    batch = self.tokenizer.pad(
  File "/home/anaconda3/envs/mpt-train/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2949, in pad
    if isinstance(encoded_inputs, (list, tuple)) and isinstance(encoded_inputs[0], Mapping):
IndexError: list index out of range
I am using the below yaml file
max_seq_len: 2048
global_seed: 17

# Run Name
run_name:  # If left blank, will be read from env var $COMPOSER_RUN_NAME

# Model
# These must match pretraining
model:
  name: hf_causal_lm
  device: cuda:0
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b

# Tokenizer
tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: ${max_seq_len}

dataset: &hf_dataset
  hf_name: json
  hf_kwargs:
    data_files:
      train: /home/MPT-7B/mpt_dataset/mpt-train.jsonl
      test: /home/MPT-7B/mpt_dataset/mpt-test.jsonl

# Dataloaders
train_loader: &train_loader
  name: finetuning
  dataset:
    <<: *hf_dataset
    split: train
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
    # Use python llmfoundry/data/packing.py --yaml-path /path/to/this/yaml/ ... to profile
    # this run's optimal packing_ratio
    # packing_ratio:
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0

eval_loader:
  <<: *train_loader
  dataset:
    <<: *hf_dataset
    split: test
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: false

# Optimization
scheduler:
  name: linear_decay_with_warmup  # linear no warmup is HF default which dolly used
  t_warmup: 0ba
  alpha_f: 0

optimizer:
  # mimic HF defaults to replicate dolly
  name: decoupled_adamw
  lr: 1.0e-5
  betas:
  - 0.9
  - 0.999
  eps: 1.0e-8
  weight_decay: 0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 1ep
eval_interval: 500ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 2

# System
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb: {}

# Checkpoint to local filesystem or remote object store
save_interval: 2000ba
save_num_checkpoints_to_keep: 1  # Important, this cleans up checkpoints saved to DISK
save_folder: ./llm_local_finetune/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints

# Load from remote object store
# REPLACE THE BELOW with your own checkpoint!
# load_path: oci://my-bucket/my-folder/mpt-7b/checkpoints/some_checkpoint.pt
The training is failing at 486 ba irrespective of which dataset I use. I checked whether any empty inputs are getting passed to collator.py, but it has data:
[{'input_ids': [30003, 310, 271, 9775, 326, 8631, 247, 4836, 15, 19566, 247, 2380, 326, 20420, 29141, 253, 2748, 15, 187, 187, 4118, 41959, 27, 187, 2513, 627, 667, 1039, 281, 1721, 555, 14, 24382, 247, 40315, 8393, 32, 535, 187, 4118, 19371, 27, 187], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [8469, 20444, 20798, 476, 320, 1892, 281, 1721, 555, 14, 24382, 13, 1580, 597, 10748, 1453, 616, 2133, 3276, 285, 18676, 407, 1469, 281, 253, 15336, 347, 3058, 13, 2299, 352, 310, 1896, 949, 247, 9950, 3733, 1232, 15, 50276, 4943, 403, 690, 8521, 7259, 323, 5547, 3733, 27, 187, 187, 18, 15, 50276, 12864, 634, 40315, 8393, 715, 247, 1355, 13, 6537, 2317, 342, 521, 390, 617, 7583, 2739, 285, 23908, 13, 285, 1918, 634, 40315, 8393, 5044, 3733, 15, 187, 19, 15, 50276, 13502, 38529, 9848, 3057, 342, 634, 40315, 8393, 4768, 253, 3733, 1232, 15, 187, 20, 15, 50276, 30802, 472, 1721, 555, 3879, 342, 1355, 26574, 13, 2739, 285, 1132, 2606, 15, 9225, 504, 3081, 26574, 846, 1016, 5547, 3733, 6874, 15, 209, 187, 21, 15, 50276, 5279, 673, 13, 13237, 2572, 253, 673, 634, 40315, 8393, 310, 7591, 281, 2289, 521, 390, 617, 1721, 555, 2170, 15, 50276, 21914, 326, 253, 3733, 1232, 3936, 673, 13, 594, 22450, 285, 5185, 49495, 403, 253, 2234, 281, 2323, 15, 209, 187, 22, 15, 50276, 29146, 1230, 4575, 634, 40315, 8393, 434, 6196, 281, 359, 266, 779, 390, 617, 432, 3081, 26574, 285, 6558, 253, 6799, 3879, 15, 187, 23, 15, 50276, 16628, 5277, 634, 40315, 8393, 342, 24443, 4158, 30653, 281, 1361, 731, 755, 9848, 342, 970, 247, 1721, 555, 275, 253, 987, 4328, 15, 50276, 187, 187, 1231, 671, 971, 281, 22175, 326, 40315, 20798, 403, 1355, 285, 28304, 13, 285, 2430, 2714, 1557, 285, 10885, 15, 50276, 6693, 187]}]
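In a case like this, one way to rule out data problems is to scan the JSONL offline and flag rows that tokenize to nothing or blow past max_seq_len. This is just a sketch: it assumes the same EleutherAI/gpt-neox-20b tokenizer as the config above and a simple prompt/response schema, so adjust the field names to whatever your data actually uses.

import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
max_seq_len = 2048

with open('/home/MPT-7B/mpt_dataset/mpt-train.jsonl') as f:
    for i, line in enumerate(f):
        row = json.loads(line)
        # 'prompt' and 'response' are assumed field names; swap in your own schema
        prompt_ids = tokenizer(row['prompt'])['input_ids']
        response_ids = tokenizer(row['response'])['input_ids']
        if not prompt_ids or not response_ids:
            print(f'row {i}: empty after tokenization')
        elif len(prompt_ids) + len(response_ids) > max_seq_len:
            print(f'row {i}: {len(prompt_ids) + len(response_ids)} tokens > {max_seq_len}')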
Hi Team, I have identified the issue: there was a problem with the batch size. It's working fine now. Thanks for the support!
I tried fine-tuning MPT-7B using the dolly dataset, with the below command: composer train.py yamls/finetune/mpt-7b_dolly_sft.yaml
Before starting training I am getting the below error:
[Eval batch=321/321] Eval on eval data:
Eval metrics/eval/LanguageCrossEntropy: 9.1594
Eval metrics/eval/LanguagePerplexity: 9503.6523
/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Traceback (most recent call last):
File "
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/stsingha/LLM/llm-foundry/scripts/train/train.py", line 254, in
Could you please help with this issue? @arpitkk @baptistejamin @alextrott16
see https://github.com/mosaicml/llm-foundry/issues/143#issuecomment-1553334904
Since this issue was originally made to address the lack of clarity around finetuning from a local dataset, I just want to let folks know that we just pushed a PR that includes a much more concrete example of this workflow.
In the scripts/train directory, you'll find finetune_example, which includes:
- a detailed README
- an example local training dataset
- an implementation of a preprocessing function for that dataset
- a YAML which puts it all together and can be run locally via train.py
To help us stay on top of other issues, I'll close this one. If things remain unclear, feel free to add another comment and I'll re-open the issue if necessary. Thank you!
@arpitkk can you explain what you changed to get this working? I am running into the same issue.
I am facing the same issue as well. @arpitkk, can you please share what you changed to get this working?