Finetune MPT models with local dataset
Hello Team,
Can you please provide guidance on how to finetune on local datasets? The instructions given in scripts/train are not very clear. The YAML below was given as a sample example:
train_loader:
  name: finetuning
  dataset:
    hf_name: my-local-dataset
    hf_kwargs:
      data_files:
        train: /path/to/train.jsonl
    preprocessing_fn: my.import.path:my_preprocessing_fn
    split: train
So do we need to create a loading script to load the local dataset using HuggingFace datasets, or is there a way to use JSONL file paths directly instead of converting them into a HuggingFace dataset? And in that case, what changes do I need to make in the YAML file?
I tried doing the same, and I do agree the instructions are not clear at all.
Sorry for the lack of clarity here, we'll update the docs.
Going off of @arpitkk 's example, you would want to use train_loader.dataset.hf_name: json and keep the rest the same.
Just to clarify, that config will eventually influence the behavior of this dataset-building method here: https://github.com/mosaicml/llm-foundry/blob/6c16a6e9b31abe179dda7d71e9601f866721d6fa/llmfoundry/data/finetuning/tasks.py#L223
So, by setting your config to
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_files:
        train: /path/to/train.jsonl
    preprocessing_fn: my.import.path:my_preprocessing_fn
    split: train
The build_from_hf method will effectively execute the following code:
from typing import Dict

import datasets
from llmfoundry.data.finetuning.tasks import _tokenize_formatted_example
from my.import.path import my_preprocessing_fn

# `tokenizer` here is the tokenizer built from the `tokenizer` section of the YAML
dataset = datasets.load_dataset('json', split='train', data_files={'train': '/path/to/train.jsonl'})

def dataset_mapper(example: Dict):
    # preprocess the raw example into prompt/response format, then tokenize it
    example = my_preprocessing_fn(example)
    return _tokenize_formatted_example(example, tokenizer)

columns_to_remove = list(dataset[0].keys())
tokenized_dataset = dataset.map(
    dataset_mapper,
    batched=False,
    remove_columns=columns_to_remove,
)
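For completeness, here is a minimal sketch of what a preprocessing function could look like. The instruction/output field names are just placeholders for whatever keys your JSONL actually uses; the important part is returning prompt and response keys (check _tokenize_formatted_example in tasks.py for the exact contract):

from typing import Dict

def my_preprocessing_fn(example: Dict) -> Dict[str, str]:
    # Map one raw JSONL row into the prompt/response format the finetuning
    # collator expects. 'instruction' and 'output' are placeholder field names;
    # swap in whatever your local dataset actually uses.
    return {
        'prompt': example['instruction'],
        'response': example['output'],
    }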
Does that help to clear things up? If so, I can work a similar explanation into the README. If not, please let me know what still feels unclear!
There might be an issue then. Here is the config I am using:
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_files:
        train: /mnt/training/mylocaldataset/train.jsonl
    preprocessing_fn: mylocaldataset.utils:prep_fn
    split: train
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
I am running with: composer llm-foundry/scripts/train/train.py mpt.yml save_folder=mpt-tuned at the path /mnt/training.
wc -l /mnt/training/mylocaldataset/train.jsonl: 1212821
I will debug this today and let you know if I make some progress :)
Issue found!
It seems there are a couple of bugs in the HuggingFace datasets library. First, due to a regex problem, the loader mixes up the file train.jsonl and the split train.
The dataset file should not be named train.jsonl; name it, for instance, prompts.jsonl.
Also, I now use data_dir instead and it works fine this way.
dataset:
  hf_name: json
  hf_kwargs:
    keep_in_memory: true
    data_dir: /mnt/training/mylocaldataset
  preprocessing_fn: mylocaldataset.utils:prep_fn
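For intuition, that dataset config roughly corresponds to a load_dataset call like the following (a sketch; the exact split inference from file names can vary across datasets versions):

import datasets

# With data_dir, the json builder scans the directory and infers splits from
# file-name patterns, which is why a file literally named train.jsonl can get
# tangled up with the 'train' split; a neutral name like prompts.jsonl avoids that.
dataset = datasets.load_dataset(
    'json',
    split='train',
    data_dir='/mnt/training/mylocaldataset',
    keep_in_memory=True,
)
print(dataset)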
@baptistejamin, do you still have to specify the split? So data_dir is just the directory? How would it find prompts.jsonl then?
I do not know why, but I keep getting the error "FileNotFoundError: Unable to find '/workspace/scripts/train/train' at /workspace/scripts/train"
at the line dataset = datasets.load_dataset(dataset_name, split=split, **kwargs)
My YAML is:
# Dataloaders
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_files:
        train: data/train.json
    preprocessing_fn: preprocess_investopedia:preprocess_investopedia
The data folder is inside the current working directory.
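If it helps, one way to check whether relative pathing is the culprit is to try the same load_dataset call by hand from the directory you launch composer from (a sketch using the paths from the config above):

import os

import datasets

# load_dataset resolves relative data_files against the current working directory,
# so run this from the same directory you launch composer from.
print('cwd:', os.getcwd())
print('train file exists:', os.path.exists('data/train.json'))

dataset = datasets.load_dataset(
    'json',
    split='train',
    data_files={'train': 'data/train.json'},
)
print(dataset)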
Yes, keep the split. I strongly recommend using data_dir rather than data_files. Keep the same config, but replace data_files with data_dir.
The global batch size is too high. Try a batch size of 1, and then increase it until you hit OOM.
Thanks for helping to surface these issues!!
@baptistejamin @arpitkk Did the explanation I posted above provide a useful intuition for how to set up the YAML? I want to make sure our README instructions are clear. I'll use your feedback to update them.
I'll also aim to include some of the gotchas that you have caught, e.g., data_dir vs data_files, relative pathing.
You should make a recipe that is as easy as possible, something that can be repeated by newbies: for instance, a config for a specific type of GPU with the right batch size, the right preprocessing function, and an example JSONL dataset. I am familiar with DeepSpeed, for instance, and have already fine-tuned a dozen models with it, but it still took me some time to fine-tune with llm-foundry.
Hi, after running a few batches the code is failing with the below error:
IndexError: Caught IndexError in DataLoader worker process 6.
Original Traceback (most recent call last):
  File "/home/anaconda3/envs/mpt-train/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/anaconda3/envs/mpt-train/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/home/llm-foundry/llmfoundry/data/finetuning/collator.py", line 116, in __call__
    batch = self._process_and_batch_decoder_only(examples)
  File "/home/llm-foundry/llmfoundry/data/finetuning/collator.py", line 222, in _process_and_batch_decoder_only
    batch = self.tokenizer.pad(
  File "/home/anaconda3/envs/mpt-train/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2949, in pad
    if isinstance(encoded_inputs, (list, tuple)) and isinstance(encoded_inputs[0], Mapping):
IndexError: list index out of range
I am using the below yaml file
max_seq_len: 2048
global_seed: 17

# Run Name
run_name:  # If left blank, will be read from env var $COMPOSER_RUN_NAME

# Model
# These must match pretraining
model:
  name: hf_causal_lm
  device: cuda:0
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b

# Tokenizer
tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: ${max_seq_len}

dataset: &hf_dataset
  hf_name: json
  hf_kwargs:
    data_files:
      train: /home/MPT-7B/mpt_dataset/mpt-train.jsonl
      test: /home/MPT-7B/mpt_dataset/mpt-test.jsonl

# Dataloaders
train_loader: &train_loader
  name: finetuning
  dataset:
    <<: *hf_dataset
    split: train
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
    # Use python llmfoundry/data/packing.py --yaml-path /path/to/this/yaml/ ... to profile
    # this run's optimal packing_ratio
    # packing_ratio:
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0

eval_loader:
  <<: *train_loader
  dataset:
    <<: *hf_dataset
    split: test
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: false

# Optimization
scheduler:
  name: linear_decay_with_warmup  # linear no warmup is HF default which dolly used
  t_warmup: 0ba
  alpha_f: 0

optimizer:
  # mimic HF defaults to replicate dolly
  name: decoupled_adamw
  lr: 1.0e-5
  betas:
  - 0.9
  - 0.999
  eps: 1.0e-8
  weight_decay: 0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 1ep
eval_interval: 500ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 2

# System
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb: {}

# Checkpoint to local filesystem or remote object store
save_interval: 2000ba
save_num_checkpoints_to_keep: 1  # Important, this cleans up checkpoints saved to DISK
save_folder: ./llm_local_finetune/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints

# Load from remote object store
# REPLACE THE BELOW with your own checkpoint!
# load_path: oci://my-bucket/my-folder/mpt-7b/checkpoints/some_checkpoint.pt
The training is failing at 486 ba irrespective of which dataset I use. I checked whether any empty inputs are getting passed to collator.py, but it has data:
[{'input_ids': [30003, 310, 271, 9775, 326, 8631, 247, 4836, 15, 19566, 247, 2380, 326, 20420, 29141, 253, 2748, 15, 187, 187, 4118, 41959, 27, 187, 2513, 627, 667, 1039, 281, 1721, 555, 14, 24382, 247, 40315, 8393, 32, 535, 187, 4118, 19371, 27, 187], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [8469, 20444, 20798, 476, 320, 1892, 281, 1721, 555, 14, 24382, 13, 1580, 597, 10748, 1453, 616, 2133, 3276, 285, 18676, 407, 1469, 281, 253, 15336, 347, 3058, 13, 2299, 352, 310, 1896, 949, 247, 9950, 3733, 1232, 15, 50276, 4943, 403, 690, 8521, 7259, 323, 5547, 3733, 27, 187, 187, 18, 15, 50276, 12864, 634, 40315, 8393, 715, 247, 1355, 13, 6537, 2317, 342, 521, 390, 617, 7583, 2739, 285, 23908, 13, 285, 1918, 634, 40315, 8393, 5044, 3733, 15, 187, 19, 15, 50276, 13502, 38529, 9848, 3057, 342, 634, 40315, 8393, 4768, 253, 3733, 1232, 15, 187, 20, 15, 50276, 30802, 472, 1721, 555, 3879, 342, 1355, 26574, 13, 2739, 285, 1132, 2606, 15, 9225, 504, 3081, 26574, 846, 1016, 5547, 3733, 6874, 15, 209, 187, 21, 15, 50276, 5279, 673, 13, 13237, 2572, 253, 673, 634, 40315, 8393, 310, 7591, 281, 2289, 521, 390, 617, 1721, 555, 2170, 15, 50276, 21914, 326, 253, 3733, 1232, 3936, 673, 13, 594, 22450, 285, 5185, 49495, 403, 253, 2234, 281, 2323, 15, 209, 187, 22, 15, 50276, 29146, 1230, 4575, 634, 40315, 8393, 434, 6196, 281, 359, 266, 779, 390, 617, 432, 3081, 26574, 285, 6558, 253, 6799, 3879, 15, 187, 23, 15, 50276, 16628, 5277, 634, 40315, 8393, 342, 24443, 4158, 30653, 281, 1361, 731, 755, 9848, 342, 970, 247, 1721, 555, 275, 253, 987, 4328, 15, 50276, 187, 187, 1231, 671, 971, 281, 22175, 326, 40315, 20798, 403, 1355, 285, 28304, 13, 285, 2430, 2714, 1557, 285, 10885, 15, 50276, 6693, 187]}]
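In a case like this, one way to rule out data problems is to scan the JSONL offline and flag rows that tokenize to nothing or blow past max_seq_len. This is just a sketch: it assumes the same EleutherAI/gpt-neox-20b tokenizer as the config above and a simple prompt/response schema, so adjust the field names to whatever your data actually uses.

import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
max_seq_len = 2048

with open('/home/MPT-7B/mpt_dataset/mpt-train.jsonl') as f:
    for i, line in enumerate(f):
        row = json.loads(line)
        # 'prompt' and 'response' are assumed field names; swap in your own schema
        prompt_ids = tokenizer(row['prompt'])['input_ids']
        response_ids = tokenizer(row['response'])['input_ids']
        if not prompt_ids or not response_ids:
            print(f'row {i}: empty after tokenization')
        elif len(prompt_ids) + len(response_ids) > max_seq_len:
            print(f'row {i}: {len(prompt_ids) + len(response_ids)} tokens > {max_seq_len}')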
Hi Team, I have identified the issue: there was a problem with the batch size. It's working fine now. Thanks for the support!
I tried fine-tuning MPT-7B using the dolly dataset, with the below command: composer train.py yamls/finetune/mpt-7b_dolly_sft.yaml
Before starting training I am getting the below error:
[Eval batch=321/321] Eval on eval data:
Eval metrics/eval/LanguageCrossEntropy: 9.1594
Eval metrics/eval/LanguagePerplexity: 9503.6523
/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Traceback (most recent call last):
File "
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/stsingha/LLM/llm-foundry/scripts/train/train.py", line 254, in
Could you please help with this issue? @arpitkk @baptistejamin @alextrott16
see https://github.com/mosaicml/llm-foundry/issues/143#issuecomment-1553334904
Since this issue was originally made to address the lack of clarity around finetuning from a local dataset, I just want to let folks know that we just pushed a PR that includes a much more concrete example of this workflow.
In the scripts/train directory, you'll find finetune_example, which includes:
- a detailed README
- an example local training dataset
- an implementation of a preprocessing function for that dataset
- a YAML which puts it all together and can be run locally via train.py
To help us stay on top of other issues, I'll close this one. If things remain unclear, feel free to add another comment and I'll re-open the issue if necessary. Thank you!
@arpitkk can you explain what you changed to get this working? I am running into the same issue.
I am facing the same issue as well. @arpitkk, can you please share what you changed to get this working?