Fine-tuning MPT-7B using a local dataset

singhalshikha518 opened this issue 2 years ago • 11 comments

I tried fine-tuning MPT-7B on the Dolly dataset, using the command below:

composer train.py yamls/finetune/mpt-7b_dolly_sft.yaml

yaml file: https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml

Before training starts, I get the error below:

[Eval batch=321/321] Eval on eval data: Eval metrics/eval/LanguageCrossEntropy: 9.1594 Eval metrics/eval/LanguagePerplexity: 9503.6523 /home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. warnings.warn(_create_warning_msg( Traceback (most recent call last): File "", line 21, in _bwd_kernel KeyError: ('2-.-0-.-0-842f0fbd42a6607893f7134cdd9d16f2-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-f24b6aa9b101a518b6a4a6bddded372e-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/stsingha/LLM/llm-foundry/scripts/train/train.py", line 254, in <module>
    main(cfg)
  File "/home/stsingha/LLM/llm-foundry/scripts/train/train.py", line 243, in main
    trainer.fit()
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1766, in fit
    self._train_loop()
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1940, in _train_loop
    total_loss_dict = self._train_batch(use_grad_scaling)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2115, in _train_batch
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/optim/decoupled_weight_decay.py", line 288, in step
    loss = closure()
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2115, in <lambda>
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2213, in _train_microbatches
    microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2340, in _train_microbatch
    microbatch_loss.backward(create_graph=self._backwards_create_graph)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/flash_attn/flash_attn_triton.py", line 827, in backward
    _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv,
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/flash_attn/flash_attn_triton.py", line 694, in _flash_attn_backward
    _bwd_kernel[grid](
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 73, in run
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 73, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 63, in _bench
    return do_bench(kernel_call)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/testing.py", line 140, in do_bench
    fn()
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 62, in kernel_call
    self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 43, in _bwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

singhalshikha518 avatar May 16 '23 12:05 singhalshikha518
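For reference, the Triton kernel that crashes here belongs to MPT's flash-attention path, and the attention implementation can also be selected explicitly when loading the released checkpoint through Hugging Face. A minimal sketch following the mosaicml/mpt-7b model card (not the composer/train.py flow used above, which sets the same attn_impl field from the YAML):

```python
import torch
import transformers

name = 'mosaicml/mpt-7b'

# Pull the remote-code config and pick the attention implementation explicitly.
# 'triton' uses the flash-attn Triton kernel that crashes above; 'torch' avoids it.
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'torch'

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```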

I'm getting the same problem, using triton 2.0.0.dev20221202 as recommended in the setup script.

Paladiamors avatar May 17 '23 07:05 Paladiamors

I am also seeing this problem.

fbiere avatar May 17 '23 14:05 fbiere

What kind of hardware are you using? And have you tried starting from our recommended docker image mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04?

Any other details about your environments would be helpful to know.

alextrott16 avatar May 17 '23 19:05 alextrott16

@alextrott16
I am using a Slurm cluster with 4 A10 GPUs.

CUDA version: 11.6
nvcc version: Cuda compilation tools, release 11.6, V11.6.55 Build cuda_11.6.r11.6/compiler.30794723_0
GCC version: gcc (GCC) 7.3.1 20180303
torch: 1.13.1+cu116

Along with the above error, when I try multi-node training I also get an NCCL error.

singhalshikha518 avatar May 18 '23 05:05 singhalshikha518
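A quick way to capture the environment details being asked for here is a short PyTorch check; a minimal sketch (plain torch introspection, nothing llm-foundry specific):

```python
import torch

print('torch version :', torch.__version__)                    # e.g. 1.13.1+cu116
print('built for CUDA:', torch.version.cuda)                   # CUDA the wheel was built against
print('GPU           :', torch.cuda.get_device_name(0))        # e.g. NVIDIA A10
print('capability    :', torch.cuda.get_device_capability(0))  # A10 is (8, 6), so bf16 is supported
print('device count  :', torch.cuda.device_count())            # 4 on this node
```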

Also, I am getting the error below with 'attn_impl: torch':

  File "llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/modeling_mpt.py", line 142, in forward
    raise NotImplementedError('MPT does not support training with left padding.')

singhalshikha518 avatar May 18 '23 07:05 singhalshikha518

> What kind of hardware are you using? And have you tried starting from our recommended docker image mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04?
>
> Any other details about your environments would be helpful to know.

It looks like the error occurs with torch 1.13.1+cu116, and with torch 2.0.1 I get an error when saving checkpoints. Which torch version should be used with CUDA 11.6?

singhalshikha518 avatar May 18 '23 15:05 singhalshikha518

KeyError: ('2-.-0-.-0-842f0fbd42a6607893f7134cdd9d16f2-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-f24b6aa9b101a518b6a4a6bddded372e-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

This is an error you see if you try to use torch>=2.0.0 (torch>=2.0.0 requires a version of triton that has issues). We are working on a workaround.

vchiley avatar May 18 '23 16:05 vchiley
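A minimal sketch of checking that the installed torch/triton pair matches what this comment describes (a 1.13.x torch plus the triton build pinned by llm-foundry's setup.py, rather than the triton pulled in by torch>=2.0):

```python
import torch
import triton

print('torch  :', torch.__version__)
print('triton :', triton.__version__)

# Per the maintainers above, torch>=2.0.0 drags in a triton release with issues,
# so the Triton flash-attention path here expects a torch 1.13.x install.
if torch.__version__.startswith('2.'):
    print('Warning: torch>=2.0 is the combination reported to hit this Triton crash.')
```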

This error tells you the issue. Your dataset is outputting data with left padding. MPT does not support training with left padding. This is a dataset issue.

vchiley avatar May 18 '23 16:05 vchiley
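A minimal sketch of what "left padding" means at the batch level, with a hypothetical helper that mirrors the kind of check MPT performs (padding appearing before the first real token in a row):

```python
import torch

def batch_is_left_padded(attention_mask: torch.Tensor) -> bool:
    """Hypothetical helper: True if any row starts with padding (mask == 0)."""
    return bool((attention_mask[:, 0] == 0).any())

# Row 0 below is right-padded (real tokens first), row 1 is left-padded (padding first).
right_padded = torch.tensor([[1, 1, 1, 0],
                             [1, 1, 1, 1]])
left_padded = torch.tensor([[1, 1, 1, 1],
                            [0, 0, 1, 1]])

print(batch_is_left_padded(right_padded))  # False -> fine for MPT training
print(batch_is_left_padded(left_padded))   # True  -> triggers the NotImplementedError above
```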

This question may sound a bit silly, but why is right padding used during training while left padding is chosen during inference?

Louis-y-nlp avatar May 19 '23 05:05 Louis-y-nlp

I'm getting the same kernel crash / key error:

[Eval batch=1/13] Eval on eval data
[Eval batch=2/13] Eval on eval data
[Eval batch=3/13] Eval on eval data
[Eval batch=5/13] Eval on eval data
[Eval batch=6/13] Eval on eval data
[Eval batch=7/13] Eval on eval data
[Eval batch=8/13] Eval on eval data
[Eval batch=9/13] Eval on eval data
[Eval batch=11/13] Eval on eval data
[Eval batch=12/13] Eval on eval data
[Eval batch=13/13] Eval on eval data:
         Eval metrics/eval/LanguageCrossEntropy: 10.0889
         Eval metrics/eval/LanguagePerplexity: 24073.9668
Traceback (most recent call last):
  File "<string>", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0--2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))

....

  File "/home/ubuntu/.local/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 43, in _bwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument

Using a g5.24xlarge instance with 4xA10G GPUs on EC2. It's using Torch 1.13.1 already:

ubuntu@ip-172-31-12-71:/opt/mpt-7b/llm-foundry/scripts/train$ pip uninstall torch
Found existing installation: torch 1.13.1

And triton 2.0.0.dev20221202:

ubuntu@ip-172-31-12-71:/opt/mpt-7b/llm-foundry/scripts/train$ pip uninstall triton
Found existing installation: triton 2.0.0.dev20221202

jwatte avatar May 24 '23 19:05 jwatte

Hi @jwatte, could you try installing this fork of triton we have set up? It uses a pre-MLIR tag that should work with torch 1.13.1: https://github.com/mosaicml/llm-foundry/blob/3c66b1c5df668e0684548fef30d00669df64636c/setup.py#L63

In general we have not tried training with A10s so it's a bit of uncharted territory. I hope we can get more internally so we can start adding it to our support matrix, but it's unlikely to happen in the next few weeks.

> This question may sound a bit silly, but why is right padding used during training while left padding is chosen during inference?

I think the choice at training time is a bit arbitrary, but at inference time left padding is used so that the ends of sequences line up: since you generate one token at a time, you want to make sure the new tokens are "lined up" across the batch.

abhi-mosaic avatar May 31 '23 01:05 abhi-mosaic
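To make the alignment point concrete, here is a minimal sketch using a Hugging Face tokenizer (MPT-7B uses the EleutherAI/gpt-neox-20b tokenizer; reusing EOS as the pad token below is an assumption for illustration only):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
tok.pad_token = tok.eos_token  # assumption: reuse EOS as pad for this illustration

prompts = ['Hello', 'A much longer prompt goes here']

tok.padding_side = 'left'
left = tok(prompts, padding=True, return_tensors='pt')
# With left padding, the last column holds the final real token of every prompt,
# so the tokens generated next line up across the batch.

tok.padding_side = 'right'
right = tok(prompts, padding=True, return_tensors='pt')
# With right padding, shorter prompts end in pad tokens, so generated tokens
# would land after the padding instead of directly after the prompt.

print(left['input_ids'])
print(right['input_ids'])
```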

Closing this issue as it's gone a bit stale, but I just want to note that we are actively testing A10 support now and will update the support matrix on the top README once we have confirmed that it works.

abhi-mosaic avatar Jun 13 '23 16:06 abhi-mosaic