llm-foundry
Fine-tuning MPT-7B using a local dataset
I tried fine-tuning MPT-7B on the Dolly dataset, using the command below:
composer train.py yamls/finetune/mpt-7b_dolly_sft.yaml
yaml file: https://github.com/mosaicml/llm-foundry/blob/main/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml
Before training starts, I am getting the error below:
[Eval batch=321/321] Eval on eval data:
Eval metrics/eval/LanguageCrossEntropy: 9.1594
Eval metrics/eval/LanguagePerplexity: 9503.6523
/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Traceback (most recent call last):
  File "<string>", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0-842f0fbd42a6607893f7134cdd9d16f2-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-f24b6aa9b101a518b6a4a6bddded372e-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/stsingha/LLM/llm-foundry/scripts/train/train.py", line 254, in <module>
    main(cfg)
  File "/home/stsingha/LLM/llm-foundry/scripts/train/train.py", line 243, in main
    trainer.fit()
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1766, in fit
    self._train_loop()
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1940, in _train_loop
    total_loss_dict = self._train_batch(use_grad_scaling)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2115, in _train_batch
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/optim/decoupled_weight_decay.py", line 288, in step
    loss = closure()
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2115, in <lambda>
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2213, in _train_microbatches
    microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2340, in _train_microbatch
    microbatch_loss.backward(create_graph=self._backwards_create_graph)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/flash_attn/flash_attn_triton.py", line 827, in backward
    _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv,
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/flash_attn/flash_attn_triton.py", line 694, in _flash_attn_backward
    _bwd_kernel[grid](
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 73, in run
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 73, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 63, in _bench
    return do_bench(kernel_call)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/testing.py", line 140, in do_bench
    fn()
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 62, in kernel_call
    self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
  File "/home/stsingha/LLM/llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 43, in _bwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument
I'm also getting the same problem, using triton 2.0.0.dev20221202 as recommended in the setup script.
I am also seeing this problem.
What kind of hardware are you using? And have you tried starting from our recommended docker image mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04?
Any other details about your environments would be helpful to know.
@alextrott16
I am using a Slurm cluster with 4 A10 GPUs.
CUDA version: 11.6
nvcc version: Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0
GCC version: gcc (GCC) 7.3.1 20180303
torch: 1.13.1+cu116
Along with the above error, when I try multi-node training I am also getting an NCCL error.
I am also getting the error below with 'attn_impl: torch':
  File "llm-foundry/llmfoundry-venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/d8304854d4877849c3c0a78f3469512a84419e84/modeling_mpt.py", line 142, in forward
    raise NotImplementedError('MPT does not support training with left padding.')
What kind of hardware are you using? And have you tried starting from our recommended docker image mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04? Any other details about your environments would be helpful to know.
Looks like the error occurs with torch 1.13.1+cu116, and with torch 2.0.1 I get an error when saving checkpoints. Which torch version should be used for CUDA 11.6?
KeyError: ('2-.-0-.-0-842f0fbd42a6607893f7134cdd9d16f2-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-f24b6aa9b101a518b6a4a6bddded372e-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))
This is an error you see if you try to use torch>=2.0.0 (torch>=2.0.0 requires a version of triton that has issues).
We are working on a workaround.
This error tells you the issue: your dataset is outputting data with left padding, and MPT does not support training with left padding. This is a dataset issue.
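As a concrete check (just a sketch using the Hugging Face transformers API; the tokenizer name below is the one MPT-7B ships with, substitute your own if your setup differs), you can verify which side your tokenizer pads on and switch it to right padding before building the dataset:

from transformers import AutoTokenizer

# MPT-7B uses the EleutherAI/gpt-neox-20b tokenizer; adjust if your setup differs.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # this tokenizer has no pad token by default

print(tokenizer.padding_side)     # should be "right" for MPT training
tokenizer.padding_side = "right"  # left padding is what triggers the NotImplementedError above

batch = tokenizer(["short example", "a somewhat longer example sentence"],
                  padding=True, return_tensors="pt")
# With right padding, each row of attention_mask starts with 1s and ends with 0s.
print(batch["attention_mask"])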
This question may sound a bit silly, but why is right padding used during training while left padding is chosen during inference?
I'm getting the same kernel crash / key error:
[Eval batch=1/13] Eval on eval data
[Eval batch=2/13] Eval on eval data
[Eval batch=3/13] Eval on eval data
[Eval batch=5/13] Eval on eval data
[Eval batch=6/13] Eval on eval data
[Eval batch=7/13] Eval on eval data
[Eval batch=8/13] Eval on eval data
[Eval batch=9/13] Eval on eval data
[Eval batch=11/13] Eval on eval data
[Eval batch=12/13] Eval on eval data
[Eval batch=13/13] Eval on eval data:
Eval metrics/eval/LanguageCrossEntropy: 10.0889
Eval metrics/eval/LanguagePerplexity: 24073.9668
Traceback (most recent call last):
File "<string>", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0--2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False)))
....
File "/home/ubuntu/.local/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 200, in run
return self.fn.run(*args, **kwargs)
File "<string>", line 43, in _bwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument
Using a g5.24xlarge instance with 4xA10G GPUs on EC2. It's using Torch 1.13.1 already:
ubuntu@ip-172-31-12-71:/opt/mpt-7b/llm-foundry/scripts/train$ pip uninstall torch
Found existing installation: torch 1.13.1
And triton 2.0.0.dev20221202:
ubuntu@ip-172-31-12-71:/opt/mpt-7b/llm-foundry/scripts/train$ pip uninstall triton
Found existing installation: triton 2.0.0.dev20221202
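For what it's worth, a quick way to confirm the same versions from inside the venv (a plain PyTorch/importlib check, nothing llm-foundry specific):

import importlib.metadata
import torch

print("torch:", torch.__version__)          # e.g. 1.13.1
print("cuda (build):", torch.version.cuda)  # CUDA version this torch wheel was built against
print("gpu:", torch.cuda.get_device_name(0))
print("triton:", importlib.metadata.version("triton"))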
Hi @jwatte , could you try installing this fork of triton we have set up? It uses a pre-MLIR tag that should work with torch 1.13.1: https://github.com/mosaicml/llm-foundry/blob/3c66b1c5df668e0684548fef30d00669df64636c/setup.py#L63
In general we have not tried training with A10s so it's a bit of uncharted territory. I hope we can get more internally so we can start adding it to our support matrix, but it's unlikely to happen in the next few weeks.
This question may sound a bit silly, but why is right padding used during training while left padding is chosen during inference?
I think the choice at training time is a bit arbitrary, but at inference time left padding is used so that the ends of sequences line up: since you generate one token at a time, you want to make sure the new tokens are "lined up" right after the real text.
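To make the "lined up" part concrete, here is a small sketch of batched generation with left padding (the gpt2 checkpoint is only used for illustration; the same idea applies to MPT):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # pad on the left so each prompt's last real token sits in the final column

model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer(["Hello", "A much longer prompt about fine-tuning"],
                   padding=True, return_tensors="pt")

# generate() appends new tokens after the last column of the batch; with right padding,
# the shorter prompt's new tokens would come after its pad tokens instead of its real text.
out = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(out, skip_special_tokens=True))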
Closing this issue as it's gone a bit stale, but I just want to note that we are actively testing A10 support now and will update the support matrix on the top README once we have confirmed that it works.