
[BUG] Int8 finetuning throwing a type error

binga opened this issue on Jun 26, 2023

🐛 Bug

Int8 finetuning throws a type error.

I'm trying to finetune the EleutherAI/pythia-2.8b-deduped model on the OASST dataset on a machine with 8 V100 GPUs.

Only the following parameters were changed; everything else was left at its default (a rough standalone sketch of this setup follows the list):

  1. Backbone DType: int8
  2. Batch Size: 3
  3. Epochs: 3
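
For illustration only, here is a rough standalone sketch of that setup at the transformers/peft level (this is not LLM Studio's actual training code; load_in_8bit and the default LoRA target modules are assumptions about what "Backbone DType: int8" and the LoRA settings in the config posted further down map to):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Load the backbone with bitsandbytes 8-bit weights and attach a small LoRA adapter.
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/pythia-2.8b-deduped",
        load_in_8bit=True,   # corresponds to "Backbone DType: int8"
        device_map={"": 0},
    )
    lora_config = LoraConfig(r=4, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # the log below reports 1,310,720 trainable params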

The error I see is:

RuntimeError: expected scalar type Half but found Float

Log and stack trace:

2023-06-26 10:11:09,750 - INFO: Added key: store_based_barrier_key:1 to store for rank: 3   (similar lines for all 8 ranks)
2023-06-26 10:11:09,904 - INFO: Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.   (all 8 ranks)
2023-06-26 10:11:09,997 - INFO: Added key: store_based_barrier_key:2 to store for rank: 0   (all 8 ranks)
2023-06-26 10:11:10,028 - INFO: Rank 6: Completed store-based barrier for key:store_based_barrier_key:2 with 8 nodes.   (all 8 ranks)
2023-06-26 10:11:10,028 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 6, total: 8 local rank: 6.   (one line per process 0-7)
2023-06-26 10:11:13,500 - WARNING: No OpenAI API Key set. Setting metric to BLEU.   (logged by all 8 processes)
2023-06-26 10:11:13,500 - INFO: Global random seed: 689550
2023-06-26 10:11:13,501 - INFO: Preparing the data...
2023-06-26 10:11:13,501 - INFO: Setting up automatic validation split...
2023-06-26 10:11:13,598 - INFO: Preparing train and validation data
2023-06-26 10:11:13,599 - INFO: Loading train dataset...
Using pad_token, but it is not set yet. Using cls_token, but it is not set yet. Using sep_token, but it is not set yet.   (repeated several times per process)
2023-06-26 10:11:13,733 - INFO: Stop token ids: [tensor([ 29, 93, 31984, 49651]), tensor([ 29, 93, 43274, 49651])]
2023-06-26 10:11:13,786 - INFO: Sample prompt: <|prompt|>As an AI what are your personal feelings on Sarah Connor?<|endoftext|><|answer|>
2023-06-26 10:11:13,787 - INFO: Loading validation dataset...
2023-06-26 10:11:13,874 - INFO: Using int8 for backbone   (logged by all 8 processes)
2023-06-26 10:11:13,882 - INFO: Sample prompt: <|prompt|>What types of tests do we have in software development?<|endoftext|><|answer|>
2023-06-26 10:11:13,882 - INFO: Number of observations in train dataset: 8191
2023-06-26 10:11:13,883 - INFO: Number of observations in validation dataset: 83
trainable params: 1,310,720 || all params: 2,776,519,680 || trainable%: 0.04720730090413045   (printed by all 8 processes)
2023-06-26 10:11:24,967 - INFO: Training Epoch: 1 / 3
2023-06-26 10:11:24,968 - INFO: train loss: 0%| | 0/341 [00:00<?, ?it/s]
/home/ubuntu/.local/share/virtualenvs/h2o-llmstudio-B1eqiLCk/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:321: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")   (raised by all 8 processes)
/home/ubuntu/.local/share/virtualenvs/h2o-llmstudio-B1eqiLCk/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py:230: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (Triggered internally at ../aten/src/ATen/native/TensorCompare.cpp:493.)
  attn_scores = torch.where(causal_mask, attn_scores, mask_value)   (raised by all 8 processes)
2023-06-26 10:11:26,183 - ERROR: Exception occurred during H2O LLM Studio run:   (the same traceback is logged by each failing process)
Traceback (most recent call last):
  File "/home/ubuntu/h2o-llmstudio/train_wave.py", line 106, in <module>
    run(cfg=cfg)
  File "/home/ubuntu/h2o-llmstudio/train.py", line 672, in run
    val_loss, val_metric = run_train(
  File "/home/ubuntu/h2o-llmstudio/train.py", line 365, in run_train
    scaler.scale(loss).backward()  # type: ignore
  File "/home/ubuntu/.local/share/virtualenvs/h2o-llmstudio-B1eqiLCk/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/ubuntu/.local/share/virtualenvs/h2o-llmstudio-B1eqiLCk/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ubuntu/.local/share/virtualenvs/h2o-llmstudio-B1eqiLCk/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/home/ubuntu/.local/share/virtualenvs/h2o-llmstudio-B1eqiLCk/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 479, in backward
    grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
RuntimeError: expected scalar type Half but found Float

Any assistance on this is appreciated! Thank you.

binga commented on Jun 26, 2023

Thanks for opening the issue @binga, I'll have a look. Could you maybe share the cfg.yaml file of the failed experiment? I tried int8 on the latest main (keeping the default parameters for everything else) and training starts without errors.

maxjeblick commented on Jun 26, 2023

Sure, here you go.

architecture:
    backbone_dtype: int8
    force_embedding_gradients: false
    gradient_checkpointing: false
    intermediate_dropout: 0.0
    pretrained: true
    pretrained_weights: ''
augmentation:
    random_parent_probability: 0.0
    skip_parent_probability: 0.0
    token_mask_probability: 0.0
dataset:
    add_eos_token_to_answer: true
    add_eos_token_to_prompt: true
    answer_column: output
    chatbot_author: H2O.ai
    chatbot_name: h2oGPT
    data_sample: 1.0
    data_sample_choice:
    - Train
    - Validation
    limit_chained_samples: false
    mask_prompt_labels: true
    parent_id_column: None
    personalize: false
    prompt_column:
    - instruction
    text_answer_separator: <|answer|>
    text_prompt_start: <|prompt|>
    train_dataframe: data/user/oasst/train_full.pq
    validation_dataframe: None
    validation_size: 0.01
    validation_strategy: automatic
environment:
    compile_model: false
    find_unused_parameters: false
    gpus:
    - '0'
    - '1'
    - '2'
    - '3'
    - '4'
    - '5'
    - '6'
    - '7'
    huggingface_branch: main
    mixed_precision: true
    number_of_workers: 8
    seed: -1
    trust_remote_code: true
    use_fsdp: false
experiment_name: green-kudu
llm_backbone: EleutherAI/pythia-2.8b-deduped
logging:
    logger: None
    neptune_project: ''
    number_of_texts: 10
output_directory: output/user/green-kudu/
prediction:
    batch_size_inference: 0
    do_sample: false
    max_length_inference: 256
    metric: BLEU
    min_length_inference: 2
    num_beams: 1
    num_history: 2
    repetition_penalty: 1.2
    stop_tokens: ''
    temperature: 0.3
    top_k: 0
    top_p: 1.0
tokenizer:
    add_prefix_space: false
    add_prompt_answer_tokens: false
    max_length: 512
    max_length_answer: 256
    max_length_prompt: 256
    padding_quantile: 1.0
    use_fast: true
training:
    adaptive_kl_control: true
    advantages_gamma: 0.99
    advantages_lambda: 0.95
    batch_size: 3
    differential_learning_rate: 1.0e-05
    differential_learning_rate_layers: []
    drop_last_batch: true
    epochs: 3
    evaluate_before_training: false
    evaluation_epochs: 1.0
    grad_accumulation: 1
    gradient_clip: 0.0
    initial_kl_coefficient: 0.2
    kl_horizon: 10000
    kl_target: 6.0
    learning_rate: 0.0001
    lora: true
    lora_alpha: 16
    lora_dropout: 0.05
    lora_r: 4
    lora_target_modules: ''
    loss_function: TokenAveragedCrossEntropy
    offload_reward_model: false
    optimizer: AdamW
    ppo_batch_size: 1
    ppo_clip_policy: 0.2
    ppo_clip_value: 0.2
    ppo_epochs: 4
    ppo_generate_temperature: 1.0
    reward_model: OpenAssistant/reward-model-deberta-v3-large-v2
    save_best_checkpoint: false
    scaling_factor_value_loss: 0.1
    schedule: Cosine
    train_validation_data: false
    use_rlhf: false
    warmup_epochs: 0.0
    weight_decay: 0.0

binga commented on Jun 26, 2023

We checked training using the parameters above (DDP, on fewer than 8 GPUs) on two different machines, but we could not reproduce the error. How did you prepare your Python environment?

maxjeblick commented on Jun 26, 2023

Here are the steps I followed before setting up h2o-llmstudio.

  1. I used this AMI from AWS - Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20230620
  2. Installed Python using these steps - https://github.com/h2oai/h2o-llmstudio#system-installs-python-310
  3. Installed VNC using these steps - https://ubuntu.com/tutorials/ubuntu-desktop-aws#1-overview
  4. Ran make setup (a quick version check is sketched below this list).
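
For reference, one way to check which package versions actually ended up in the project's virtualenv (a sketch only; run it with pipenv run python inside the repo, and adjust the package list as needed):

    import torch
    import transformers
    import peft
    import bitsandbytes

    # Versions installed in the environment.
    print(torch.__version__, transformers.__version__, peft.__version__, bitsandbytes.__version__)

    # GPU model and compute capability; a V100 reports (7, 0), which is below the 7.5 that
    # bitsandbytes' int8 tensor-core kernels generally require, so a fallback path is used.
    print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))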

binga commented on Jun 27, 2023

This seems to be an open issue with the bitsandbytes library that occurs on Tesla V100 GPUs. I also found a Stack Overflow post with a concise code example to reproduce the error.

I will monitor the issue linked above and push a fix (probably an updated bitsandbytes version) once the issue has been resolved.
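
For context, a minimal sketch of the failure pattern being described (this is not the snippet from the linked post; it assumes bitsandbytes' Linear8bitLt layer and is only expected to fail on the affected GPUs such as the V100):

    import torch
    import bitsandbytes as bnb

    # An 8-bit linear layer fed float32 activations: the forward pass triggers the
    # "MatMul8bitLt: inputs will be cast from torch.float32 to float16" warning seen
    # in the log, and the backward pass then hits the Half-vs-Float mismatch.
    layer = bnb.nn.Linear8bitLt(64, 64, has_fp16_weights=False).cuda()
    x = torch.randn(4, 64, device="cuda", requires_grad=True)  # float32 input
    out = layer(x)
    out.sum().backward()  # RuntimeError: expected scalar type Half but found Float (on affected GPUs)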

maxjeblick commented on Jun 27, 2023

@binga could you please try setting the following and check if it changes anything?

gradient_checkpointing = True
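
In the cfg.yaml posted above, this is the gradient_checkpointing flag under the architecture section:

    architecture:
        gradient_checkpointing: true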

psinger commented on Jun 27, 2023

Understood, thanks Max.

@psinger, I tried this yesterday and I see the same failure.


binga commented on Jun 27, 2023

Thanks @binga - sorry, but I cannot reproduce your issue with the same settings.

How are you running LLM Studio - via the GUI and make wave? I just want to make sure that it is not a package mismatch issue you are seeing.

psinger commented on Jun 27, 2023

This does indeed seem to be a V100 issue; it is also discussed here: https://github.com/tloen/alpaca-lora/issues/485

psinger commented on Jun 27, 2023

@psinger -- I am running it via the GUI. Happy to do a quick call and debug together if you're available!

binga commented on Jun 27, 2023

@binga if the V100 is the reason, as it seems to be, we will unfortunately not be able to find a solution for this.

And did you try int4 instead of int8?
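
In the experiment config this would presumably be the same backbone_dtype field shown earlier, i.e.:

    architecture:
        backbone_dtype: int4

(Roughly speaking, int4 maps to bitsandbytes 4-bit quantization, which uses a different kernel path than LLM.int8() and may therefore behave differently on V100s; treat that mapping as an assumption rather than a statement about LLM Studio internals.)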

psinger commented on Jun 27, 2023

Understood. Trying it now.

binga commented on Jun 27, 2023

@psinger - int4 finetuning works on a V100 (32GB) machine. Thank you.

int8 shows the same error described in #185.

This is with gradient_checkpointing=True.

binga commented on Jun 27, 2023

Closing this as int4 is a workaround for V100 GPUs.

psinger commented on Aug 14, 2023