starcoder
RuntimeError: RuntimeError: IndexError: list index out of range - multiple GPUs
Trying to fine-tune the bigcode/starcoderbase model on an A100 compute node with 2 GPUs (40 GB x 2, so 80 GB in total). finetune.py is slightly modified: the model is loaded in 4-bit, QLoRA is adopted, and DeepSpeed is used. The DeepSpeed version is 0.9.3, Transformers is 4.31.0, and Accelerate is 0.21.0. DeepSpeed uses the same configuration as the one mentioned in the chat directory: [starcoder/chat/deepspeed_z3_config_bf16.json at main · bigcode-project/starcoder · GitHub](url)
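For context, the referenced file is a ZeRO-3 bf16 configuration. Below is a minimal sketch (not the exact contents of the linked file) of writing an equivalent config to the ds.json that the script below points at, relying on the HF Trainer's DeepSpeed integration to resolve the "auto" placeholders:

import json

# Sketch of a ZeRO-3 bf16 DeepSpeed config; mirrors the general shape of
# deepspeed_z3_config_bf16.json, not its exact contents.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("ds.json", "w") as f:
    json.dump(ds_config, f, indent=2)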
Fine-tuning actually started, but it hit an error during backpropagation; the error is pasted below.
python finetune/finetune.py --model_path "bigcode/starcoder" --dataset_name "semeru/text-code-codesummarization" --subset "data/finetune" --split "validation" --size_valid_set 10000 --streaming True --seq_length 2048 --max_steps 1000 --batch_size 1 --input_column_name="input" --output_column_name="output" --gradient_accumulation_steps 16 --learning_rate 1e-4 --lr_scheduler_type "cosine" --num_warmup_steps 100 --weight_decay 0.05 --output_dir "./checkpoints"
Setting ds_accelerator to cuda (auto detect)
===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/lib/python3.7/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/opt/conda/lib/python3.7/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /opt/conda/lib/python3.7/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
2023-06-26 05:07:25.325039: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2023-06-26 05:07:25.325166: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2023-06-26 05:07:25.325186: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
giii True data/finetune validation
HERE: <datasets.iterable_dataset.IterableDataset object at 0x7f6f6f27ae50>
Loading the dataset in streaming mode
100%|███████████████████████████████| 400/400 [00:00<00:00, 430.87it/s]
The character to token ratio of the dataset is: 5.46
Loading the model
/home/unnati/.local/lib/python3.7/site-packages/transformers/modeling_utils.py:2193: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.", FutureWarning
Loading checkpoint shards: 100%|█████████| 7/7 [00:33<00:00, 4.76s/it]
trainable params: 35553280 || all params: 7971805184 || trainable%: 0.4459878180585503
Starting main loop
[2023-06-26 05:09:05,370] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-26 05:09:05,370] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-26 05:09:05,370] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Training...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.6 but since the APIs are compatible, accepting this combination
Using /home/unnati/.cache/torch_extensions/py37_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/unnati/.cache/torch_extensions/py37_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.2604484558105469 seconds
Using /home/unnati/.cache/torch_extensions/py37_cu116 as PyTorch extensions root...
Emitting ninja build file /home/unnati/.cache/torch_extensions/py37_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.2621340751647949 seconds
Parameter Offload: Total persistent parameters: 18454528 in 482 params
Using /home/unnati/.cache/torch_extensions/py37_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004565715789794922 seconds
Traceback (most recent call last):
  File "/home/unnati/starchat_lora_deepspeed/starcoder/finetune/finetune.py", line 336
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1862, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1968, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
    allow_unreachable=True, accumulate_grad=True)
RuntimeError: RuntimeError: IndexError: list index out of range
At:
  /opt/conda/lib/python3.7/site-packages/torch/utils/checkpoint.py(382): inner_pack
  /opt/conda/lib/python3.7/site-packages/torch/nn/functional.py(1252): dropout
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/dropout.py(59): forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(280): forward
  /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py(165): new_forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(354): forward
  /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py(165): new_forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(661): custom_forward
  /opt/conda/lib/python3.7/site-packages/torch/utils/checkpoint.py(408): unpack

At:
  /opt/conda/lib/python3.7/site-packages/torch/nn/functional.py(1252): dropout
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/dropout.py(59): forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(280): forward
  /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py(165): new_forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(354): forward
  /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py(165): new_forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(661): custom_forward
  /opt/conda/lib/python3.7/site-packages/torch/utils/checkpoint.py(408): unpack
Finetune.py (modified script)
import torch
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)

local_rank = -1
deepspeed = "ds.json"


def run_training(args, train_data, val_data):
    print("Loading the model")

    # 4-bit NF4 quantization config for QLoRA
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        use_auth_token=True,
        use_cache=not args.no_gradient_checkpointing,
        quantization_config=bnb_config,
    )

    training_args = TrainingArguments(
        output_dir=args.output_dir,
        dataloader_drop_last=True,
        evaluation_strategy="steps",
        max_steps=args.max_steps,
        eval_steps=args.eval_freq,
        save_steps=args.save_freq,
        logging_steps=args.log_freq,
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        lr_scheduler_type=args.lr_scheduler_type,
        warmup_steps=args.num_warmup_steps,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        gradient_checkpointing=not args.no_gradient_checkpointing,
        fp16=not args.no_fp16,
        bf16=args.bf16,
        weight_decay=args.weight_decay,
        run_name="StarCoder-finetuned",
        do_train=True,
        local_rank=local_rank,
        deepspeed=deepspeed,
        ddp_find_unused_parameters=False,
    )
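The snippet above stops after TrainingArguments and does not show where the LoRA adapters are attached, even though the log prints trainable parameters. For reference, a minimal sketch of how QLoRA is typically attached with peft after the 4-bit load; the rank/alpha values and the GPT-BigCode target module names below are illustrative assumptions, not taken from the original script:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Make the 4-bit base model trainable (casts norms, enables input grads, etc.)
model = prepare_model_for_kbit_training(
    model, use_gradient_checkpointing=not args.no_gradient_checkpointing
)

# Attach LoRA adapters; r / alpha / target_modules are example values only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj", "q_attn"],  # GPT-BigCode attention projections
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # produces the "trainable params: ..." log line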
Hi @Kushalamummigatti, I faced a similar issue and realized that my assignment of the validation set was incorrect. When the length of my validation set was 0, I got a similar error.
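For anyone hitting the same thing, a quick hypothetical sanity check (not part of the original finetune.py) is to pull one example from the streaming validation split before handing it to the Trainer; valid_data here is an assumed name for whatever is passed as eval_dataset:

# Confirm the (possibly streaming) validation dataset is non-empty.
first_example = next(iter(valid_data), None)
if first_example is None:
    raise ValueError(
        "Validation set is empty - check --split, --size_valid_set and the subset path"
    )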
Thanks for the response. In the modified script I'm trying to adopt QLoRA, as shown in the code above. This error is no longer generated, but strangely the script stops executing right after the bitsandbytes setup and never even downloads the model, with no major warnings. I can't figure out whether QLoRA is simply not supported for StarCoder.
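One way to narrow that down is to run the 4-bit load on its own, outside DeepSpeed and the Trainer, and see whether it ever reaches the checkpoint-shard download. This is only a standalone sketch using the same quantization settings as above (model name and device_map are assumptions):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase", use_auth_token=True)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderbase",
    quantization_config=bnb_config,
    device_map="auto",
    use_auth_token=True,
)
print(model.config)  # if this prints, the 4-bit load itself is fine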
I have the same problem. I tried to fine-tune StarCoder with QLoRA, but all my attempts failed, so perhaps QLoRA does not support StarCoder. I could run the StarCoder fine-tuning with QLoRA, but the output didn't seem to be valid (it didn't work at inference). Someone claims to have done it successfully, but I'm not really sure: (https://github.com/artidoro/qlora/issues/121)