
RuntimeError: RuntimeError: IndexError: list index out of range - multiple GPUs

Kushalamummigatti opened this issue 1 year ago · 3 comments

Trying to fine-tune the bigcode/starcoderbase model on an A100 machine with 2 GPUs (40 GB x 2, so 80 GB total). finetune.py is slightly modified: the model is loaded in 4-bit, QLoRA is adopted, and DeepSpeed is used. The DeepSpeed version is 0.9.3, the Transformers version is 4.31.0, and the Accelerate version is 0.21.0. DeepSpeed uses the same configuration as in the chat setup: [starcoder/chat/deepspeed_z3_config_bf16.json at main · bigcode-project/starcoder · GitHub](url)
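For reference, a ZeRO-3 bf16 DeepSpeed config of the kind linked above can also be passed to the HF Trainer inline as a Python dict. A minimal sketch, with illustrative values that are assumptions rather than a copy of the repo's actual JSON file:

```python
# Minimal ZeRO-3 + bf16 DeepSpeed config sketch. The values below are
# assumptions for illustration, not the contents of the linked
# deepspeed_z3_config_bf16.json.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# TrainingArguments accepts either a path to a JSON file or a dict:
# TrainingArguments(..., deepspeed=ds_config)
```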

The fine-tuning actually started but failed during backpropagation; the error is pasted below.

```
python finetune/finetune.py \
  --model_path "bigcode/starcoder" \
  --dataset_name "semeru/text-code-codesummarization" \
  --subset "data/finetune" \
  --split "validation" \
  --size_valid_set 10000 \
  --streaming True \
  --seq_length 2048 \
  --max_steps 1000 \
  --batch_size 1 \
  --input_column_name="input" \
  --output_column_name="output" \
  --gradient_accumulation_steps 16 \
  --learning_rate 1e-4 \
  --lr_scheduler_type "cosine" \
  --num_warmup_steps 100 \
  --weight_decay 0.05 \
  --output_dir "./checkpoints"
```

Setting ds_accelerator to cuda (auto detect)

```
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /opt/conda/lib/python3.7/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/opt/conda/lib/python3.7/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /opt/conda/lib/python3.7/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
2023-06-26 05:07:25.325039: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2023-06-26 05:07:25.325166: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2023-06-26 05:07:25.325186: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
giii True data/finetune validation
HERE: <datasets.iterable_dataset.IterableDataset object at 0x7f6f6f27ae50>
Loading the dataset in streaming mode
100%|███████████████████████████████| 400/400 [00:00<00:00, 430.87it/s]
The character to token ratio of the dataset is: 5.46
Loading the model
/home/unnati/.local/lib/python3.7/site-packages/transformers/modeling_utils.py:2193: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers.
Loading checkpoint shards: 100%|█████████| 7/7 [00:33<00:00, 4.76s/it]
trainable params: 35553280 || all params: 7971805184 || trainable%: 0.4459878180585503
Starting main loop
[2023-06-26 05:09:05,370] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-06-26 05:09:05,370] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-06-26 05:09:05,370] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Training...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.6 but since the APIs are compatible, accepting this combination
Using /home/unnati/.cache/torch_extensions/py37_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/unnati/.cache/torch_extensions/py37_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.2604484558105469 seconds
Using /home/unnati/.cache/torch_extensions/py37_cu116 as PyTorch extensions root...
Emitting ninja build file /home/unnati/.cache/torch_extensions/py37_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.2621340751647949 seconds
Parameter Offload: Total persistent parameters: 18454528 in 482 params
Using /home/unnati/.cache/torch_extensions/py37_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004565715789794922 seconds
Traceback (most recent call last):
  File "/home/unnati/starchat_lora_deepspeed/starcoder/finetune/finetune.py", line 336, in <module>
    main(args)
  File "/home/unnati/starchat_lora_deepspeed/starcoder/finetune/finetune.py", line 325, in main
    run_training(args, train_dataset, eval_dataset)
  File "/home/unnati/starchat_lora_deepspeed/starcoder/finetune/finetune.py", line 316, in run_training
    trainer.train()
  File "/home/unnati/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1543, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "/home/unnati/.local/lib/python3.7/site-packages/transformers/trainer.py", line 1801, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/unnati/.local/lib/python3.7/site-packages/transformers/trainer.py", line 2669, in training_step
    self.accelerator.backward(loss)
  File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 1835, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1862, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1968, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
    allow_unreachable=True, accumulate_grad=True)
RuntimeError: RuntimeError: IndexError: list index out of range
```

```
At:
  /opt/conda/lib/python3.7/site-packages/torch/utils/checkpoint.py(382): inner_pack
  /opt/conda/lib/python3.7/site-packages/torch/nn/functional.py(1252): dropout
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/dropout.py(59): forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(280): forward
  /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py(165): new_forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(354): forward
  /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py(165): new_forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(661): custom_forward
  /opt/conda/lib/python3.7/site-packages/torch/utils/checkpoint.py(408): unpack

At:
  /opt/conda/lib/python3.7/site-packages/torch/nn/functional.py(1252): dropout
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/dropout.py(59): forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(280): forward
  /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py(165): new_forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(354): forward
  /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py(165): new_forward
  /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(1212): _call_impl
  /home/unnati/.local/lib/python3.7/site-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py(661): custom_forward
  /opt/conda/lib/python3.7/site-packages/torch/utils/checkpoint.py(408): unpack
```

finetune.py (modified script):

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)

local_rank = -1
deepspeed = "ds.json"


def run_training(args, train_data, val_data):
    print("Loading the model")

    # Load the base model in 4-bit NF4 with double quantization (QLoRA-style).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        use_auth_token=True,
        # The KV cache is incompatible with gradient checkpointing.
        use_cache=not args.no_gradient_checkpointing,
        quantization_config=bnb_config,
    )

    training_args = TrainingArguments(
        output_dir=args.output_dir,
        dataloader_drop_last=True,
        evaluation_strategy="steps",
        max_steps=args.max_steps,
        eval_steps=args.eval_freq,
        save_steps=args.save_freq,
        logging_steps=args.log_freq,
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        lr_scheduler_type=args.lr_scheduler_type,
        warmup_steps=args.num_warmup_steps,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        gradient_checkpointing=not args.no_gradient_checkpointing,
        fp16=not args.no_fp16,
        bf16=args.bf16,
        weight_decay=args.weight_decay,
        run_name="StarCoder-finetuned",
        do_train=True,
        local_rank=local_rank,
        deepspeed=deepspeed,
        ddp_find_unused_parameters=False,
    )
```
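The snippet above stops before the PEFT wiring, but the "trainable params: 35553280" line in the log shows that LoRA adapters are attached somewhere in the full script. For reference, a minimal sketch of that step; the target module names are an assumption for GPTBigCode/StarCoder, not something taken from this issue:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the frozen 4-bit base model for k-bit training (casts norm
# layers, enables input gradients so gradient checkpointing can work).
model = prepare_model_for_kbit_training(model)

# target_modules is an assumption for GPTBigCode; verify the names
# against model.named_modules() before relying on them.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["c_attn", "c_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

With this in place, only the LoRA adapter weights are trainable, which matches the roughly 0.45% trainable ratio reported in the log.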

Kushalamummigatti · Jun 26 '23

Hi @Kushalamummigatti, I faced a similar issue and realized that my assignment of the validation set was incorrect. When the length of my validation set was 0, I got a similar error.
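A quick sanity check before training can catch this. A minimal sketch, assuming the train_dataset/eval_dataset names from finetune.py; for a streaming IterableDataset it probes one element instead of calling len():

```python
from itertools import islice

def assert_non_empty(ds, name):
    """Fail fast if a (possibly streaming) dataset yields no examples."""
    try:
        n = len(ds)                          # map-style datasets support len()
    except TypeError:
        n = len(list(islice(iter(ds), 1)))   # IterableDataset: fetch one item
    if n == 0:
        raise ValueError(f"{name} dataset is empty; check --split/--subset")

assert_non_empty(train_dataset, "train")
assert_non_empty(eval_dataset, "eval")
```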

ruchaa0112 · Jun 30 '23

> Hi @Kushalamummigatti, I faced a similar issue and realized that my assignment of the validation set was incorrect. When the length of my validation set was 0, I got a similar error.

Thanks for the response. In the modified script I am trying to adopt QLoRA, as shown in the code above. Currently this error is no longer generated, but strangely the script stops executing right after the bitsandbytes setup, without even downloading the model and with no major warnings. I am unable to find out whether QLoRA is simply not supported for StarCoder.
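One way to narrow this down is to load the model in 4-bit outside the training script, with no Trainer or DeepSpeed involved. A minimal sketch, assuming a valid HF token and enough GPU memory:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# If this also hangs, the problem is in the model download / bitsandbytes
# setup itself, not in the Trainer or DeepSpeed integration.
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase", use_auth_token=True)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderbase",
    quantization_config=bnb_config,
    device_map="auto",
    use_auth_token=True,
)
print(model.config.architectures)
```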

Kushalamummigatti · Jul 01 '23

I have the same problem. I tried to fine-tune StarCoder with QLoRA, but all attempts failed; possibly QLoRA does not support StarCoder. I could run the StarCoder fine-tune with QLoRA, but the output seemed to be invalid (it didn't work at inference). Someone claims to have done it successfully, but I'm not really sure (https://github.com/artidoro/qlora/issues/121).

thanhnew2001 · Jul 31 '23