
Training proceeds fine when using 2 GPUs but fails with a SIGTERM error when using 4 V100 GPUs


I'm using qlora on a machine with four 32GB V100 GPUs. If I use only 2 of the GPUs, training proceeds without any problem, but when I use all 4 GPUs I get the following error (duplicated messages elided):

bin /opt/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so
/opt/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
  warn(msg)
/opt/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_n0s988ik/none_x54lteke/attempt_0/0/error.json')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/opt/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 116
/opt/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary /opt/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so...
...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
...
Found cached dataset json (/huggingface_cache/datasets/Abirate___json/Abirate--english_quotes-6e72855d06356857/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100% 1/1 [00:00<00:00, 296.67it/s]
Loading cached processed dataset at /huggingface_cache/datasets/Abirate___json/Abirate--english_quotes-6e72855d06356857/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-30fcad8a80852380.arrow
...
Loading checkpoint shards: 100% 33/33 [00:17<00:00,  1.86it/s]
/opt/venv/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
You have loaded a model on multiple GPUs. `is_model_parallel` attribute will be force-set to `True` to avoid any unexpected behavior such as device placement mismatching.
The model is loaded in 8-bit precision. To train this model you need to add additional modules inside the model such as adapters using `peft` library and freeze the model weights. Please check  the examples in https://github.com/huggingface/peft for more details.
max_steps is given, it will override any value given in num_train_epochs
Currently training with a batch size of: 1
The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: author, quote, tags. If author, quote, tags are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
...
/opt/venv/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 30 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 31 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 2 (pid: 32) of binary: /opt/venv/bin/python3
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==================================================
/base/script/reproduce_error.py FAILED
--------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-13_14:40:56
  host      : qlora_exp
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 32)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 32
==================================================
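
For clarity: the SIGTERM messages above are just the elastic agent tearing down the sibling workers; the actual failure is rank 2 exiting with code -7, i.e. it was killed by signal 7 (SIGBUS), which matches the traceback field. A quick check with the standard library confirms the mapping:

import signal

# A negative exit code of -N from torchelastic means the child was killed by
# signal N; on Linux, signal 7 is SIGBUS.
print(signal.Signals(7).name)  # prints "SIGBUS"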

The output of running `python -m bitsandbytes` is:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
/usr/local/cuda-11.6/compat/libcuda.so
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudart.so
/usr/local/cuda-11.6/targets/x86_64-linux/lib/stubs/libcuda.so

+++++++++++++++ WORKING DIRECTORY CUDA PATHS +++++++++++++++


++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++

++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = True
COMPUTE_CAPABILITIES_PER_GPU = ['7.0', '7.0', '7.0', '7.0']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

and the output of `transformers-cli env` is:

- `transformers` version: 4.31.0.dev0
- Platform: Linux-3.10.0-1160.49.1.el7.x86_64-x86_64-with-glibc2.27
- Python version: 3.10.8
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 1.13.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: YES
- Using distributed or parallel set-up in script?: YES

A minimal script to reproduce the problem is below (the same thing occurs if I change the model or dataset):

import torch

from transformers import (
    AutoModelForCausalLM,
    LlamaTokenizer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset

from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model


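# Load the base model in 4-bit (NF4 with double quantization), enable gradient
# checkpointing, and attach LoRA adapters to the attention projections.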
def prepare_model(model_id, rank):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

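    # device_map={"": rank} places the whole model on this process's GPU, so each
    # DDP worker holds its own quantized copy rather than sharding across devices.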
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map={"": rank})

    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)
    target_modules = ["q_proj", "v_proj", "k_proj"]

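    # Rank-8 LoRA adapters (with dropout) on the q/k/v projection layers.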
    config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules=target_modules,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)

    return model


def main():
    model_id = "decapoda-research/llama-7b-hf"
    data_name = "Abirate/english_quotes"
    output_dir = "/output"

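    # fp16 training with a per-device batch size of 1; max_steps=100 overrides
    # num_train_epochs.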
    training_args = TrainingArguments(
        output_dir=output_dir,
        fp16=True,
        label_smoothing_factor=0.1,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        ddp_find_unused_parameters=False,
        gradient_accumulation_steps=1,
        max_steps=100,
        log_level='debug',
        logging_steps=1
    )

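    # Work around the checkpoint's tokenizer config: force the BOS id and reuse
    # BOS as the pad token (the LLaMA tokenizer defines no pad token of its own).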
    tokenizer = LlamaTokenizer.from_pretrained(model_id)
    tokenizer.bos_token_id = 1
    tokenizer.pad_token = tokenizer.bos_token

    train_data = load_dataset(data_name)
    train_data = train_data["train"].map(
        lambda samples: tokenizer(samples["quote"]), batched=True)

    model = prepare_model(model_id, training_args.local_rank)

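    # mlm=False gives causal-LM collation: labels are copied from input_ids.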
    trainer = Trainer(
        model=model,
        train_dataset=train_data,
        args=training_args,
        data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                      mlm=False)
    )
    model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
    trainer.train()


if __name__ == '__main__':
    main()

I'm running it in a Docker container based on the pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel image.

I also tried the pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel image, but I got the same result.

Any advice or tips would be very welcome!

joeforan76 · Jun 13 '23