Training proceeds fine with 2 GPUs but fails with SIGBUS (exitcode -7) when using 4 V100 GPUs
I'm using qlora on a machine with 4 32GB V100 GPUs. If I use only 2 of the GPUs, training proceeds without any problem, but when I use all 4 GPUs I get the following error (duplicated messages elided):
bin /opt/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so
/opt/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
warn(msg)
/opt/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/opt/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_n0s988ik/none_x54lteke/attempt_0/0/error.json')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/opt/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 116
/opt/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /opt/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so...
...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
...
Found cached dataset json (/huggingface_cache/datasets/Abirate___json/Abirate--english_quotes-6e72855d06356857/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100% 1/1 [00:00<00:00, 296.67it/s]
Loading cached processed dataset at /huggingface_cache/datasets/Abirate___json/Abirate--english_quotes-6e72855d06356857/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-30fcad8a80852380.arrow
...
Loading checkpoint shards: 100% 33/33 [00:17<00:00, 1.86it/s]
/opt/venv/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
You have loaded a model on multiple GPUs. `is_model_parallel` attribute will be force-set to `True` to avoid any unexpected behavior such as device placement mismatching.
The model is loaded in 8-bit precision. To train this model you need to add additional modules inside the model such as adapters using `peft` library and freeze the model weights. Please check the examples in https://github.com/huggingface/peft for more details.
max_steps is given, it will override any value given in num_train_epochs
Currently training with a batch size of: 1
The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: author, quote, tags. If author, quote, tags are not expected by `PeftModelForCausalLM.forward`, you can safely ignore this message.
...
/opt/venv/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 30 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 31 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 2 (pid: 32) of binary: /opt/venv/bin/python3
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==================================================
/base/script/reproduce_error.py FAILED
--------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-13_14:40:56
host : qlora_exp
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 32)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 32
==================================================
The output of running python -m bitsandbytes is:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
/usr/local/cuda-11.6/compat/libcuda.so
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudart.so
/usr/local/cuda-11.6/targets/x86_64-linux/lib/stubs/libcuda.so
+++++++++++++++ WORKING DIRECTORY CUDA PATHS +++++++++++++++
++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++
++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = True
COMPUTE_CAPABILITIES_PER_GPU = ['7.0', '7.0', '7.0', '7.0']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
and the output of transformers-cli env is:
- `transformers` version: 4.31.0.dev0
- Platform: Linux-3.10.0-1160.49.1.el7.x86_64-x86_64-with-glibc2.27
- Python version: 3.10.8
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 1.13.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: YES
- Using distributed or parallel set-up in script?: YES
A minimal script that reproduces the problem is below (the same failure occurs if I change the model or dataset):
import torch
from transformers import (
    AutoModelForCausalLM,
    LlamaTokenizer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments
)
from datasets import load_dataset
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model


def prepare_model(model_id, rank):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map={"": rank})
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)
    target_modules = ["q_proj", "v_proj", "k_proj"]
    config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules=target_modules,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, config)
    return model


def main():
    model_id = "decapoda-research/llama-7b-hf"
    data_name = "Abirate/english_quotes"
    output_dir = "/output"
    training_args = TrainingArguments(
        output_dir=output_dir,
        fp16=True,
        label_smoothing_factor=0.1,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        ddp_find_unused_parameters=False,
        gradient_accumulation_steps=1,
        max_steps=100,
        log_level='debug',
        logging_steps=1
    )
    tokenizer = LlamaTokenizer.from_pretrained(model_id)
    tokenizer.bos_token_id = 1
    tokenizer.pad_token = tokenizer.bos_token
    train_data = load_dataset(data_name)
    train_data = train_data["train"].map(lambda samples: tokenizer(samples["quote"]),
                                         batched=True)
    model = prepare_model(model_id, training_args.local_rank)
    trainer = Trainer(
        model=model,
        train_dataset=train_data,
        args=training_args,
        data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                      mlm=False)
    )
    model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
    trainer.train()


if __name__ == '__main__':
    main()
I'm running the script in a Docker container based on the pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel image. I also tried pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel, but got the same result.
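For completeness, the script is started through torch.distributed.launch (as the traceback above shows), with one worker process per GPU. The 4-GPU invocation looks roughly like the sketch below; the exact flags in my setup may differ slightly:

python -m torch.distributed.launch --nproc_per_node=4 /base/script/reproduce_error.py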
Any advice or tips would be very welcome!