GPU Allocation Issue (QLoRA + Llama3-8B-IT)
System Info
- peft: 0.10.1.dev0
- accelerate: 0.30.0
- bitsandbytes: 0.43.1
- transformers: 4.39.3
- GPU: A6000 * 2 (96 GB total)
- nvidia-driver version: 535.171.04
- cuda: 11.8
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder
- [X] My own task or dataset (give details below)
Reproduction
I was training a Llama3-8B-IT model with QLoRA. The training itself succeeded, but GPU memory was not allocated evenly across the two GPUs. Is this a version issue with peft or transformers, or with the graphics driver? On a previous A100*8 server the allocation was even during training, but I don't know whether that is relevant here.
This is my script.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization with bf16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token_id = tok.eos_token_id

# load the quantized model and let accelerate spread it over both GPUs
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto'
)

data = load_dataset("...")
# `process` is a user-defined preprocessing function that produces a 'text' column
proc_data = data.map(process, remove_columns = data['train'].column_names)
tokenized_proc_data = proc_data.map(lambda x: tok(x['text'], truncation = True, max_length = 2048), batched = True)
tokenized_proc_data = tokenized_proc_data.remove_columns("text")

lora_config = LoraConfig(
    r = 16,
    lora_alpha = 32,
    lora_dropout = 0.01,
    target_modules = "all-linear"
)
model = get_peft_model(model, lora_config)

train_args_trainer = TrainingArguments(
    num_train_epochs = 3,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    learning_rate = 2e-8,
    logging_steps = 100,
    warmup_steps = 100,
    save_total_limit = 3,
    output_dir = "llama3-7b-4bit-lora-test2",
    optim = "paged_adamw_32bit",
    bf16 = True,
    report_to = "wandb",
    run_name = "llama3-7b-4bit-lora-test2",
    remove_unused_columns = False
)

model.is_parallelizable = True
model.model_parallel = True

trainer = Trainer(
    model = model,
    tokenizer = tok,
    args = train_args_trainer,
    train_dataset = tokenized_proc_data['train'],
    data_collator = DataCollatorForLanguageModeling(tok, mlm = False)
)
trainer.train()
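Not part of the original report, but one quick way to see how device_map='auto' distributed the model is to inspect the device map and per-GPU memory right after from_pretrained (before the PEFT wrapping); a minimal sketch, assuming the model object created above:

import torch

# Which device accelerate placed each module on
print(model.hf_device_map)

# Memory actually held by each GPU after loading (in GiB)
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"cuda:{i}: allocated={allocated:.1f} GiB, reserved={reserved:.1f} GiB")

The nvidia-smi snapshot below, taken during training, shows the imbalance (about 13 GiB used on GPU 0 vs. about 32 GiB on GPU 1):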
Wed May 8 06:54:12 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:1F:00.0 Off | Off |
| 30% 58C P2 145W / 300W | 13224MiB / 49140MiB | 40% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 Off | 00000000:8B:00.0 Off | Off |
| 47% 71C P2 221W / 300W | 32908MiB / 49140MiB | 73% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 207219 C /data/envs/tt/bin/python 13090MiB |
| 1 N/A N/A 207219 C /data/envs/tt/bin/python 32774MiB |
+---------------------------------------------------------------------------------------+
Expected behavior
I expect memory to be allocated evenly across both GPUs.
Hmm, hard to say, and I can't easily try to reproduce this. Do you already see strange behavior right after loading the model, before training starts? And if you try without PEFT, do you see the same issue? (If there isn't enough memory for full fine-tuning without PEFT, you could e.g. turn off autograd on most of the layers to "simulate" parameter-efficient fine-tuning.) If yes, this could be an accelerate issue.
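A minimal sketch of the "turn off autograd on most layers" idea above, assuming a Llama-style module layout (model.model.layers); it loads the base model without PEFT and without 4-bit quantization (so the unfrozen base weights stay trainable) and freezes everything except the last two decoder layers. The layer slice [-2:] is illustrative only:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype = torch.bfloat16,
    device_map = 'auto'
)

# Freeze everything ...
for param in model.parameters():
    param.requires_grad = False

# ... then re-enable gradients on the last two decoder layers only
for layer in model.model.layers[-2:]:
    for param in layer.parameters():
        param.requires_grad = True

If the memory imbalance between the two GPUs also shows up in this setting, that would point towards accelerate's device placement rather than PEFT.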