Idefics2 fine-tuning: error when unscale_gradients is called on FP16 gradients during training with transformers and accelerate
System Info
- transformers version: 4.40.0.dev0
- Platform: Linux-5.15.0-101-generic-x86_64-with-glibc2.17
- Python version: 3.8.2
- Huggingface_hub version: 0.20.2
- Safetensors version: 0.4.2
- Accelerate version: 0.30.0.dev0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.2+cu118 (True)
- Tensorflow version (GPU?): 2.13.1 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
@amyeroberts
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
During the training loop, when `accelerator.clip_grad_norm_()` is called, it triggers an unscale operation that fails because the gradients are in FP16. This error suggests a potential issue in how gradient scaling is handled with mixed-precision settings.
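To isolate the failure outside of Idefics2, here is a minimal sketch (a toy model, my own reduction, not the actual training script) that reproduces the same ValueError: PyTorch's `GradScaler` refuses to unscale gradients that are themselves FP16, which is exactly what happens when the model weights are loaded with `torch_dtype=torch.float16`.

```python
import torch

# Toy FP16 model (requires a CUDA GPU): FP16 weights produce FP16 gradients.
model = torch.nn.Linear(4, 4).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

loss = model(torch.randn(2, 4, device="cuda", dtype=torch.float16)).sum()
scaler.scale(loss).backward()

# Raises: ValueError: Attempting to unscale FP16 gradients.
scaler.unscale_(optimizer)
```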
```python
import torch
from peft import LoraConfig
from transformers import (
    BitsAndBytesConfig,
    Idefics2ForConditionalGeneration,
    TrainingArguments,
)

USE_LORA = True
USE_QLORA = False  # the error occurs in the LoRA-only (non-quantized) path

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=".*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$",
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian",
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
        )
    model = Idefics2ForConditionalGeneration.from_pretrained(
        args.model_name,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        quantization_config=bnb_config if USE_QLORA else None,
    )
    if USE_LORA:
        model = model.to(DEVICE)
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = Idefics2ForConditionalGeneration.from_pretrained(
        args.model_name,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        _attn_implementation="flash_attention_2",  # Only available on A100 or H100
    ).to(DEVICE)

print_trainable_parameters(model)

training_args = TrainingArguments(
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.batch_size_per_device,
    # per_device_eval_batch_size=args.batch_size_per_device,
    gradient_accumulation_steps=gradient_accumulation_steps,
    warmup_steps=50,
    learning_rate=args.learning_rate,
    weight_decay=args.weight_decay,
    lr_scheduler_type=args.lr_scheduler_type,
    logging_steps=10,
    log_level="info",
    output_dir=output_dir,
    save_strategy="steps",
    save_steps=200,
    # eval_steps=200,
    save_total_limit=10,
    # evaluation_strategy="steps",
    fp16=True,
    resume_from_checkpoint=True,
    push_to_hub_model_id=model_id,
    remove_unused_columns=False,
    report_to="all",
)
```
```
Traceback (most recent call last):
  File "runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "idefics2_fine_tuning.py", line 302, in <module>
    main(args)
  File "idefics2_fine_tuning.py", line 251, in main
    trainer.train()
  File "trainer.py", line 1858, in train
    return inner_training_loop(
  File "trainer.py", line 2248, in _inner_training_loop
    grad_norm = self.accelerator.clip_grad_norm_(
  File "accelerator.py", line 2254, in clip_grad_norm_
    self.unscale_gradients()
  File "accelerator.py", line 2204, in unscale_gradients
    self.scaler.unscale_(opt)
  File "grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
```
Expected behavior
This doesn't happen with USE_QLORA set to True. I'd expect the model to fine-tune without error.
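To see which dtype the trainable parameters end up with in each configuration, here is a quick diagnostic sketch (the expectation that the QLoRA path leaves the adapters in FP32 is my assumption, not verified against the peft source); in the failing LoRA-only run the adapters appear to inherit FP16 from `torch_dtype`:

```python
# Print the dtype of every trainable parameter; FP16 here is what
# GradScaler later refuses to unscale.
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.dtype)
```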
cc @pacman100 @muellerzr, as the error appears to be Trainer + QLoRA related.
Same error.
I found a solution: remove `torch_dtype`, and it should work fine!
```python
model = Idefics2ForConditionalGeneration.from_pretrained(
    args.model_name,
    device_map="auto",
    low_cpu_mem_usage=True,
    quantization_config=bnb_config if USE_QLORA else None,
)
```
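My understanding of why this works (my own reasoning, not an official explanation): without `torch_dtype`, the weights load in the default FP32, and `fp16=True` in `TrainingArguments` then trains under autocast, so the gradients stay FP32 and `GradScaler.unscale_()` no longer complains. A quick check:

```python
# With torch_dtype removed, the parameters should report FP32.
print(next(model.parameters()).dtype)  # expected: torch.float32
```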
I'm facing the same issue with `torch_dtype=torch.float16`.
> I found a solution: remove `torch_dtype`, and it should work fine!
If `torch_dtype=torch.float16` is removed, the model weights take double the memory to load. Is there any way to train with FP16 weights and LoRA?