LaVIN
[BUG] When a NaN loss is encountered -> ZeroDivisionError: float division by zero
```
[12:53:11.383037] NaN loss encountered. Skipping this batch.
Traceback (most recent call last):
  File "train.py", line 262, in <module>
    main(args)
  File "train.py", line 230, in main
    train_stats = train_one_epoch(
  File "/media/localhost/E/projects/github/multi-modal/vision-language/LaVIN/engine.py", line 36, in train_one_epoch
    for data_iter_step, (examples, labels, example_mask, images, indicators) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "/media/localhost/E/projects/github/multi-modal/vision-language/LaVIN/util/misc.py", line 154, in log_every
    meters=str(self),
  File "/media/localhost/E/projects/github/multi-modal/vision-language/LaVIN/util/misc.py", line 112, in __str__
    "{}: {}".format(name, str(meter))
  File "/media/localhost/E/projects/github/multi-modal/vision-language/LaVIN/util/misc.py", line 81, in __str__
    global_avg=self.global_avg,
  File "/media/localhost/E/projects/github/multi-modal/vision-language/LaVIN/util/misc.py", line 67, in global_avg
    return self.total / self.count
ZeroDivisionError: float division by zero
```
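The crash mechanism is visible in the traceback: `SmoothedValue.global_avg` divides `total` by `count`, and `count` is still zero because every batch so far was skipped before the loss meter was ever updated. A minimal defensive guard, assuming the MAE-style `SmoothedValue` class that `util/misc.py` appears to vendor (only the `total` and `count` attribute names are confirmed by the traceback):

```python
# util/misc.py: hedged sketch of a guard for SmoothedValue.global_avg.
# `total` and `count` are the names shown in the traceback; the rest of
# the class is assumed to match the MAE-style original.
@property
def global_avg(self):
    # If every batch so far was skipped (e.g. NaN loss), update() never
    # ran, count is still 0, and `total / count` raises ZeroDivisionError.
    if self.count == 0:
        return float("nan")  # renders as "nan" in the log line instead of crashing
    return self.total / self.count
```

The guard only silences the logging crash; the NaN loss itself still needs a cause. For reference, the exact commands that reproduce this: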
Training script:
```bash
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node 1 train.py \
    --llm_model 7B \
    --llama_model_path ./data/weights/ \
    --data_root ./data \
    --data_path ./data/alpaca_data.json \
    --caption_file ./data/captions.json \
    --max_seq_len 512 \
    --batch_size 1 \
    --accum_iter 32 \
    --epochs 20 \
    --warmup_epochs 2 \
    --blr 9e-3 \
    --weight_decay 0.02 \
    --output_dir ./data/output/LaVIN-Vicuna-7B-lite/ \
    --log_dir ./data/output/LaVIN-Vicuna-7B-lite/logs/ \
    --adapter_type attn \
    --adapter_dim 8 \
    --adapter_scale 1 \
    --n_prompt 6 \
    --prompt_format QCM-ALE \
    --temperature 10. \
    --visual_adapter_type router \
    --gradient_checkpointing \
    --bits 4bit \
    --cpu_load \
    --use_vicuna
```
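Running this training command prints the skip message on every batch, which is what empties the logging window. The skip presumably follows the usual MAE-style pattern in `engine.py`; the following is a sketch under that assumption, with `model`, `compute_loss`, and the optimizer wiring as illustrative stand-ins rather than LaVIN's actual code:

```python
# engine.py: hedged sketch of the NaN-skip path implied by the log output.
import math

for data_iter_step, batch in enumerate(
        metric_logger.log_every(data_loader, print_freq, header)):
    loss = compute_loss(model, batch)  # stand-in for LaVIN's forward pass
    loss_value = loss.item()
    if not math.isfinite(loss_value):
        print("NaN loss encountered. Skipping this batch.")
        optimizer.zero_grad()
        continue  # metric_logger.update() is never reached; if every batch
                  # in the first logging window is skipped, the loss meter's
                  # count stays 0 and the next str(metric_logger) call hits
                  # the ZeroDivisionError shown above
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    metric_logger.update(loss=loss_value)
```

Because the message fires from the very first step, the loss is NaN immediately rather than diverging over time, which points at the loaded weights rather than the training hyperparameters.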
Evaluation script:
```bash
torchrun --nproc_per_node 1 eval.py \
    --ckpt_dir ./data/weights/ \
    --llm_model 7B \
    --tokenizer_path ./data/weights/vicuna_7B/tokenizer.model \
    --data_root ./data \
    --caption_file ./data/captions.json \
    --adapter_path ./data/output/LaVIN-Vicuna-7B-lite/checkpoint-19.pth \
    --adapter_type attn \
    --adapter_dim 8 \
    --adapter_scale 1 \
    --prompt_format QCM-ALE \
    --max_batch_size 64 \
    --max_seq_len 512 \
    --split test \
    --n_prompt 6 \
    --temperature 10. \
    --visual_adapter_type router \
    --bits 4bit \
    --cpu_load \
    --use_vicuna=True
```
Hello, it seems you are using the Vicuna model as the pre-trained LLM. It's possible that you incorrectly loaded the full vicuna7B model instead of the corresponding delta model, which would result in a NaN loss at every step.
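For anyone landing here: Vicuna is distributed as delta weights on top of LLaMA, so whichever way LaVIN consumes them (the reply suggests pointing it at the delta model directly, with the base checkpoint supplied separately via `--llama_model_path`), the deltas themselves have to be fetched first. A minimal sketch, where both the repo id `lmsys/vicuna-7b-delta-v1.1` and the destination directory are assumptions; check LaVIN's README for the exact version and layout it expects:

```python
# Hedged sketch: download the Vicuna delta weights referred to above.
# Repo id and destination are assumptions, not confirmed by this issue.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmsys/vicuna-7b-delta-v1.1",  # assumed delta version
    local_dir="./data/weights/vicuna_7B",  # assumed target layout
)
```

If LaVIN instead expects fully merged Vicuna weights, FastChat's `fastchat.model.apply_delta` tool can merge these deltas onto the base LLaMA checkpoint. Either way, loading plain Vicuna weights where deltas are expected (or vice versa) yields garbage parameters, which is consistent with a loss that is NaN from the very first batch.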