Abnormal loss rises during finetune_task_lora.sh
Describe the issue
Issue:
I am finetuning LLaVA-1.5 13B with scripts/v1_5/finetune_task_lora.sh on my custom dataset. The training loss looks normal (~0.4) until, at some apparently random iteration (no pattern found yet), it blows up and never comes back down before the epoch ends.
My dataset consists of single-round Chinese conversations. After finetuning, I use serve.cli to check the model's output, but it produces garbled UTF-8 like:
USER: 描述这幅图像 (describe this image)
ASSISTANT: �。��见。 影像可诋:��译��影像可见:皮肤上有一个毛的人的皮肤有几个红色的红色的皮肤状块�。
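Before blaming training, I also sanity-check the dataset encoding. A minimal sketch of the check (the path is a placeholder, and the "conversations"/"from"/"value" fields assume the standard LLaVA conversation-JSON layout):

import json

DATA_PATH = "playground/data/my_dataset.json"  # placeholder; use your --data_path

with open(DATA_PATH, encoding="utf-8") as f:
    samples = json.load(f)

bad_turns = 0
for i, sample in enumerate(samples):
    for turn in sample.get("conversations", []):
        text = turn.get("value", "")
        # U+FFFD in the source text means the data was mangled before training started
        if "\ufffd" in text:
            bad_turns += 1
            print(f"sample {i} ({sample.get('id')}): replacement char in '{turn.get('from')}' turn")
print(f"{bad_turns} corrupted turns across {len(samples)} samples")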
Any suggestions? Thanks in advance.
Command:
nohup sh scripts/v1_5/finetune_task_lora.sh > finetune.log 2>&1 &
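To locate where the loss first departs in the full log, I scan it with a small script (a rough sketch; the threshold is arbitrary, and the step index simply counts logged loss lines):

import ast
import re

LOG_PATH = "finetune.log"  # produced by the command above
THRESHOLD = 3.0            # arbitrary cutoff; baseline loss sits around 0.2-0.4

loss_re = re.compile(r"\{'loss':[^}]*\}")
logged_step = 0
with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
    for line in f:
        m = loss_re.search(line)
        if not m:
            continue
        logged_step += 1
        record = ast.literal_eval(m.group())  # each logged dict is a valid Python literal
        if record["loss"] > THRESHOLD:
            print(f"first spike at logged line {logged_step}: {record}")
            break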
Log:
{'loss': 0.278, 'learning_rate': 0.00017723609012917414, 'epoch': 0.24}
24%|██▍ | 2543/10485 [3:13:38<9:51:41, 4.47s/it]
{'loss': 0.2561, 'learning_rate': 0.00017721646511498725, 'epoch': 0.24}
24%|██▍ | 2544/10485 [3:13:42<9:53:22, 4.48s/it]
{'loss': 0.2387, 'learning_rate': 0.00017719683273249262, 'epoch': 0.24}
24%|██▍ | 2545/10485 [3:13:47<9:58:11, 4.52s/it]
{'loss': 0.2537, 'learning_rate': 0.00017717719298356377, 'epoch': 0.24}
24%|██▍ | 2546/10485 [3:13:52<9:56:24, 4.51s/it]
{'loss': 0.2363, 'learning_rate': 0.00017715754587007475, 'epoch': 0.24}
24%|██▍ | 2547/10485 [3:13:56<10:02:22, 4.55s/it]
{'loss': 0.196, 'learning_rate': 0.00017713789139390035, 'epoch': 0.24}
24%|██▍ | 2548/10485 [3:14:01<10:07:30, 4.59s/it]
{'loss': 0.1796, 'learning_rate': 0.0001771182295569161, 'epoch': 0.24}
24%|██▍ | 2549/10485 [3:14:05<9:55:42, 4.50s/it]
{'loss': 0.2633, 'learning_rate': 0.0001770985603609982, 'epoch': 0.24}
24%|██▍ | 2550/10485 [3:14:10<9:53:36, 4.49s/it]
{'loss': 0.2268, 'learning_rate': 0.0001770788838080236, 'epoch': 0.24}
24%|██▍ | 2551/10485 [3:14:14<9:53:20, 4.49s/it]
{'loss': 0.2249, 'learning_rate': 0.00017705919989986985, 'epoch': 0.24}
24%|██▍ | 2552/10485 [3:14:19<9:58:08, 4.52s/it]
{'loss': 0.237, 'learning_rate': 0.00017703950863841533, 'epoch': 0.24}
24%|██▍ | 2553/10485 [3:14:23<9:55:14, 4.50s/it]
{'loss': 0.2058, 'learning_rate': 0.00017701981002553904, 'epoch': 0.24}
24%|██▍ | 2554/10485 [3:14:28<9:55:44, 4.51s/it]
{'loss': 0.231, 'learning_rate': 0.0001770001040631207, 'epoch': 0.24}
24%|██▍ | 2555/10485 [3:14:32<9:54:32, 4.50s/it]
{'loss': 0.3225, 'learning_rate': 0.00017698039075304069, 'epoch': 0.24}
24%|██▍ | 2556/10485 [3:14:36<9:41:41, 4.40s/it]
{'loss': 0.3587, 'learning_rate': 0.0001769606700971802, 'epoch': 0.24}
24%|██▍ | 2557/10485 [3:14:41<9:48:28, 4.45s/it]
{'loss': 0.2832, 'learning_rate': 0.00017694094209742104, 'epoch': 0.24}
24%|██▍ | 2558/10485 [3:14:45<9:48:32, 4.45s/it]
{'loss': 0.4236, 'learning_rate': 0.00017692120675564575, 'epoch': 0.24}
24%|██▍ | 2559/10485 [3:14:50<9:48:51, 4.46s/it]
{'loss': 0.4833, 'learning_rate': 0.00017690146407373748, 'epoch': 0.24}
24%|██▍ | 2560/10485 [3:14:55<9:59:36, 4.54s/it]
{'loss': 0.5024, 'learning_rate': 0.00017688171405358022, 'epoch': 0.24}
24%|██▍ | 2561/10485 [3:14:59<9:58:30, 4.53s/it]
{'loss': 0.6037, 'learning_rate': 0.0001768619566970586, 'epoch': 0.24}
24%|██▍ | 2562/10485 [3:15:04<10:00:41, 4.55s/it]
{'loss': 0.9337, 'learning_rate': 0.00017684219200605792, 'epoch': 0.24}
24%|██▍ | 2563/10485 [3:15:08<9:57:45, 4.53s/it]
{'loss': 0.6623, 'learning_rate': 0.00017682241998246423, 'epoch': 0.24}
24%|██▍ | 2564/10485 [3:15:13<10:06:35, 4.59s/it]
{'loss': 0.8022, 'learning_rate': 0.0001768026406281642, 'epoch': 0.24}
24%|██▍ | 2565/10485 [3:15:17<10:03:00, 4.57s/it]
{'loss': 4.2653, 'learning_rate': 0.00017678285394504535, 'epoch': 0.24}
24%|██▍ | 2566/10485 [3:15:22<10:03:32, 4.57s/it]
{'loss': 0.77, 'learning_rate': 0.00017676305993499574, 'epoch': 0.24}
24%|██▍ | 2567/10485 [3:15:26<9:59:39, 4.54s/it]
{'loss': 0.556, 'learning_rate': 0.00017674325859990422, 'epoch': 0.24}
24%|██▍ | 2568/10485 [3:15:31<10:01:29, 4.56s/it]
{'loss': 0.8223, 'learning_rate': 0.00017672344994166031, 'epoch': 0.25}
25%|██▍ | 2569/10485 [3:15:36<10:08:46, 4.61s/it]
{'loss': 0.6307, 'learning_rate': 0.0001767036339621542, 'epoch': 0.25}
25%|██▍ | 2570/10485 [3:15:41<10:13:47, 4.65s/it]
{'loss': 0.7256, 'learning_rate': 0.00017668381066327687, 'epoch': 0.25}
25%|██▍ | 2571/10485 [3:15:45<10:04:22, 4.58s/it]
{'loss': 5.4546, 'learning_rate': 0.0001766639800469199, 'epoch': 0.25}
25%|██▍ | 2572/10485 [3:15:49<10:02:48, 4.57s/it]
{'loss': 7.7808, 'learning_rate': 0.0001766441421149756, 'epoch': 0.25}
25%|██▍ | 2573/10485 [3:15:54<10:08:43, 4.62s/it]
{'loss': 8.2073, 'learning_rate': 0.00017662429686933698, 'epoch': 0.25}
25%|██▍ | 2574/10485 [3:15:59<10:08:55, 4.62s/it]
{'loss': 4.118, 'learning_rate': 0.0001766044443118978, 'epoch': 0.25}
25%|██▍ | 2575/10485 [3:16:04<10:16:07, 4.67s/it]
{'loss': 4.8635, 'learning_rate': 0.00017658458444455243, 'epoch': 0.25}
25%|██▍ | 2576/10485 [3:16:08<10:21:40, 4.72s/it]
{'loss': 5.6341, 'learning_rate': 0.00017656471726919604, 'epoch': 0.25}
25%|██▍ | 2577/10485 [3:16:13<10:19:28, 4.70s/it]
{'loss': 10.266, 'learning_rate': 0.00017654484278772437, 'epoch': 0.25}
25%|██▍ | 2578/10485 [3:16:18<10:27:23, 4.76s/it]
{'loss': 8.0332, 'learning_rate': 0.00017652496100203392, 'epoch': 0.25}
25%|██▍ | 2579/10485 [3:16:23<10:20:03, 4.71s/it]
{'loss': 7.8216, 'learning_rate': 0.00017650507191402194, 'epoch': 0.25}
25%|██▍ | 2580/10485 [3:16:27<10:13:08, 4.65s/it]
{'loss': 7.7877, 'learning_rate': 0.0001764851755255863, 'epoch': 0.25}
25%|██▍ | 2581/10485 [3:16:32<10:24:19, 4.74s/it]
{'loss': 8.8278, 'learning_rate': 0.00017646527183862558, 'epoch': 0.25}
25%|██▍ | 2582/10485 [3:16:37<10:12:33, 4.65s/it]
{'loss': 8.6483, 'learning_rate': 0.0001764453608550391, 'epoch': 0.25}
25%|██▍ | 2583/10485 [3:16:41<10:05:41, 4.60s/it]
{'loss': 6.5844, 'learning_rate': 0.00017642544257672683, 'epoch': 0.25}
25%|██▍ | 2584/10485 [3:16:46<10:08:49, 4.62s/it]
{'loss': 8.1825, 'learning_rate': 0.00017640551700558944, 'epoch': 0.25}
25%|██▍ | 2585/10485 [3:16:50<10:00:06, 4.56s/it]
{'loss': 7.0871, 'learning_rate': 0.00017638558414352835, 'epoch': 0.25}
25%|██▍ | 2586/10485 [3:16:55<10:03:34, 4.58s/it]
{'loss': 6.7109, 'learning_rate': 0.0001763656439924456, 'epoch': 0.25}
25%|██▍ | 2587/10485 [3:16:59<10:02:16, 4.58s/it]
{'loss': 6.6635, 'learning_rate': 0.00017634569655424395, 'epoch': 0.25}
25%|██▍ | 2588/10485 [3:17:04<9:56:44, 4.53s/it]
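Reading the excerpt more closely, the blow-up does not look like a single bad batch: the loss drifts from ~0.2 around step 2549 to ~0.5-0.9 through the 2560s, spikes once at step 2566, briefly recovers, and then stays between roughly 4 and 10 from step 2572 onward. That pattern suggests optimizer divergence, so one thing I plan to try (untested) is a lower learning rate plus harder gradient clipping; the logged learning_rate values are consistent with the script's 2e-4 peak on a cosine schedule. In HF TrainingArguments terms the tweak would look roughly like:

from transformers import TrainingArguments

# Hypothetical values, not a verified fix; all fields below are standard
# HF TrainingArguments and map onto flags in the launch script.
args = TrainingArguments(
    output_dir="./checkpoints/llava-lora-debug",  # placeholder
    learning_rate=2e-5,       # well below the 2e-4 the log implies
    max_grad_norm=0.3,        # clip harder than the default of 1.0
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
)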