LLaVA-NeXT

'loss' is 0.0 and 'grad_norm' stays the same at every step in task LoRA finetuning

Open • Jeremy88888 opened this issue 1 year ago • 10 comments

I am training a task LoRA on "liuhaotian/llava-v1.5-13b", following the same script as in the LLaVA repo: https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_task_lora.sh

That script runs fine in the LLaVA repo (https://github.com/haotian-liu/LLaVA/tree/main). When I run it in this LLaVA-NeXT repo (with a few lines of train.py slightly modified to include the llava model), training runs but keeps showing 'loss': 0.0 and the same 'grad_norm'. Any idea?

[Screenshot of the training log: 'loss': 0.0 and an unchanged 'grad_norm' at every step]

Jeremy88888 • Oct 06 '24
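A quick first sanity check (not from the repo's scripts, just a sketch against the loaded model): confirm that the LoRA adapters are actually trainable, since a perfectly flat grad_norm is what you get when nothing requires grad.

```python
# Rough sketch (not from the repo): verify that the LoRA adapters are trainable
# after the model is wrapped with PEFT in train.py. If nothing requires grad,
# a flat grad_norm is expected because no parameter is being updated.

def report_trainable(model):
    trainable = [(name, p.numel()) for name, p in model.named_parameters() if p.requires_grad]
    total = sum(p.numel() for p in model.parameters())
    n_trainable = sum(numel for _, numel in trainable)
    print(f"{len(trainable)} trainable tensors, "
          f"{n_trainable:,} / {total:,} parameters ({100.0 * n_trainable / max(total, 1):.2f}%)")

# report_trainable(model)  # `model` here is assumed to be the PEFT-wrapped LLaVA model
```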

Hi, I got the same issue. Did you manage to resolve this?

refine-counting • Oct 29 '24

Hi, I got the same issue. What kind of GPU are you using?

paperman-anne • Oct 29 '24

For me it was a V100, @paperman-anne, running DPO training. How about you?

refine-counting • Oct 29 '24

Have you solved it yet?

zyandtom • Jan 07 '25

I got the same issue. Have you solved it yet?

wuxi-dixi • Feb 17 '25

I got the same issue. Have you solved it yet?

Hello, have you solved it?

maocheno • Feb 28 '25

Probably max_length is too small, so the ground truth (GT) gets truncated?

zyandtom • Feb 28 '25
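For anyone wondering why a small max_length zeroes the loss: LLaVA-style preprocessing masks the prompt tokens with -100 and only the answer (GT) tokens contribute to the cross-entropy, so if truncation cuts the answer off, every label in the sample is ignored and the loss is exactly 0. A hedged sketch of a check (the field names are assumptions, not the repo's exact API):

```python
import torch

IGNORE_INDEX = -100  # label value that LLaVA-style preprocessing uses to mask prompt tokens

def count_fully_masked(dataset, max_samples=200):
    """Count samples whose labels are entirely IGNORE_INDEX after truncation.

    If most samples come back fully masked, no token contributes to the loss,
    so 'loss': 0.0 and an unchanging grad_norm are exactly what you would see.
    """
    fully_masked = 0
    for i in range(min(max_samples, len(dataset))):
        labels = dataset[i]["labels"]  # assumption: each item exposes a 1-D labels tensor
        if torch.all(labels == IGNORE_INDEX):
            fully_masked += 1
    return fully_masked

# count_fully_masked(train_dataset)  # train_dataset: the supervised dataset built in train.py
```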

> Probably max_length is too small, so the ground truth (GT) gets truncated?

My max_length is 512. But when I use LoRA, the following warnings appear:

/root/lanyun-tmp/env/llava_next/lib/python3.10/site-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
0%| | 0/28479 [00:00<?, ?it/s]
/root/lanyun-tmp/env/llava_next/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.

(both warnings then repeat)

maocheno • Feb 28 '25
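Those two UserWarnings are the usual gradient-checkpointing + LoRA interaction: with the base weights frozen, the inputs to each checkpointed block have requires_grad=False, so re-entrant checkpointing produces no gradients. A sketch of the common workaround in recent transformers versions (assuming the model is a standard PreTrainedModel; not a claim about this repo's exact training code):

```python
from transformers import PreTrainedModel

def enable_checkpointing_with_lora(model: PreTrainedModel) -> None:
    """Usual workaround for the 'None of the inputs have requires_grad=True' warning
    when gradient checkpointing is combined with LoRA (frozen base weights)."""
    model.enable_input_require_grads()  # hook the input embeddings so activations carry grad
    model.gradient_checkpointing_enable(
        gradient_checkpointing_kwargs={"use_reentrant": False}  # also silences the use_reentrant warning
    )
```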

> Probably max_length is too small, so the ground truth (GT) gets truncated?

Bro, you are right. When I deleted it, everything was fine. It was probably taking up too many tokens.

maocheno • Feb 28 '25

--model_max_length 32768 \ needs to be set for any SigLIP run, or else the loss is constant and grad_norm is 0.0 for any fine-tuning run, not just LoRA finetuning.

JiahuiKChen • Apr 24 '25
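In case it helps the next person, here is a rough way to verify the setting (my own sketch; the checkpoint name and prompt are just examples, and I'm assuming the usual LLaVA train.py behavior of forwarding --model_max_length to the tokenizer). The expanded image features take up many sequence positions with SigLIP-style towers, so a 512-token budget leaves little or nothing for the answer.

```python
from transformers import AutoTokenizer

# Sketch: confirm the flag actually reaches the tokenizer. In LLaVA-style train.py
# (to the best of my knowledge) --model_max_length is forwarded to the tokenizer,
# and everything longer than this gets truncated during preprocessing.
tokenizer = AutoTokenizer.from_pretrained(
    "liuhaotian/llava-v1.5-13b",   # example checkpoint from this thread
    model_max_length=32768,
    padding_side="right",
    use_fast=False,
)
print(tokenizer.model_max_length)  # expect 32768; if it still reads 512, the flag is not being applied

# Rough length check on one training conversation (text only; the expanded
# image features add many more positions on top of this):
example = "USER: <image>\nDescribe the chart in detail. ASSISTANT: ..."
print(len(tokenizer(example).input_ids), "text tokens before image expansion")
```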