
evaluation loss is NaN

Open victorzhz111 opened this issue 1 year ago • 5 comments

When fine-tuning the alpaca-lora model, I applied the LoRA modules to the attention layers {q_proj, v_proj} and got an evaluation loss of NaN. However, if I apply the LoRA modules to the attention layers {q_proj, v_proj, o_proj, k_proj}, the evaluation loss is normal. I am not sure why this is happening.
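For context, the target modules are selected via peft's LoraConfig. A minimal sketch of the two configurations being compared; the hyperparameter values shown (r, alpha, dropout) follow the alpaca-lora defaults and are only illustrative here:

    from peft import LoraConfig, get_peft_model

    # NaN eval loss was observed with LoRA on only the query/value projections
    config_qv = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        bias="none", task_type="CAUSAL_LM",
    )

    # Eval loss was reported as normal with all four attention projections
    config_qkvo = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        bias="none", task_type="CAUSAL_LM",
    )

    # model = get_peft_model(model, config_qv)   # or config_qkvo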

victorzhz111 avatar Apr 04 '23 05:04 victorzhz111

I have the same question; have you figured it out? @victorzhz111

LiuPearl1 avatar Apr 07 '23 09:04 LiuPearl1

Not yet. When I applied the LoRA modules to the 4 attention weights in the 13B model, the eval loss was NaN again. @LiuPearl1

victorzhz111 avatar Apr 10 '23 06:04 victorzhz111

> Not yet. When I applied the LoRA modules to the 4 attention weights in the 13B model, the eval loss was NaN again. @LiuPearl1

@victorzhz111 Is your GPU a V100? I found that the V100 always runs into this problem, but I don't know why.

LiuPearl1 avatar Apr 10 '23 08:04 LiuPearl1

@LiuPearl1 Yes, maybe it's because the V100 is not an Ampere-architecture GPU.
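Assuming "Ampere" is what is meant here: the V100 is Volta (compute capability 7.0), so it has no bf16 support and mixed-precision training falls back to fp16, which overflows to NaN much more easily. A quick check using standard torch.cuda calls:

    import torch

    # V100 is Volta (compute capability 7.0); bf16 needs Ampere (>= 8.0).
    major, minor = torch.cuda.get_device_capability()
    print(torch.cuda.get_device_name(), f"- compute capability {major}.{minor}")
    print("bf16 supported:", torch.cuda.is_bf16_supported())

    # Fall back to fp32 when bf16 is unavailable to avoid fp16 overflow/NaN.
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32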

victorzhz111 avatar Apr 11 '23 01:04 victorzhz111

I ran into this problem too. My GPU is a V100.

FredlinT avatar Apr 18 '23 05:04 FredlinT

I'm also using a V100. I modified the model loading as follows; note the torch dtype:

    # Load the base model in full float32 (no 8-bit quantization)
    # so fp16 overflow on the V100 is avoided.
    model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=False,
        torch_dtype=torch.float32,
        llm_int8_skip_modules=FULL_FINETUNE_MODULES,
        device_map=device_map,
        cache_dir='../huggingface'
    )

and added this, casting the PEFT-wrapped model to float32 as well:

    model = get_peft_model(model, config).to(torch.float32)

My training loss is not 0.0 so far, so I think this is the fix for the V100, but the training speed is very slow.
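If the weights are loaded in float32, it may also help to make sure fp16 mixed precision is not re-enabled in the Trainer (alpaca-lora's finetune.py passes fp16=True to TrainingArguments by default). A hedged sketch using the standard transformers.TrainingArguments API; all values shown are illustrative, not the exact ones from finetune.py:

    from transformers import TrainingArguments

    # Keep the whole run in fp32 on the V100: slower, but the eval loss
    # should no longer overflow to NaN.
    training_args = TrainingArguments(
        output_dir="./lora-alpaca",          # illustrative path
        per_device_train_batch_size=4,       # illustrative values
        gradient_accumulation_steps=8,
        learning_rate=3e-4,
        fp16=False,                          # disable fp16 autocast
        evaluation_strategy="steps",
        eval_steps=200,
        logging_steps=10,
    )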

Tamminhdiep97 avatar Jul 07 '23 11:07 Tamminhdiep97