Chong Chen
My `run_validation` parameter is set to True, but no model file is written. My configuration is as follows:

```
@dataclass
class train_config:
    model_name: str="PATH/to/LLAMA/7B"
    enable_fsdp: bool=False
    low_cpu_fsdp: bool=False
    run_validation: bool=True
    ...
```
> Hi @BugmakerCC can you check your eval loss and post the log of your training run? We've seen the eval loss turning to Inf which prevents a checkpoint from...
> Yes, your eval loss is NaN so no checkpoint gets saved:
>
> ```
> evaluating Epoch: 100%|██████████| 100/100 [01:28 eval_ppl=tensor(nan, device='cuda:0') eval_epoch_loss=tensor(nan, device='cuda:0')
> ```
>
> Your...
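To illustrate why a NaN eval loss can block saving, here is a minimal sketch. It assumes the training loop writes a checkpoint only when the eval loss improves on the best value seen so far; that rule is an assumption, not something confirmed in this thread, and `should_save` and `is_valid_loss` are hypothetical helpers rather than functions from the training code.

```python
import math


def should_save(eval_loss: float, best_loss: float) -> bool:
    """Hypothetical save rule (assumed): save only when eval loss improves."""
    return eval_loss < best_loss


def is_valid_loss(eval_loss: float) -> bool:
    """Return True only for a finite loss (neither NaN nor Inf)."""
    return math.isfinite(eval_loss)


best_loss = float("inf")  # nothing saved yet

# A normal loss improves on the initial best, so a checkpoint would be saved.
print(should_save(1.23, best_loss))

# Any comparison with NaN is False, so the save branch is never taken.
print(should_save(float("nan"), best_loss))

# Inf is not strictly smaller than the initial best either.
print(should_save(float("inf"), best_loss))

# A guard like this makes the problem visible instead of silently skipping the save.
print(is_valid_loss(float("nan")))
```

If the real loop follows a similar rule, checking the loss with something like `is_valid_loss` right after evaluation would point at the NaN instead of leaving only a missing checkpoint as the symptom.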
> It can have many reasons. Are you using the original alpaca json or a modification? Did you figure out why some weights are not initialized? I am using the original...
> > > Hi @BugmakerCC can you check your eval loss and post the log of your training run? We've seen the eval loss turning to Inf which prevents a...