
Loss goes NaN while fine-tuning llava-llama3-8b

Open liboaccn opened this issue 1 year ago • 2 comments

[Screenshot 2024-10-06 14:22:48]

While fine-tuning llava-llama3-8b, the loss turns NaN after only a few steps. What could be causing this? I've seen similar reports in the GitHub issues, and the official reply was to change the lr. These are my current settings:

# Scheduler & Optimizer
batch_size = 4  # per_device
accumulative_counts = 32*4
dataloader_num_workers = 32
max_epochs = 1
optim_type = AdamW
lr = 2e-6



param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]
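
Besides lowering the lr, the other usual suspects for NaN under fp16 are gradient blow-ups and loss-scale overflow, so the optimizer wrapper matters too. For completeness, a minimal sketch in the standard xtuner/mmengine style, assuming AmpOptimWrapper with gradient clipping and dynamic loss scaling (max_norm=1 is an assumed default, not copied from my config):

# Sketch of the mmengine-style optimizer wrapper around the settings above.
# AmpOptimWrapper / clip_grad / loss_scale='dynamic' follow the usual xtuner
# LLaVA configs; max_norm=1 is an assumed value, not from my actual config.
from mmengine.optim import AmpOptimWrapper

optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(type=optim_type, lr=lr, betas=(0.9, 0.999), weight_decay=0),
    clip_grad=dict(max_norm=1, error_if_nonfinite=False),  # clip exploding grads
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',  # dynamic scaling guards against fp16 overflow
    dtype='float16')

With loss_scale='dynamic', a step whose gradients overflow is skipped and the scale is reduced, instead of the inf propagating into the weights.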

liboaccn commented Oct 06 '24 06:10

A supplement: I also swapped the visual encoder from CLIP to SigLIP.


image_processor = dict(
    type=SiglipImageProcessor.from_pretrained,
    pretrained_model_name_or_path=visual_encoder_name_or_path,
    trust_remote_code=True)

model = dict(
    type=LLaVAModel,
    freeze_llm=True,             # with both towers frozen, only the projector trains
    freeze_visual_encoder=True,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=llm_name_or_path,
        trust_remote_code=True),
    visual_encoder=dict(
        type=SiglipVisionModel.from_pretrained,
        pretrained_model_name_or_path=visual_encoder_name_or_path))
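
In case it helps with debugging: since the NaN appeared after the CLIP-to-SigLIP swap, one quick check is whether the SigLIP tower itself overflows in half precision, independent of training. A self-contained sketch; the checkpoint name below is a placeholder for the actual visual_encoder_name_or_path:

# Standalone sanity check: run one fp16 forward through the SigLIP encoder
# and look for NaNs. The checkpoint name is a placeholder.
import numpy as np
import torch
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

visual_encoder_name_or_path = 'google/siglip-so400m-patch14-384'  # placeholder

encoder = SiglipVisionModel.from_pretrained(
    visual_encoder_name_or_path, torch_dtype=torch.float16).cuda().eval()
processor = SiglipImageProcessor.from_pretrained(visual_encoder_name_or_path)

image = Image.fromarray(np.random.randint(0, 255, (384, 384, 3), dtype=np.uint8))
pixel_values = processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(device='cuda', dtype=torch.float16)

with torch.no_grad():
    feats = encoder(pixel_values=pixel_values).last_hidden_state
print('NaN in visual features:', torch.isnan(feats).any().item())

If this already prints True, the problem is in the encoder's fp16 numerics rather than the lr, and running the visual tower in bf16 or fp32 would be the next thing to try.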

liboaccn commented Oct 06 '24 06:10


Could I ask which script you used to run training? I can't get training to start at all; it keeps throwing errors.
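
(For reference, the launch entry point documented in the xtuner README is NPROC_PER_NODE=<num_gpus> xtuner train <config> --deepspeed deepspeed_zero2, and xtuner list-cfg prints the built-in config names, but I'm not sure which config applies here.)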

Franklin-L commented Jan 25 '25 15:01