hoshi-hiyouga comments

Results 294 comments of


hoshi-hiyouga

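Fine-tuning with LoRA: after adding --resume_lora_training and --checkpoint_dir, training does not seem to resume; the epochs still start from the beginning

Only the weights are carried over, not the training progress. Please reduce the number of epochs manually.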
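Since only the adapter weights are restored, the remaining epoch count has to be worked out by hand. A minimal sketch of that arithmetic follows, assuming a hypothetical run planned for 3 epochs that was stopped after 1. The checkpoint path is a placeholder, and `--num_train_epochs` is the standard Hugging Face `TrainingArguments` option; that the training script accepts it is an assumption.

```bash
#!/bin/bash
# Hypothetical values: the original plan and how far the interrupted run got.
TOTAL_EPOCHS=3
COMPLETED_EPOCHS=1

# Only the weights come back from the checkpoint, so train for the remainder.
REMAINING_EPOCHS=$((TOTAL_EPOCHS - COMPLETED_EPOCHS))

# Print the flags to append to the training command (nothing is executed here).
# "path/to/checkpoint" is a placeholder for the saved LoRA checkpoint folder.
echo "--checkpoint_dir path/to/checkpoint --num_train_epochs ${REMAINING_EPOCHS}"
```

The script only prints the flags, so it can be checked before being pasted into a real training command.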

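Fine-tuning with LoRA: after adding --resume_lora_training and --checkpoint_dir, training does not seem to resume; the epochs still start from the beginning

@YinSonglin1997 The argument that matters is --checkpoint_dir, not the other one.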

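Single-machine multi-GPU LoRA reports CUDA out of memory

Using multiple GPUs does not reduce the memory needed on each individual card. 12 GB is a tight fit for fp16 LoRA; try enabling quantization.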
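A rough back-of-the-envelope estimate shows why 12 GB is tight. The sketch below assumes a model of roughly 6.2 billion parameters (a ChatGLM2-6B-sized model, which the comment does not state) and counts weight storage only, ignoring activations, LoRA gradients and optimizer state.

```bash
#!/bin/bash
# Assumption: about 6.2 billion parameters, expressed in millions.
PARAMS_MILLIONS=6200

# Weight memory in MB (10^6 bytes) = millions of parameters * bits per parameter / 8.
weights_mb() { echo $(( PARAMS_MILLIONS * $1 / 8 )); }

FP16_MB=$(weights_mb 16)   # 16-bit weights
INT4_MB=$(weights_mb 4)    # 4-bit quantized weights

echo "fp16 weights:  ${FP16_MB} MB"
echo "4-bit weights: ${INT4_MB} MB"
```

Under these assumptions the fp16 weights alone take about 12.4 GB, which nearly fills a 12 GB card before any training state is allocated, while 4-bit weights take about 3.1 GB. With data-parallel training each GPU holds its own full copy of the model, so adding cards does not change these per-card numbers.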

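Question: after fine-tuning is interrupted (manually), is it possible to resume training from where it left off?

Point --checkpoint_dir at the folder containing the checkpoint weights.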

I tested this on my local machine and had no problems. My test parameters were:

```bash
#!/bin/bash
CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \
    --model_name_or_path chatglm2 \
    --use_v2 \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type lora \
    --lora_rank 32 \
    --output_dir out/debug_sft_v2 \
    --overwrite_cache \
    --overwrite_output_dir \
    ...
```

Error loading the model when fine-tuning GLM2 with LoRA

@happy-xlf The size of this file is clearly wrong.
