Kingsley
You can refer to this PR: https://github.com/hiyouga/LLaMA-Factory/pull/9267
This issue seems similar to #5991. In your case, the per-device batch size is set to 1, so GPU utilization differs across GPUs because the sequence lengths on each GPU are different.
My guess is that training collapsed. The loss was already very low before the collapse. Which epoch is this step in?
See https://github.com/QwenLM/Qwen3/issues/736#issuecomment-2207996348
Why enable ZeRO-3 for LoRA? It doesn't look like you are short on GPU memory.
Oh, I see. For sequences that long you do need it. If you suspect some processes are stuck, note down those problematic PIDs and use py-spy (e.g. `py-spy dump --pid <PID>`) to see what they are actually executing.
Run `sudo fuser -v /dev/nvidia*` to check whether those processes are all attached to your GPUs.
See https://github.com/huggingface/transformers/blob/51083d1bac7905aa8316b75f7897bdd4e5302044/src/transformers/models/llava_next/image_processing_llava_next.py#L726C9-L728C10

```python
return BatchFeature(
    data={"pixel_values": processed_images, "image_sizes": image_sizes}, tensor_type=return_tensors
)
```

After going through the LLaVA-NeXT image processor, the `image_sizes` key should be present in the output. Is the image input correct?
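If it helps, here is a minimal sketch to check that key directly; the image path and the `llava-hf/llava-v1.6-vicuna-7b-hf` checkpoint are placeholders for your own inputs:

```python
# a minimal sketch, assuming a local image "test.jpg" and a LLaVA-NeXT checkpoint;
# both the path and the model id are placeholders for your own inputs.
from PIL import Image
from transformers import LlavaNextImageProcessor

image_processor = LlavaNextImageProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf")
features = image_processor(images=Image.open("test.jpg"), return_tensors="pt")

# both keys should be present if the image went through the processor correctly
print(features.keys())  # expect "pixel_values" and "image_sizes"
```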
1. For a reasoning model, it is better to construct some long-CoT data for SFT; that matches the data distribution the reasoning model saw during training more closely (a sketch of such a sample follows below).
   ```
   ans: xxxxyyy
   ```
2. Judging from this loss, it is still a bit overfit: reduce the number of epochs and lower the learning rate.
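As a rough illustration only, here is a minimal sketch of what one long-CoT sample could look like in sharegpt format; the `<think>` tags, the question/answer text, and the file name are assumptions for illustration, not anything from your setup:

```python
# a minimal sketch of one long-CoT SFT sample in sharegpt format;
# the <think> tags, the question/answer text, and the file name are illustrative assumptions.
import json

sample = {
    "conversations": [
        {"from": "human", "value": "How many primes are there below 10?"},
        {
            "from": "gpt",
            "value": "<think>Check 2 through 9: the primes are 2, 3, 5 and 7, so there are 4.</think>\nans: 4",
        },
    ]
}

with open("long_cot_sft.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)
```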
This looks like a chat_template.jinja problem. Diff your template against the original model's template and see.
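If you want to compare them programmatically, here is a minimal sketch assuming both models load with `AutoTokenizer`; the two paths are placeholders for your base model and your fine-tuned output:

```python
# a minimal sketch to diff chat templates; both paths are placeholders for your own models.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("path/to/original/model").chat_template or ""
mine = AutoTokenizer.from_pretrained("path/to/your/finetuned/model").chat_template or ""

if base != mine:
    # dump both templates so you can diff them with your usual tools
    with open("base_template.jinja", "w", encoding="utf-8") as f:
        f.write(base)
    with open("mine_template.jinja", "w", encoding="utf-8") as f:
        f.write(mine)
    print("Templates differ; diff base_template.jinja against mine_template.jinja")
else:
    print("Templates are identical")
```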