MedicalGPT
Running run_rm.sh directly raises a RuntimeError about the computation graph
Describe the Question
The error message is as follows:
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
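As the traceback itself suggests, setting TORCH_DISTRIBUTED_DEBUG makes DDP print the name of the parameter that was marked ready twice, which narrows down where the graph is being reused. A minimal sketch (the run_rm.sh relaunch is commented out and assumed to be this repo's launch script):

```shell
# DETAIL prints parameter names when DDP detects a variable marked
# ready twice; INFO is a less verbose alternative.
export TORCH_DISTRIBUTED_DEBUG=DETAIL
echo "$TORCH_DISTRIBUTED_DEBUG"
# Relaunch reward-model training with the flag in the environment:
# bash run_rm.sh
```

A commonly reported trigger for this error is gradient checkpointing combined with DDP; if the module graph really is static across iterations, calling `_set_static_graph()` on the DDP-wrapped model, as the message says, is the documented workaround.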
Describe your attempts
(1) Loading the LoRA-merged model from the SFT stage raises the following error:
ValueError: weight is on the meta device, we need a value to put in on 0.
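For context, the "meta device" holds tensors that record shape and dtype but allocate no storage; accelerate raises this ValueError when it must dispatch a weight that was never materialized. A small illustration with plain PyTorch (variable names are mine):

```python
import torch

# A meta tensor records shape/dtype but has no backing storage, so it
# cannot supply real weight values at dispatch time.
w = torch.empty(3, 4, device="meta")
print(w.is_meta)  # True

# Allocating on a real device gives the tensor actual storage.
w_real = torch.empty(3, 4, device="cpu")
print(w_real.is_meta)  # False
```

In practice this error usually means the merged checkpoint is missing a weight (so the merge step needs to be rerun) or the model is being loaded with `device_map="auto"` against an incomplete checkpoint; loading the fully merged model onto a single device is a common way to avoid it.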
Looking forward to your reply, thank you!
I couldn't reproduce the problem locally, and another run on Colab also worked fine. You could try running it in the Colab environment.
I've also added the eval function for the RM model: https://github.com/shibing624/MedicalGPT/commit/b1e0de4b203112b73300e569d24a47ece7d9eef0
Running the RM model requires the latest versions of these packages:
loguru transformers>=4.30.1 datasets tensorboard tqdm>=4.47.0 peft>=0.3.0 accelerate>=0.20.3
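The list above maps onto a single pip command (a sketch; adjust for your environment, e.g. pip3 or a virtualenv, with the version pins copied from the list):

```shell
# Quote the version specifiers so the shell does not interpret ">".
pip install -U loguru "transformers>=4.30.1" datasets tensorboard \
    "tqdm>=4.47.0" "peft>=0.3.0" "accelerate>=0.20.3"
```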
It doesn't work.
Try debugging it on Colab.