MedicalGPT
Running run_rm.sh directly raises a RuntimeError about the computation graph
Describe the Question
The error message is as follows:
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
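As the traceback itself suggests, setting TORCH_DISTRIBUTED_DEBUG makes DDP print the name of the parameter that was marked ready twice, which narrows down where the graph is being reused. A minimal sketch (the run_rm.sh relaunch is commented out and assumed to be this repo's launch script):

```shell
# DETAIL prints parameter names when DDP detects a variable marked
# ready twice; INFO is a less verbose alternative.
export TORCH_DISTRIBUTED_DEBUG=DETAIL
echo "$TORCH_DISTRIBUTED_DEBUG"
# Relaunch reward-model training with the flag in the environment:
# bash run_rm.sh
```

A commonly reported trigger for this error is gradient checkpointing combined with DDP; if the module graph really is static across iterations, calling `_set_static_graph()` on the DDP-wrapped model, as the message says, is the documented workaround.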
Describe your attempts
(1) Loading the LoRA-merged model from the SFT stage raises the following error:
ValueError: weight is on the meta device, we need a value to put in on 0.
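For context, the "meta device" holds tensors that record shape and dtype but allocate no storage; accelerate raises this ValueError when it must dispatch a weight that was never materialized. A small illustration with plain PyTorch (variable names are mine):

```python
import torch

# A meta tensor records shape/dtype but has no backing storage, so it
# cannot supply real weight values at dispatch time.
w = torch.empty(3, 4, device="meta")
print(w.is_meta)  # True

# Allocating on a real device gives the tensor actual storage.
w_real = torch.empty(3, 4, device="cpu")
print(w_real.is_meta)  # False
```

In practice this error usually means the merged checkpoint is missing a weight (so the merge step needs to be rerun) or the model is being loaded with `device_map="auto"` against an incomplete checkpoint; loading the fully merged model onto a single device is a common way to avoid it.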
Looking forward to your reply, thank you!
I couldn't reproduce the problem locally, and another run on Colab also worked fine. You could try running it in the Colab environment.
I've also added the eval function for the RM model: https://github.com/shibing624/MedicalGPT/commit/b1e0de4b203112b73300e569d24a47ece7d9eef0
Running the RM model requires the latest versions of these packages:
loguru transformers>=4.30.1 datasets tensorboard tqdm>=4.47.0 peft>=0.3.0 accelerate>=0.20.3
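The list above maps onto a single pip command (a sketch; adjust for your environment, e.g. pip3 or a virtualenv, with the version pins copied from the list):

```shell
# Quote the version specifiers so the shell does not interpret ">".
pip install -U loguru "transformers>=4.30.1" datasets tensorboard \
    "tqdm>=4.47.0" "peft>=0.3.0" "accelerate>=0.20.3"
```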
It doesn't work.
Try debugging it on Colab.