The libfabric EFA provider is operating in a condition that could result in memory corruption or other system errors

Open kuailexiaohunzi opened this issue 1 year ago • 11 comments

When training in CT mode, the following error occurs. Does anyone know how to solve it? [image]

kuailexiaohunzi avatar Jun 05 '24 17:06 kuailexiaohunzi

Maybe the PyTorch or CUDA version is incorrect.

RICKand-MORTY avatar Jun 11 '24 07:06 RICKand-MORTY

Maybe the PyTorch or CUDA version is incorrect.

The PyTorch version is 1.13 and CUDA is 11.7, which match.

kuailexiaohunzi avatar Jun 11 '24 11:06 kuailexiaohunzi

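Is this multi-GPU training? For multi-GPU training, the per-node GPU count in dist_utils.py has to be changed to your own GPU count, and the 4 in mpiexec -n 4 on the command line also has to be replaced with your GPU count.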
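
To illustrate why the two counts need to agree, here is a minimal sketch. It assumes each MPI rank picks its GPU as the rank modulo the per-node GPU count, which is a common pattern but is not confirmed for dist_utils.py by this thread. The names GPUS_PER_NODE, device_index_for_rank and launch_fits_on_node are hypothetical.

```python
# Hypothetical sketch: mapping MPI ranks to GPU indices on one node.
# Assumption: GPUS_PER_NODE stands in for the per-node GPU count mentioned above.
GPUS_PER_NODE = 1  # set this to the number of GPUs actually present


def device_index_for_rank(rank: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Return the GPU index a given rank would use (rank modulo GPU count)."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return rank % gpus_per_node


def launch_fits_on_node(num_processes: int, gpus_per_node: int) -> bool:
    """True when every process started by mpiexec -n can get its own GPU."""
    return 1 <= num_processes <= gpus_per_node
```

With a count of 4 but only one GPU installed, ranks 1 to 3 would be pointed at devices that do not exist, which is why both values should match the hardware.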

RICKand-MORTY avatar Jun 11 '24 14:06 RICKand-MORTY

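Is this multi-GPU training? For multi-GPU training, the per-node GPU count in dist_utils.py has to be changed to your own GPU count, and the 4 in mpiexec -n 4 on the command line also has to be replaced with your GPU count.

No, it is a single GPU. I am not even using the mpiexec -n command.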

kuailexiaohunzi avatar Jun 11 '24 14:06 kuailexiaohunzi

Try adding the environment variable RDMAV_FORK_SAFE. Forking a child process directly may be blocked for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init
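
One way to try this from Python is sketched below. It assumes the variable only needs to be present in the environment before any library that initializes the RDMA stack is imported, and it uses "1" as a conventional value. The helper fork_safe_requested is hypothetical and only checks that the variable is set; it does not verify that the provider honours it.

```python
import os

# Assumption: this runs at the very top of the training entry point,
# before importing anything that may initialize libfabric or libibverbs.
os.environ.setdefault("RDMAV_FORK_SAFE", "1")


def fork_safe_requested(env=os.environ) -> bool:
    """Return True if RDMAV_FORK_SAFE is present in the given environment."""
    return "RDMAV_FORK_SAFE" in env


# Communication libraries would be imported only after this point.
# They are left out here so the sketch stays self-contained.
```

If the variable is set after those libraries have already been loaded, it may come too late to have any effect.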

RICKand-MORTY avatar Jun 11 '24 14:06 RICKand-MORTY

Try adding the environment variable RDMAV_FORK_SAFE. Forking a child process directly may be blocked for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init

OK, I will try it later.

kuailexiaohunzi avatar Jun 11 '24 14:06 kuailexiaohunzi

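Try adding the environment variable RDMAV_FORK_SAFE. Forking a child process directly may be blocked for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init

I added it in the cm.train file, but it still does not work. It reports the same error.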

kuailexiaohunzi avatar Jun 13 '24 15:06 kuailexiaohunzi

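Try adding the environment variable RDMAV_FORK_SAFE. Forking a child process directly may be blocked for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init

I added it in the cm.train file, but it still does not work. It reports the same error.

Add it in /etc/profile, as a system-wide environment variable.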

RICKand-MORTY avatar Jun 13 '24 15:06 RICKand-MORTY

Oh, OK.

kuailexiaohunzi avatar Jun 13 '24 15:06 kuailexiaohunzi

Add it in /etc/profile, as a system-wide environment variable.

Remember to run source on it after saving so the change takes effect.

RICKand-MORTY avatar Jun 13 '24 15:06 RICKand-MORTY

OK, thanks.

kuailexiaohunzi avatar Jun 13 '24 15:06 kuailexiaohunzi