
[Question]: Error when training AquilaChat-7B

Open Maxhyl opened this issue 2 years ago • 12 comments

Description

[INFO] bmtrain_mgpu.sh: hostfile configfile model_name exp_name exp_version
bmtrain_mgpu.sh: line 35: ifconfig: command not found
/home/edcuser/.conda/envs/pytorch_cuda117/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use-env is set by default in torchrun. If your script expects --local-rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

warnings.warn(
usage: launch.py [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE] [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT] [--rdzv-id RDZV_ID] [--rdzv-conf RDZV_CONF] [--standalone] [--max-restarts MAX_RESTARTS] [--monitor-interval MONITOR_INTERVAL] [--start-method {spawn,fork,forkserver}] [--role ROLE] [-m] [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS] [-t TEE] [--node-rank NODE_RANK] [--master-addr MASTER_ADDR] [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR] [--use-env] training_script ...
launch.py: error: argument --node-rank/--node_rank: expected one argument
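The last line is the actual failure: launch.py receives --node_rank with no value. A plausible cause (an assumption, not confirmed from the script itself) is that bmtrain_mgpu.sh derives the node address and rank from the output of ifconfig; with ifconfig missing from the PATH, the shell variable expands to nothing and the launcher is invoked with a bare --node_rank flag. A minimal Python sketch of why argparse then emits exactly this message:

import argparse

# Sketch (assumption): torch.distributed.launch defines a --node-rank/--node_rank
# option that requires a value. If the shell variable feeding it is empty (e.g.
# because the ifconfig lookup on line 35 of bmtrain_mgpu.sh failed), the flag
# arrives with no value and argparse rejects it.
parser = argparse.ArgumentParser(prog="launch.py")
parser.add_argument("--node-rank", "--node_rank", type=int, default=0)

print(parser.parse_args(["--node_rank", "0"]))   # OK: a value follows the flag

try:
    parser.parse_args(["--node_rank"])           # no value -> "expected one argument"
except SystemExit:
    pass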

Hello, this error is raised during training. How can I resolve it?

Alternatives

No response

Maxhyl avatar Jun 27 '23 08:06 Maxhyl

It may be a version issue. Try changing the --local_rank argument in env_args to --local-rank and then reinstall.
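For reference, a minimal sketch of what that change amounts to (hypothetical names; the actual flagai env_args code may differ). Registering the option under both spellings tolerates launchers that pass either --local-rank (torchrun / newer torch.distributed.launch) or --local_rank (older torch):

import argparse

# Hypothetical env_args-style definition, not the real flagai code.
# Both spellings map to the same dest, so either launcher convention works.
parser = argparse.ArgumentParser()
parser.add_argument("--local-rank", "--local_rank", type=int, default=0,
                    dest="local_rank", help="local rank passed by the launcher")
args, _ = parser.parse_known_args()
print(args.local_rank)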

Anhforth avatar Jun 27 '23 10:06 Anhforth

It may be a version issue. Try changing the --local_rank argument in env_args to --local-rank and then reinstall.

This is the output after making that change.

Maxhyl avatar Jun 27 '23 10:06 Maxhyl

ifconfig: command not found

ftgreat avatar Jun 27 '23 12:06 ftgreat

Sorry, I closed this by mistake. Please reopen the issue.

ftgreat avatar Jun 27 '23 12:06 ftgreat

Sorry, I closed this by mistake. Please reopen the issue.

ifconfig: command not found

When I run that command on its own it works fine.

Maxhyl avatar Jun 27 '23 13:06 Maxhyl

Which version of torch are you using? You could try switching to 1.13.0.

Anhforth avatar Jun 28 '23 02:06 Anhforth

Which version of torch are you using? You could try switching to 1.13.0.

It's 2.0.

Maxhyl avatar Jun 28 '23 02:06 Maxhyl

Can you log in to the machine itself without a password, i.e. does ssh localhost work?

BAAI-OpenPlatform avatar Jun 28 '23 03:06 BAAI-OpenPlatform

Can you log in to the machine itself without a password, i.e. does ssh localhost work?

Yes, it works. Running the ifconfig command on its own also succeeds.

Maxhyl avatar Jun 28 '23 07:06 Maxhyl

I ran into the same problem. Has it been resolved?

FURYFOR avatar Jul 04 '23 08:07 FURYFOR

A temporary workaround is to change local_rank to local-rank in flagai/env_args.

BAAI-OpenPlatform avatar Jul 07 '23 08:07 BAAI-OpenPlatform

Adding the full path to the ifconfig command fixed the "ifconfig: command not found" error, but modifying the env_args file still does not resolve the local-rank problem. I tried flagai versions 1.71, 1.72, and 1.73, and none of them fixed it.

Maxhyl avatar Jul 12 '23 02:07 Maxhyl

Adding the full path to the ifconfig command fixed the "ifconfig: command not found" error, but modifying the env_args file still does not resolve the local-rank problem. I tried flagai versions 1.71, 1.72, and 1.73, and none of them fixed it.

After modifying env_args, did you run pip uninstall flagai (making sure it is fully removed) and then reinstall flagai with python setup.py install from the repo root?

Anhforth avatar Jul 12 '23 02:07 Anhforth

Adding the full path to the ifconfig command fixed the "ifconfig: command not found" error, but modifying the env_args file still does not resolve the local-rank problem. I tried flagai versions 1.71, 1.72, and 1.73, and none of them fixed it.

After modifying env_args, did you run pip uninstall flagai (making sure it is fully removed) and then reinstall flagai with python setup.py install from the repo root?

Yes, I uninstalled and reinstalled it every time.

Maxhyl avatar Jul 12 '23 02:07 Maxhyl

Adding the full path to the ifconfig command fixed the "ifconfig: command not found" error, but modifying the env_args file still does not resolve the local-rank problem. I tried flagai versions 1.71, 1.72, and 1.73, and none of them fixed it.

After modifying env_args, did you run pip uninstall flagai (making sure it is fully removed) and then reinstall flagai with python setup.py install from the repo root?

Yes, I uninstalled and reinstalled it every time.

Are you free for a Tencent Meeting this afternoon to go over it together? This error seems fairly common, but I haven't been able to reproduce it.

Anhforth avatar Jul 12 '23 02:07 Anhforth

Adding the full path to the ifconfig command fixed the "ifconfig: command not found" error, but modifying the env_args file still does not resolve the local-rank problem. I tried flagai versions 1.71, 1.72, and 1.73, and none of them fixed it.

After modifying env_args, did you run pip uninstall flagai (making sure it is fully removed) and then reinstall flagai with python setup.py install from the repo root?

Yes, I uninstalled and reinstalled it every time.

Are you free for a Tencent Meeting this afternoon to go over it together? This error seems fairly common, but I haven't been able to reproduce it.

Sure. How can I contact you?

Maxhyl avatar Jul 12 '23 02:07 Maxhyl

I'll just send you the Tencent Meeting ID. By the way, is the torch you're using 2.0 or above?

Anhforth avatar Jul 12 '23 06:07 Anhforth

I'll just send you the Tencent Meeting ID. By the way, is the torch you're using 2.0 or above?

OK. torch is 2.0.0.

Maxhyl avatar Jul 12 '23 06:07 Maxhyl

How about first downgrading below 2.0 to see whether the problem still occurs?

Anhforth avatar Jul 12 '23 06:07 Anhforth

How about first downgrading below 2.0 to see whether the problem still occurs?

I tried 1.13.1 and hit the same problem.

Maxhyl avatar Jul 12 '23 06:07 Maxhyl

严照东 invites you to a Tencent Meeting. Topic: 严照东's quick meeting. Time: 2023/07/12 14:53-15:53 (GMT+08:00) China Standard Time - Beijing

Click the link to join the meeting: https://meeting.tencent.com/dm/REAgCOoGTGXi

#Tencent Meeting ID: 359-276-072

Copy this information and open the Tencent Meeting app on your phone to join.

Anhforth avatar Jul 12 '23 06:07 Anhforth

Shall we start at 3 o'clock?

Anhforth avatar Jul 12 '23 06:07 Anhforth

Shall we start at 3 o'clock?

I've joined.

Maxhyl avatar Jul 12 '23 06:07 Maxhyl

I'm seeing the same error. Did the meeting turn up a good solution? Environment: Ubuntu 20.04, conda + Python 3.9.16, conda + Torch 1.10.1 (cuda112py39h4de5995_0), conda + CUDA 11.2, conda + cuDNN 8201.

Alex98773 avatar Jul 12 '23 08:07 Alex98773

That problem is now solved, but another one has appeared. I changed local_rank to local-rank in env_args.py, added import sys; sys.path.append("<path to Flagai-master>") at the top of aquila_chat.py, and modified bmtrain_mgpu.sh so that NODE_ADDR is hard-coded to the node's IP and MASTER_PORT is not set to a port that is already in use (otherwise it reports the port as occupied). After that, loading the model fails with:

TypeError: load_all() missing 1 required positional argument: 'Loader'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 189110) of binary: /home/edcuser/.conda/envs/pytorch_cuda117/bin/python
Traceback (most recent call last):
  File "/home/edcuser/.conda/envs/pytorch_cuda117/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/edcuser/.conda/envs/pytorch_cuda117/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/edcuser/.conda/envs/pytorch_cuda117/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/edcuser/.conda/envs/pytorch_cuda117/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/edcuser/.conda/envs/pytorch_cuda117/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/edcuser/.conda/envs/pytorch_cuda117/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/edcuser/.conda/envs/pytorch_cuda117/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/edcuser/.conda/envs/pytorch_cuda117/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
aquila_chat.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-14_17:49:57
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 189110)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
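This TypeError is characteristic of PyYAML 6, where yaml.load_all() no longer has a default Loader, so any plain yaml.load_all(stream) call in the config-loading path fails this way; pinning pyyaml below 6 or passing a Loader explicitly avoids it. A minimal sketch (config.yaml is a placeholder file name):

import yaml

# In PyYAML >= 6.0, yaml.load_all(stream) raises
#   TypeError: load_all() missing 1 required positional argument: 'Loader'
# Passing a Loader explicitly (or using the safe_* helpers) avoids it.
with open("config.yaml") as f:                      # placeholder file name
    docs = list(yaml.load_all(f, Loader=yaml.SafeLoader))
    # equivalently: docs = list(yaml.safe_load_all(f))
print(docs)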

Maxhyl avatar Jul 14 '23 01:07 Maxhyl