
[Bug] I downloaded the Llama-7b model from Hugging Face to a local directory and changed the path to point at it. Running the code then produced the following error. How can I solve it?

Open Rain19981998 opened this issue 2 years ago • 5 comments

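Prerequisite

  • [X] I have searched Issues and Discussions but did not get the expected help.
  • [X] The bug has not been fixed in the latest version.

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

python

Reproduces the problem - code/configuration sample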

python run.py --datasets ceval_ppl --hf-path /root/pruning/llama-7b --tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True --model-kwargs device_map='auto' --max-seq-len 2048 --max-out-len 100 --batch-size 64 --num-gpus 1

Reproduces the problem - command or script

python run.py --datasets ceval_ppl --hf-path /root/pruning/llama-7b --tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True --model-kwargs device_map='auto' --max-seq-len 2048 --max-out-len 100 --batch-size 1 --num-gpus 1

Reproduces the problem - error message

[Screenshot: WX20231025-103210@2x]

Other information

No response

Rain19981998 • Oct 25 '23 02:10

Please show us the content of outputs/blabla/logs/infer/blabla/blabla.out and outputs/blabla/logs/eval/blabla/blabla.out

Leymore • Oct 25 '23 02:10

GPU: A100 40GB

Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 626117) of binary: /root/.local/conda/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/root/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/root/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/root/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Rain19981998 • Oct 25 '23 08:10
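
A note on reading the log above: the worker is reported with exitcode: -9. By the usual Python process convention, a negative exit code -N means the child was terminated by signal N, so -9 corresponds to signal 9 (SIGKILL on Linux). A SIGKILL while checkpoint shards are loading is often the kernel's out-of-memory killer reacting to exhausted host RAM, but the log alone does not confirm that cause. The sketch below only decodes such an exit code; the helper name describe_exit_code is hypothetical and not part of OpenCompass or PyTorch.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Return a readable description of a child-process exit code.

    Hypothetical helper for illustration; not an OpenCompass or PyTorch API.
    """
    if exit_code < 0:
        # Negative values follow the subprocess convention:
        # -N means the process was terminated by signal N.
        try:
            name = signal.Signals(-exit_code).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"signal {-exit_code}"
        return f"terminated by {name}"
    if exit_code == 0:
        return "exited normally"
    return f"exited with status {exit_code}"


# The exit code reported in the log above.
print(describe_exit_code(-9))
```

If the decoded signal is SIGKILL, checking free host memory (for example with free -h) and the kernel log (dmesg) while the model loads is a reasonable next step.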

I have the same problem (OpenICLEval fails). Can you solve it?

SijunWang • Nov 08 '23 10:11

Please show us the content of outputs/blabla/logs/infer/blabla/blabla.out and outputs/blabla/logs/eval/blabla/blabla.out

'torchrun' is not recognized as an internal or external command, operable program or batch file.

SijunWang • Nov 08 '23 11:11

Same question. My error is /bin/sh: torchrun: command not found. From what I found, torchrun is supported in versions after torch 1.9.1. I'm on torch 2.0.1 and still get the error.

LianghuiGuo • Nov 09 '23 08:11
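
Both "command not found" reports above mean the shell could not find a torchrun executable on PATH. That can happen even when torch itself is installed, for instance if the environment's scripts directory is not on PATH. A minimal diagnostic sketch, assuming only the standard library: it looks up torchrun and, if it is missing, falls back to running the torch.distributed.run module with the current interpreter, which is the module the torchrun script wraps. The function name torchrun_command is hypothetical, and whether OpenCompass lets you substitute the launcher is not stated in this thread.

```python
import shutil
import sys


def torchrun_command() -> list:
    """Return an argv prefix that starts torch's distributed launcher.

    Hypothetical helper for diagnosis; not an OpenCompass or PyTorch API.
    """
    path = shutil.which("torchrun")
    if path is not None:
        # torchrun is visible on PATH, so the shell should find it too.
        return [path]
    # Fallback: invoke the module behind the torchrun script directly
    # with the interpreter that is running this code.
    return [sys.executable, "-m", "torch.distributed.run"]


print(torchrun_command())
```

If this prints the fallback form, torchrun is not on the PATH seen by the current interpreter, and adding that environment's bin (or Scripts on Windows) directory to PATH is worth trying.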