
storycloze_gen evaluation fails, showing torch.distributed.elastic.multiprocessing.api.SignalException [Feature]

Open dh12306 opened this issue 9 months ago • 1 comment

Describe the feature

The dataset evaluation does not complete; the error output is as follows:

 6%|▋         | 117/1871 [30:11<6:33:49, 13.47s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  6%|▋         | 118/1871 [30:23<6:22:25, 13.09s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  6%|▋         | 119/1871 [30:41<7:04:45, 14.55s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  6%|▋         | 120/1871 [30:59<7:34:04, 15.56s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  6%|▋         | 121/1871 [31:17<7:54:24, 16.27s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  7%|▋         | 122/1871 [31:34<8:08:38, 16.76s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  7%|▋         | 123/1871 [31:50<7:58:26, 16.42s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  7%|▋         | 124/1871 [32:04<7:39:55, 15.80s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  7%|▋         | 125/1871 [32:22<7:58:09, 16.43s/it]Keyword arguments {'add_special_tokens': False} not recognized.
  7%|▋         | 126/1871 [32:40<8:10:50, 16.88s/it]Keyword arguments {'add_special_tokens': False} not recognized.
[2024-05-12 13:31:48,848] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGHUP death signal, shutting down workers
[2024-05-12 13:31:48,849] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 34836 closing signal SIGHUP
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 877, in _invoke_run
    time.sleep(monitor_interval)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 34780 got signal: 1
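
The key detail is `got signal: 1`: signal 1 is SIGHUP, and the traceback only shows torch.distributed.elastic's termination handler turning that signal into a SignalException while the agent was idling in `time.sleep(monitor_interval)`. In other words, something sent a hangup signal to the torchrun launcher itself, and the workers were then shut down; the traceback does not point to a worker crash. A quick check that signal 1 maps to SIGHUP on the machine (plain shell, nothing OpenCompass-specific):

kill -l 1    # prints HUP on Linux, i.e. signal 1 = SIGHUP

The most common sender of SIGHUP to a foreground job is the controlling terminal going away, for example a dropped SSH session.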

I am loading a local model; this is the command I run. Is something going wrong with the distributed launch?

python run.py --datasets storycloze_gen --hf-path /home/ec2-user/models/Llama-2-13b-chat-hf  \
--tokenizer-path /home/ec2-user/models/Llama-2-13b-chat-hf --model-kwargs device_map='auto' \
 --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False  \
--max-out-len 100  --max-seq-len 2048 --batch-size 1 --no-batch-padding  \
--num-gpus 4  --max-workers-per-gpu 1 --accelerator hf 
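
The command itself looks like a standard single-node, 4-GPU HuggingFace run, so the SIGHUP is more likely to come from outside the job than from the distributed setup. Under the assumption that the hangup comes from the terminal or SSH session that started the job (not something the log alone can confirm), one workaround sketch is to detach the run from the terminal with nohup; the same command is reused below, and the log file name is only an example (tmux or screen would work just as well):

nohup python run.py --datasets storycloze_gen --hf-path /home/ec2-user/models/Llama-2-13b-chat-hf \
  --tokenizer-path /home/ec2-user/models/Llama-2-13b-chat-hf --model-kwargs device_map='auto' \
  --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
  --max-out-len 100 --max-seq-len 2048 --batch-size 1 --no-batch-padding \
  --num-gpus 4 --max-workers-per-gpu 1 --accelerator hf \
  > storycloze_gen.log 2>&1 &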

Will you implement it?

  • [ ] I would like to implement this feature and create a PR!

dh12306 avatar May 12 '24 14:05 dh12306

I got the same error. Did you manage to resolve it?

belle9217 avatar Sep 11 '24 06:09 belle9217