storycloze_gen test fails with torch.distributed.elastic.multiprocessing.api.SignalException [Feature]
Describe the feature
The test on this dataset does not pass; the error output is as follows:
6%|▋ | 117/1871 [30:11<6:33:49, 13.47s/it]Keyword arguments {'add_special_tokens': False} not recognized.
6%|▋ | 118/1871 [30:23<6:22:25, 13.09s/it]Keyword arguments {'add_special_tokens': False} not recognized.
6%|▋ | 119/1871 [30:41<7:04:45, 14.55s/it]Keyword arguments {'add_special_tokens': False} not recognized.
6%|▋ | 120/1871 [30:59<7:34:04, 15.56s/it]Keyword arguments {'add_special_tokens': False} not recognized.
6%|▋ | 121/1871 [31:17<7:54:24, 16.27s/it]Keyword arguments {'add_special_tokens': False} not recognized.
7%|▋ | 122/1871 [31:34<8:08:38, 16.76s/it]Keyword arguments {'add_special_tokens': False} not recognized.
7%|▋ | 123/1871 [31:50<7:58:26, 16.42s/it]Keyword arguments {'add_special_tokens': False} not recognized.
7%|▋ | 124/1871 [32:04<7:39:55, 15.80s/it]Keyword arguments {'add_special_tokens': False} not recognized.
7%|▋ | 125/1871 [32:22<7:58:09, 16.43s/it]Keyword arguments {'add_special_tokens': False} not recognized.
7%|▋ | 126/1871 [32:40<8:10:50, 16.88s/it]Keyword arguments {'add_special_tokens': False} not recognized.
[2024-05-12 13:31:48,848] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGHUP death signal, shutting down workers
[2024-05-12 13:31:48,849] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 34836 closing signal SIGHUP
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 877, in _invoke_run
    time.sleep(monitor_interval)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 34780 got signal: 1
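A side note on the repeated "Keyword arguments {'add_special_tokens': False} not recognized." lines: these are only warnings. transformers' slow (pure-Python) tokenizers log this message whenever a keyword argument reaches PreTrainedTokenizer.tokenize() without being consumed, which is the path that use_fast=False selects; the warnings do not abort the run. A minimal sketch that should reproduce the warning on its own (assuming the same local model path as the command below and a transformers version with this behavior):

# Hypothetical repro: a slow tokenizer warns about kwargs that tokenize()
# does not consume; a fast tokenizer would accept add_special_tokens here.
python -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('/home/ec2-user/models/Llama-2-13b-chat-hf', use_fast=False)
tok.tokenize('hello world', add_special_tokens=False)
"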
I am loading a local model. Here is the run command; could the distributed launch be the problem?
python run.py --datasets storycloze_gen --hf-path /home/ec2-user/models/Llama-2-13b-chat-hf \
--tokenizer-path /home/ec2-user/models/Llama-2-13b-chat-hf --model-kwargs device_map='auto' \
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
--max-out-len 100 --max-seq-len 2048 --batch-size 1 --no-batch-padding \
--num-gpus 4 --max-workers-per-gpu 1 --accelerator hf
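The traceback points elsewhere: "got signal: 1" is SIGHUP, which the elastic agent also logs as a death signal just before shutting down its worker. SIGHUP is normally delivered when the controlling terminal goes away, for example when an SSH session drops, so this looks like a lost shell session rather than a failure in the distributed setup itself. A minimal workaround sketch, assuming the job is launched from an interactive SSH session, is to detach it from the terminal (the log file name is just an example):

# Run under nohup (tmux or screen also work) so a dropped SSH
# connection no longer delivers SIGHUP to torchrun and its workers.
nohup python run.py --datasets storycloze_gen --hf-path /home/ec2-user/models/Llama-2-13b-chat-hf \
    --tokenizer-path /home/ec2-user/models/Llama-2-13b-chat-hf --model-kwargs device_map='auto' \
    --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
    --max-out-len 100 --max-seq-len 2048 --batch-size 1 --no-batch-padding \
    --num-gpus 4 --max-workers-per-gpu 1 --accelerator hf > eval.log 2>&1 &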
Will you implement it?
- [ ] I would like to implement this feature and create a PR!
I got the same error. Did you resolve it?