EVA
I used Docker and followed the instructions. Why does this happen?
Run torch.cuda.device_count(). If the result is 0, your torch version may be incompatible with your GPU.
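A minimal sketch of that check, extended with a few related facts that usually help narrow down a result of 0. The helper name `cuda_report` is made up for illustration, and the script assumes nothing beyond a standard PyTorch install; it also reports the case where torch is not importable at all.

```python
import importlib.util


def cuda_report():
    """Collect basic facts about the local PyTorch/CUDA setup (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        # torch is not installed in this interpreter, so nothing else can be checked
        return {"torch_installed": False}

    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version torch was built against; None for CPU-only builds
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


print(cuda_report())
```

If `built_with_cuda` is `None`, a CPU-only build of torch is installed, which would also give a device count of 0 regardless of the GPU.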
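I did not use Docker; I set up the environment on my local machine following the instructions. torch.cuda.device_count() returns 1, so GPU compatibility is not the problem. However, I now get the following error: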
E:\eva\EVA-main\src>torchrun --master_port 1234 --nproc_per_node 1 E:/eva/EVA-main/src/eva_interactive.py --model-config E:/eva/EVA-main/src/configs/model/eva2.0_model_config.json --model-parallel-size 1 --load E:/eva/EVA-main/checkpoints/eva2.0 --no_load_strict --distributed-backend nccl --weight-decay 1e-2 --clip-grad 1.0 --tokenizer-path E:/eva/EVA-main/bpe_dialog_new --temperature 0.9 --top_k 0 --top_p 0.9 --num-beams 4 --length-penalty 1.6 --repetition-penalty 1.6 --rule-path E:/eva/EVA-main/rules --fp16 --deepspeed --deepspeed_config E:/eva/EVA-main/src/configs/deepspeed/eva_ds_config.json
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:1234 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:1234 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "E:/eva/EVA-main/src/eva_interactive.py", line 8, in <module>
    from arguments import get_args
  File "E:\eva\EVA-main\src\arguments.py", line 8, in <module>
    import deepspeed
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\deepspeed\__init__.py", line 9, in <module>
    from .runtime.engine import DeepSpeedEngine
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\deepspeed\runtime\engine.py", line 92, in <module>
    class DeepSpeedEngine(Module):
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\deepspeed\runtime\engine.py", line 240, in DeepSpeedEngine
    base=os.path.join(os.environ["HOME"],
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\os.py", line 681, in __getitem__
    raise KeyError(key) from None
KeyError: 'HOME'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 41876) of binary: c:\users\administrator\appdata\local\programs\python\python37\python.exe
Traceback (most recent call last):
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\Scripts\torchrun.exe\__main__.py", line 9, in <module>
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\torch\distributed\run.py", line 761, in main
    run(args)
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\torch\distributed\run.py", line 755, in run
    )(*cmd_args)
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\torch\distributed\launcher\api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\torch\distributed\launcher\api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
E:/eva/EVA-main/src/eva_interactive.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time : 2022-08-10_18:23:59
  host : DESKTOP-F6LVUH0
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 41876)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
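This is most likely related to your operating system. For example, your log shows KeyError: 'HOME', which means the HOME environment variable is not set on your Windows system. Please switch to a different operating system, or try our huggingface version: https://github.com/thu-coai/EVA/tree/huggingface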
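For readers who want to see what the KeyError amounts to: the traceback shows deepspeed reading `os.environ["HOME"]` at import time, and Windows normally defines `USERPROFILE` rather than `HOME`. The sketch below defines `HOME` before deepspeed is imported. The helper name `ensure_home` is hypothetical, and this is only a possible workaround for this one error, not a fix confirmed by the maintainers; other Windows-specific problems may still follow.

```python
import os


def ensure_home(env=os.environ):
    """Define HOME if it is missing (hypothetical helper, not part of EVA or deepspeed)."""
    if "HOME" not in env:
        # Prefer the Windows profile directory, else let Python resolve the home directory
        env["HOME"] = env.get("USERPROFILE") or os.path.expanduser("~")
    return env["HOME"]


# Must run before `import deepspeed`, because the lookup happens when
# the DeepSpeedEngine class body is evaluated during import.
print(ensure_home())
```

Setting the variable in the shell before launching torchrun (`set HOME=%USERPROFILE%` in cmd) should have the same effect, under the same caveat.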
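With the huggingface version there are no other errors, but it now reports that GPU memory is insufficient. I have an RTX 2060 with 6 GB of memory. What is the minimum amount of GPU memory needed to run this?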
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 6.00 GiB total capacity; 5.32 GiB already allocated; 0 bytes free; 5.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Interactive inference needs about 8 GB of GPU memory.
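A rough way to compare a card against that figure before loading the model. The 8 GiB threshold simply restates the approximate number above, and both function names are made up for illustration. Cards sold as "8 GB" can report slightly less than 8 GiB, so treat the result as a hint rather than a guarantee.

```python
import importlib.util

GIB = 1024 ** 3


def has_enough_memory(total_bytes, required_gib=8.0):
    """Compare total GPU memory with an approximate requirement (assumed ~8 GiB)."""
    return total_bytes >= required_gib * GIB


def query_total_bytes(device=0):
    """Return total memory of a CUDA device in bytes, or None if it cannot be queried."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(device).total_memory


total = query_total_bytes()
if total is None:
    print("No CUDA device visible to PyTorch.")
else:
    print(f"{total / GIB:.2f} GiB total, enough: {has_enough_memory(total)}")
```

For the 6.00 GiB card in the error message above, `has_enough_memory(6 * GIB)` is `False`, which matches the out-of-memory failure.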
getattr(torch.optim,optimizer_name,None) is not None
Did you fix the deepspeed bug as described in the readme?
Hello, I would like to ask whether this can run normally on Windows.
We have not tested on Windows. However, it should run if you use docker.