
I used Docker and followed the instructions. Why does this happen?

Open vtchg002 opened this issue 2 years ago • 6 comments

[Screenshot: 微信截图_20220809193609]

vtchg002 avatar Aug 09 '22 11:08 vtchg002

Run torch.cuda.device_count(). If the result is 0, your torch version may be incompatible with your GPU.

Jiaxin-Wen avatar Aug 09 '22 12:08 Jiaxin-Wen
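
The check suggested above can be scripted. A minimal sketch, assuming PyTorch may or may not be installed in the environment being tested; the helper name `cuda_report` is hypothetical and not part of EVA:

```python
import importlib.util


def cuda_report():
    """Return a small dict describing what PyTorch can see (hypothetical helper)."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this interpreter at all.
        return {"torch_installed": False, "device_count": 0}
    import torch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

A device_count of 0 together with built_with_cuda of None points to a CPU-only PyTorch build rather than a driver problem.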

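I did not use Docker this time. I set up the environment on my local machine following the instructions, and torch.cuda.device_count() returns 1, so GPU compatibility is fine. But now I get this error: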

E:\eva\EVA-main\src>torchrun --master_port 1234 --nproc_per_node 1 E:/eva/EVA-main/src/eva_interactive.py --model-config E:/eva/EVA-main/src/configs/model/eva2.0_model_config.json --model-parallel-size 1 --load E:/eva/EVA-main/checkpoints/eva2.0 -- no_load_strict --distributed-backend nccl --weight-decay 1e-2 --clip-grad 1.0 --tokenizer-path E:/eva/EVA-main/bpe_dialog_new --temperature 0.9 --top_k 0 --top_p 0.9 --num-beams 4 --length-penalty 1.6 --repetition-penalty 1.6 --rule-path E:/eva/EVA -main/rules --fp16 --deepspeed --deepspeed_config E:/eva/EVA-main/src/configs/deepspeed/eva_ds_config.json
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:1234 (system error: 10049 - The requested address is not valid in its context.).
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:558] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:1234 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "E:/eva/EVA-main/src/eva_interactive.py", line 8, in <module>
    from arguments import get_args
  File "E:\eva\EVA-main\src\arguments.py", line 8, in <module>
    import deepspeed
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\deepspeed\__init__.py", line 9, in <module>
    from .runtime.engine import DeepSpeedEngine
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\deepspeed\runtime\engine.py", line 92, in <module>
    class DeepSpeedEngine(Module):
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\deepspeed\runtime\engine.py", line 240, in DeepSpeedEngine
    base=os.path.join(os.environ["HOME"],
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\os.py", line 681, in __getitem__
    raise KeyError(key) from None
KeyError: 'HOME'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 41876) of binary: c:\users\administrator\appdata\local\programs\python\python37\python.exe
Traceback (most recent call last):
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\Scripts\torchrun.exe\__main__.py", line 9, in <module>
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\torch\distributed\run.py", line 761, in main
    run(args)
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\torch\distributed\run.py", line 755, in run
    )(*cmd_args)
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\torch\distributed\launcher\api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\torch\distributed\launcher\api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

E:/eva/EVA-main/src/eva_interactive.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time : 2022-08-10_18:23:59
  host : DESKTOP-F6LVUH0
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 41876)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

vtchg002 avatar Aug 10 '22 10:08 vtchg002

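This is most likely related to your operating system. For example, your log contains KeyError: 'HOME', which means your Windows system does not define this environment variable. Please switch to a different operating system, or try our huggingface version: https://github.com/thu-coai/EVA/tree/huggingface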

t1101675 avatar Aug 10 '22 10:08 t1101675
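
The traceback shows DeepSpeed reading os.environ["HOME"] at import time, and that variable is normally unset on Windows. Below is an untested workaround sketch, not something the maintainers recommend: define HOME before `import deepspeed` runs. The helper `ensure_home` is hypothetical, and it only gets past this particular KeyError; other Windows incompatibilities may remain.

```python
import os


def ensure_home(env=os.environ):
    """Set HOME if it is missing, falling back to the user's home directory."""
    if "HOME" not in env:
        # On Windows the home directory is usually exposed as USERPROFILE.
        env["HOME"] = env.get("USERPROFILE") or os.path.expanduser("~")
    return env["HOME"]


if __name__ == "__main__":
    # This must run before `import deepspeed` for the lookup in engine.py to succeed.
    print(ensure_home())
```

Passing a plain dict instead of os.environ makes the helper easy to try without touching the real environment.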

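With the huggingface version there are no other errors, but it reports that GPU memory is insufficient. I have an RTX 2060 with 6 GB of memory. What is the minimum amount of GPU memory needed to run this?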

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 6.00 GiB total capacity; 5.32 GiB already allocated; 0 bytes free; 5.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

vtchg002 avatar Aug 11 '22 12:08 vtchg002

Interactive inference needs about 8 GB.

t1101675 avatar Aug 12 '22 02:08 t1101675
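
The numbers in the error message are consistent with that answer. A minimal sketch that parses the message quoted above; the helper `oom_summary` is hypothetical, and the 8 GiB threshold is only the approximate figure from the reply, treated here as an assumption:

```python
import re

OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB "
    "(GPU 0; 6.00 GiB total capacity; 5.32 GiB already allocated; "
    "0 bytes free; 5.32 GiB reserved in total by PyTorch)"
)


def oom_summary(message, required_gib=8.0):
    """Extract the GiB figures from a CUDA OOM message and compare them."""

    def gib(label):
        # Matches e.g. "6.00 GiB total capacity" and returns 6.0.
        match = re.search(r"([\d.]+) GiB " + label, message)
        return float(match.group(1))

    total = gib("total capacity")
    allocated = gib("already allocated")
    reserved = gib("reserved in total")
    return {
        "total_gib": total,
        # Reserved minus allocated: memory PyTorch holds but is not using.
        "fragmentation_slack_gib": round(reserved - allocated, 2),
        # How far the card falls short of the assumed requirement.
        "shortfall_gib": round(max(0.0, required_gib - total), 2),
    }


if __name__ == "__main__":
    print(oom_summary(OOM_MESSAGE))
```

Reserved and allocated memory are both 5.32 GiB here, so the slack is zero and tuning max_split_size_mb is unlikely to help; the card appears to be simply too small for this workload.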

getattr(torch.optim, optimizer_name, None) is not None
[Screenshot: QQ截图20220815185135]

vtchg002 avatar Aug 16 '22 10:08 vtchg002

Did you fix the deepspeed bug as described in the README?

t1101675 avatar Aug 17 '22 14:08 t1101675

Hello, I would like to ask whether this runs properly on Windows.

BaiMeiyingxue avatar Feb 05 '23 02:02 BaiMeiyingxue

We have not tested it on Windows. However, it should run if you use Docker.

t1101675 avatar Feb 05 '23 03:02 t1101675