Unable to use multiple GPUs for training under WSL
Reminder
- [X] I have read the README and searched the existing issues.
System Info
- llamafactory version: 0.9.1.dev0
- Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.11.9
- PyTorch version: 2.4.1+cu121
- Transformers version: 4.45.0
- Datasets version: 2.21.0
- Accelerate version: 0.34.2
- PEFT version: 0.12.0
- TRL version: 0.9.6
NVIDIA driver:

```
$ nvidia-smi
Fri Sep 27 17:46:20 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 560.81       CUDA Version: 12.6     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:1B:00.0 Off |                  Off |
| 30%   47C    P2            111W / 450W  |  20732MiB / 24564MiB |     32%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:1E:00.0 Off |                  Off |
| 30%   31C    P8             19W / 450W  |      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  | 00000000:89:00.0 Off |                  Off |
| 30%   31C    P8             22W / 450W  |      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        On  | 00000000:8C:00.0 Off |                  Off |
| 30%   31C    P8             18W / 450W  |     51MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```
CUDA:

```
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
```
Reproduction
The default check (without specifying a GPU) fails:

```
/mnt/d/AI-WSL/LLaMA-Factory$ llamafactory-cli version
/home/ggec/miniconda3/envs/factory/lib/python3.10/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at /opt/conda/conda-bld/pytorch_1724789115765/work/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
[2024-09-27 17:13:35,737] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-09-27 17:13:35,748] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
| Welcome to LLaMA Factory, version 0.9.1.dev0           |
|                                                        |
| Project page: https://github.com/hiyouga/LLaMA-Factory |
```
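The same detection failure can be reproduced with plain PyTorch, independent of LLaMA-Factory. A minimal check, run inside the same `factory` conda environment (whether it prints `0` or raises the same "out of memory" error under WSL2 is my assumption based on the log above):

```bash
# Without CUDA_VISIBLE_DEVICES, device enumeration appears to fail under WSL2;
# with CUDA_VISIBLE_DEVICES=0 it should report one device.
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```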
With a single GPU specified, the check passes:

```
/mnt/d/AI-WSL/LLaMA-Factory$ CUDA_VISIBLE_DEVICES=0 llamafactory-cli version
[2024-09-27 17:14:05,213] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/home/ggec/miniconda3/envs/factory/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, weight, bias=None):
/home/ggec/miniconda3/envs/factory/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
| Welcome to LLaMA Factory, version 0.9.1.dev0           |
|                                                        |
| Project page: https://github.com/hiyouga/LLaMA-Factory |
```
Training also works when a single GPU device is specified:

```
/mnt/d/AI-WSL/LLaMA-Factory$ CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
```
Expected behavior
At the moment I can only train with a single specified GPU device. How can I train with multiple GPUs?
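For reference, this is the kind of multi-GPU launch I would expect to work, sketched from the LLaMA-Factory README, where `FORCE_TORCHRUN=1` makes `llamafactory-cli` launch training through torchrun (the device list here is an assumption for my 4-GPU setup):

```bash
# Expose all four GPUs and force a torchrun-based distributed launch.
CUDA_VISIBLE_DEVICES=0,1,2,3 FORCE_TORCHRUN=1 \
  llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
```

Under WSL2 this presumably fails at the same CUDA device enumeration step shown in the Reproduction section.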
Others
No response