Issues using multi-GPU

Open littleyeson opened this issue 1 year ago • 10 comments

I have two GPUs: a 2080 Ti 22G and a V100 16G. Each card works when used alone, but when I try to use both at the same time, the following error appears. [Screenshots: multi GPU, QQ screenshot 20240403170629, QQ screenshot 20240403170646, QQ screenshot 20240403170704, QQ screenshot 20240403170734]

littleyeson avatar Apr 03 '24 09:04 littleyeson

According to ChatGPT, it might be linked to missing NCCL support in the CUDA version installed in your environment. Make sure the CUDA version you have installed supports NCCL...

https://developer.nvidia.com/nccl

It looks like you've encountered an error while trying to run a distributed training process with PyTorch. The key message here is:

RuntimeError: Distributed package doesn't have NCCL built in

NCCL (NVIDIA Collective Communications Library) is a library that supports multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs.

Here are some steps you could take to resolve the issue:

  1. Ensure NCCL is Installed: Make sure that NCCL is installed on your system. NCCL is usually bundled with the PyTorch binaries if you install PyTorch using Conda or Pip with CUDA support.

  2. Check PyTorch Installation: It might be necessary to reinstall PyTorch and ensure that you are using a version of PyTorch that is compatible with NCCL. You can use conda or pip for the installation and choose the version that includes CUDA support.

  3. Verify CUDA Version: Make sure that the CUDA version on your system is compatible with the version of NCCL and PyTorch you're using.

  4. Environment Variables: Check your environment variables related to NCCL and CUDA (NCCL_DEBUG=INFO can be used to get more detailed logs).

  5. Distributed Backend: When initializing distributed training in PyTorch with torch.distributed.init_process_group, make sure you're specifying backend='nccl' if you're using NVIDIA GPUs.

  6. Check GPU Availability: Make sure that the GPUs are available and not in use by another process. You can use the command nvidia-smi to check the status of the GPUs.

  7. Permissions: Ensure that you have the correct permissions to access the GPUs and the NCCL library.

  8. Update/Reinstall NCCL: If you have an outdated version of NCCL, updating to the latest version might solve the issue.

  9. Check PyTorch Forums/Documentation: If the error persists, check the PyTorch forums or the official documentation for similar issues or reach out for help with the specifics of your setup.

If after trying these steps the issue isn't resolved, please provide more details about your environment such as the versions of PyTorch, NCCL, and CUDA you are using, as well as the specific code snippet where you initialize the distributed process. That would help in diagnosing the problem more accurately.
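The version details requested above can be gathered with a short script. This is a minimal sketch, not part of kohya_ss: the function name `collect_environment_report` is hypothetical, and it only uses PyTorch attributes that exist in current releases (`torch.__version__`, `torch.version.cuda`, `torch.cuda.device_count()`). It falls back gracefully when PyTorch is not installed in the active environment.

```python
import platform
import sys


def collect_environment_report():
    """Return a dict of version details useful for a multi-GPU bug report."""
    report = {
        "os": platform.system(),           # e.g. "Windows" or "Linux"
        "python": sys.version.split()[0],  # interpreter version only
        "torch": None,                     # filled in below if PyTorch imports
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; report what we have.
        return report

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in collect_environment_report().items():
        print(f"{key}: {value}")
```

Run it with the same Python interpreter that kohya_ss uses (the one inside its `venv`), otherwise the report may describe a different PyTorch installation.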

bmaltais avatar Apr 03 '24 21:04 bmaltais

I don't know how to check whether NCCL is installed, so I asked ChatGPT. It told me to run this code:

    import torch
    print(torch.cuda.nccl.version())

It fails with this error:

    Traceback (most recent call last):
      File "", line 1, in
      File "C:\AI\kohya_ss\venv\lib\site-packages\torch\cuda\nccl.py", line 35, in version
        ver = torch._C._nccl_version()
    AttributeError: module 'torch._C' has no attribute '_nccl_version'

I checked, and the file "C:\AI\kohya_ss\venv\lib\site-packages\torch\cuda\nccl.py" does exist.
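The `AttributeError` suggests that the Python wrapper `nccl.py` is present but the compiled part of this PyTorch build has no NCCL symbols. A check that avoids the traceback might look like the sketch below. The helper name `nccl_status` is hypothetical; `torch.distributed.is_available()` and `torch.distributed.is_nccl_available()` are existing PyTorch functions.

```python
def nccl_status():
    """Describe NCCL support without raising on builds that lack it."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return "PyTorch is not installed"

    # is_nccl_available() is only defined when the distributed package
    # was compiled in, so check is_available() first.
    if not (dist.is_available() and dist.is_nccl_available()):
        return "this PyTorch build has no NCCL support"

    try:
        version = torch.cuda.nccl.version()
    except (AttributeError, RuntimeError) as exc:
        return f"NCCL backend reported, but version lookup failed: {exc}"
    return f"NCCL available, version {version}"


print(nccl_status())
```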

littleyeson avatar Apr 05 '24 15:04 littleyeson

https://developer.nvidia.com/nccl I have visited this website and found that NCCL may not be installable on Windows systems; it is intended for use with Linux systems. There is no corresponding installation program for Windows systems.

littleyeson avatar Apr 05 '24 16:04 littleyeson

Yeah, it will never work on Windows natively. You can get Ubuntu 22 from the Microsoft Store if you want to set it up again under Linux.
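For code written directly against `torch.distributed`, one possible workaround on Windows is to fall back to the Gloo backend when NCCL is missing. The sketch below only illustrates the selection logic: `choose_backend` and `detect_backend` are hypothetical names, and nothing in this thread confirms that the kohya_ss launcher lets you choose the backend this way. Gloo is generally expected to be slower than NCCL for GPU training.

```python
def choose_backend(nccl_available: bool, gloo_available: bool) -> str:
    """Prefer NCCL, fall back to Gloo, otherwise fail with a clear message."""
    if nccl_available:
        return "nccl"
    if gloo_available:
        return "gloo"
    raise RuntimeError("no usable torch.distributed backend in this build")


def detect_backend() -> str:
    """Query the installed PyTorch build and pick a backend name."""
    import torch.distributed as dist  # imported lazily; requires PyTorch

    if not dist.is_available():
        raise RuntimeError("this PyTorch build has no distributed support")
    return choose_backend(dist.is_nccl_available(), dist.is_gloo_available())
```

The returned string could then be passed as `backend=` to `torch.distributed.init_process_group`.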

skein12 avatar Apr 06 '24 05:04 skein12

Is this a virtualized mode? Can Ubuntu in this mode access the GPU and Linux CUDA? Will performance be much worse? I am worried the GPU cannot be fully utilized.

littleyeson avatar Apr 07 '24 02:04 littleyeson

Yes, it uses Hyper-V, so it is very much a VM. I installed Ubuntu 22 instead of Windows when I found this out and haven't looked back.

Also, I think Microsoft has CUDA working fine under their Linux VM system, but I imagine you're taking a performance hit, and you need all Linux software and everything.

At that point you might as well install the real thing on bare metal, imo.

skein12 avatar Apr 08 '24 02:04 skein12

I installed Ubuntu from the Microsoft Store, but launching it produces an error. [Screenshots: QQ screenshot 20240419023356, QQ screenshot 20240419023421]

littleyeson avatar Apr 18 '24 18:04 littleyeson

I have fixed this problem: I enabled WSL through the Windows features module and upgraded to WSL2, and Ubuntu now works. But there is a new problem: how do I use it to run the kohya training program? It can't load any Windows drives or files. Do I need to git clone another copy of kohya inside this Ubuntu system? Also, this system doesn't show any NVIDIA GPU (I tried installing the NVIDIA Linux drivers and CUDA), but still nothing. How should I use this Ubuntu (WSL)? [Screenshot: QQ screenshot 20240421111959]
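On the question of Windows files: with default settings, WSL mounts Windows drives under `/mnt/<drive letter>`, so `C:\AI\kohya_ss` would normally appear as `/mnt/c/AI/kohya_ss`. The sketch below shows that mapping. The function name `windows_to_wsl_path` is hypothetical, and the `/mnt` root is an assumption that holds only if the automount location has not been changed. Inside WSL, the built-in `wslpath` command performs the same conversion.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl_path(windows_path: str, mount_root: str = "/mnt") -> str:
    """Map a drive-letter Windows path to its default WSL mount location."""
    path = PureWindowsPath(windows_path)
    drive = path.drive  # e.g. "C:"
    if len(drive) != 2 or drive[1] != ":":
        raise ValueError(f"expected a drive-letter path, got: {windows_path!r}")
    # parts[0] is the anchor ("C:\\"); the remaining parts are folder names.
    return str(PurePosixPath(mount_root, drive[0].lower(), *path.parts[1:]))


print(windows_to_wsl_path(r"C:\AI\kohya_ss"))  # prints /mnt/c/AI/kohya_ss
```

Files under `/mnt/c` are reachable this way, but a fresh `git clone` inside the Linux home directory is usually the safer choice: the Windows `venv` cannot be reused under Linux, and file access across the Windows/Linux boundary tends to be slower.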

littleyeson avatar Apr 21 '24 03:04 littleyeson

https://github.com/bmaltais/kohya_ss/issues/2364#issue-2255162976 It seems that WSL2 cannot correctly identify the GPU model. I have installed the NVIDIA driver. Moreover, after installing kohya_ss directly and running it, the graphics card still does not seem to be recognized. I don't know whether I need to install the CUDA driver separately.
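To narrow down where GPU detection fails inside WSL2, it can help to compare what `nvidia-smi` sees with what PyTorch sees. The following is a rough sketch with a hypothetical helper name, `gpu_visibility_report`. It assumes `nvidia-smi` is on the `PATH` when the driver is exposed to WSL. NVIDIA's WSL documentation indicates that the Windows driver is what exposes the GPU to WSL2 and advises against installing a Linux display driver inside the distribution, so that is worth checking against the setup described here.

```python
import shutil
import subprocess


def gpu_visibility_report():
    """Compare GPU visibility at the driver level and at the PyTorch level."""
    report = {
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        "nvidia_smi_gpus": None,  # list of lines from `nvidia-smi -L`
        "torch_gpus": None,       # list of device names seen by PyTorch
    }

    if report["nvidia_smi_path"]:
        try:
            result = subprocess.run(
                [report["nvidia_smi_path"], "-L"],  # -L lists detected GPUs
                capture_output=True, text=True, timeout=30,
            )
            if result.returncode == 0:
                report["nvidia_smi_gpus"] = result.stdout.strip().splitlines()
        except (OSError, subprocess.TimeoutExpired):
            pass  # leave as None: the driver tool could not be executed

    try:
        import torch
    except ImportError:
        return report  # PyTorch is not installed in this environment

    if torch.cuda.is_available():
        report["torch_gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    else:
        report["torch_gpus"] = []
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

If `nvidia-smi` lists the cards but `torch_gpus` is empty, the problem is more likely in the PyTorch/CUDA installation than in the driver.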

littleyeson avatar Apr 22 '24 02:04 littleyeson

Yes, you need to install CUDA as specified in the README under the prerequisites for Linux.

bmaltais avatar Apr 22 '24 10:04 bmaltais

I have installed CUDA in Linux, but when I start the program it cannot find CUDA.

littleyeson avatar May 02 '24 11:05 littleyeson