Multi GPU Training Not Working

Open MagicalPotato0001 opened this issue 1 year ago • 11 comments

Hi, I tried training a custom YOLOv7 model on both Windows and Linux, and both attempts failed. I also tried YOLOv5 and got the same error. Any help would be appreciated :/. I was also going to try Docker, but to be honest I have no clue what I'm doing with it; I'm still new to it.

GPUs: RTX 2070, RTX 3060 Ti

OS: Windows 10, WSL: Ubuntu

Training Code:

python3 -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --data data/custom.yaml --workers 12 --device 0,1 --sync-bn --batch-size 16 --epochs 2500 --img 1280 --cfg cfg/training/yolov7-tiny.yaml --hyp data/hyp.scratch.custom.yaml --weights yolov7.pt --name yolov7

Error on Windows: image

Error on WSL Ubuntu: image

MagicalPotato0001 avatar May 11 '23 22:05 MagicalPotato0001

Did you solve it? I have the same problem and I can't solve it either.

Ruyii2 avatar May 12 '23 08:05 Ruyii2

@Ruyii2 No, not yet. I'll let you know if I figure anything out :/

MagicalPotato0001 avatar May 13 '23 20:05 MagicalPotato0001

Hi all, it's not working for me either, but running train.py directly, without -m torch.distributed.launch, does work. I don't know whether -m torch.distributed.launch makes the parallelism better, though.

egSat avatar May 30 '23 14:05 egSat

Hi all, @MagicalPotato0001 @egSat, I think it's related to your PyTorch version. If you use 2.0.0, try changing the local_rank argument to local-rank in train.py, like this:

parser.add_argument('--sync-bn', action='store_true', help='use SyncBatchNorm, only available in DDP mode')
parser.add_argument('--local-rank', type=int, default=-1, help='DDP parameter, do not modify')
parser.add_argument('--workers', type=int, default=8, help='maximum number of dataloader workers')

I read about this in https://github.com/open-mmlab/mmyolo/pull/796/files and https://github.com/XPixelGroup/BasicSR/issues/626
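
If you want the same train.py to run under both launcher generations, here is a minimal sketch of a parser that accepts either spelling. It assumes, as the linked threads describe, that the 2.x launcher passes --local-rank where 1.x passed --local_rank. The LOCAL_RANK environment fallback is meant for torchrun, which sets that variable. build_parser and resolve_local_rank are hypothetical helper names and are not part of yolov7:

```python
import argparse
import os


def build_parser():
    """Parser that accepts both --local_rank (1.x) and --local-rank (2.x)."""
    parser = argparse.ArgumentParser()
    # Both option strings store into the same destination, args.local_rank.
    parser.add_argument('--local_rank', '--local-rank', dest='local_rank',
                        type=int, default=-1,
                        help='DDP parameter, do not modify')
    return parser


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank from the CLI, else from LOCAL_RANK, else -1."""
    environ = os.environ if environ is None else environ
    # parse_known_args ignores the other training flags in this sketch.
    args, _unknown = build_parser().parse_known_args(argv)
    if args.local_rank != -1:
        return args.local_rank
    return int(environ.get('LOCAL_RANK', -1))
```

In train.py itself, the only change needed would be adding the second option string and dest to the existing add_argument call; the rest of the script can keep reading opt.local_rank.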

lnhutnam avatar Aug 04 '23 16:08 lnhutnam

@lnhutnam what does your requirements.txt file look like?

I have the torch/torchvision versions below and am getting the same error:

torch                     1.13.1
torchvision               0.14.1

nahidalam avatar Sep 14 '23 06:09 nahidalam

Hi @nahidalam, my environment is based on the requirements.txt provided in this repo. It uses PyTorch 2.0 (screenshot: ArcoLinux_2023-09-14_14-59-27).

I think this issue occurs because of the difference between PyTorch 2.x and 1.x.

If you use PyTorch 1.x, I'd suggest 1.8.1 or another LTS version.
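
To make the 1.x vs 2.x difference concrete, here is a small sketch that guesses which flag spelling the launcher will pass from a version string. The cutoff at 2.0 is my assumption based on the threads linked above, and launcher_rank_flag is a hypothetical helper name:

```python
def launcher_rank_flag(torch_version):
    """Guess which local-rank flag torch.distributed.launch passes.

    Assumption: the spelling changed from --local_rank to --local-rank
    in the 2.0 release, as suggested in this thread.
    """
    # Drop local build suffixes such as "+cu118" before parsing.
    major = int(torch_version.split('+')[0].split('.')[0])
    return '--local-rank' if major >= 2 else '--local_rank'
```

With torch installed you could call launcher_rank_flag(torch.__version__) and compare the result with the argument name declared in train.py.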

lnhutnam avatar Sep 14 '23 08:09 lnhutnam

@lnhutnam do you mind sharing your entire requirements.txt and the steps you used to build the environment for yolov7?

This is how I created my environment, which works fine for single GPU training

$ python3.8 -m venv yolov7training
$ source yolov7training/bin/activate
$ pip3.8 install -r requirements.txt

What is your CUDA version? Do I need to have cuda-toolkit installed? I only have the NVIDIA driver:

NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8
$ nvcc --version

Command 'nvcc' not found, but can be installed with:

sudo apt install nvidia-cuda-toolkit

nahidalam avatar Sep 14 '23 13:09 nahidalam

For anyone looking at this in the future, I solved my issue by making sure CUDA is on the PATH:

export PATH="/usr/local/cuda-11.8/bin:$PATH"

export LD_LIBRARY_PATH="/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH"
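
A quick way to confirm the exports took effect is a small standard-library check like the sketch below. cuda_path_report is a hypothetical helper name, and the /usr/local/cuda-11.8 root is just the install location used above; adjust it to your own:

```python
import os
import shutil


def cuda_path_report(cuda_root, environ=None):
    """Report whether cuda_root's bin/ and lib64/ are listed in the environment."""
    environ = os.environ if environ is None else environ
    path_dirs = environ.get('PATH', '').split(os.pathsep)
    lib_dirs = environ.get('LD_LIBRARY_PATH', '').split(os.pathsep)
    return {
        'bin_on_path': os.path.join(cuda_root, 'bin') in path_dirs,
        'lib64_on_ld_library_path': os.path.join(cuda_root, 'lib64') in lib_dirs,
        # Location of the nvcc found via PATH, or None if there is none.
        'nvcc': shutil.which('nvcc', path=environ.get('PATH', '')),
    }


if __name__ == '__main__':
    print(cuda_path_report('/usr/local/cuda-11.8'))
```

If nvcc comes back as None, or either flag is False, the exports probably have not been applied in the shell you are launching training from.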

nahidalam avatar Sep 14 '23 21:09 nahidalam

@lnhutnam Once I get back to working with YOLO I will try changing the torch version. I think you are right and it's probably just a version issue. :)

MagicalPotato0001 avatar Sep 15 '23 03:09 MagicalPotato0001

@nahidalam Sorry for the late reply, I was very busy yesterday. Here's my requirements.txt.

As for configuring the CUDA path: I followed some instructions from the Internet so that I can switch between CUDA 11.7 and 10.2:

# CUDA Loader
function _switch_cuda {
   v=$1
   export CUDA_HOME="/usr/local/cuda-$v"
   export LD_LIBRARY_PATH="/usr/local/cuda-$v/lib64:$LD_LIBRARY_PATH"
   export PATH="/usr/local/cuda-$v/bin:$PATH"
   nvcc --version
}

# 11.7, 10.2
_switch_cuda 10.2 # change this to the version you want to load
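
If you launch training from a Python script rather than from the shell, roughly the same switch can be sketched as a function that builds the environment for a child process. cuda_env is a hypothetical helper, and the /usr/local/cuda-<version> layout is assumed from the shell function above:

```python
import os


def _prepend(entry, existing):
    """Put entry first, without leaving a trailing ':' when existing is empty."""
    return f'{entry}:{existing}' if existing else entry


def cuda_env(version, base_env=None, prefix='/usr/local'):
    """Return a copy of the environment pointing at one CUDA install."""
    env = dict(os.environ if base_env is None else base_env)
    cuda_home = f'{prefix}/cuda-{version}'
    env['CUDA_HOME'] = cuda_home
    # Prepend, so this toolkit wins over any other one already listed.
    env['PATH'] = _prepend(f'{cuda_home}/bin', env.get('PATH', ''))
    env['LD_LIBRARY_PATH'] = _prepend(f'{cuda_home}/lib64',
                                      env.get('LD_LIBRARY_PATH', ''))
    return env
```

The returned dict can be passed as env= to subprocess.run, which leaves your current shell environment untouched.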

@MagicalPotato0001 Yes, we should check more carefully which versions of PyTorch and its related dependencies to use.

lnhutnam avatar Sep 16 '23 12:09 lnhutnam

Still facing the same issue. Any fixes for this yet?

My PyTorch version is 2.3.0, torchvision is 0.18.0, and CUDA toolkit is 12.1.

shubh-acad avatar Apr 29 '24 09:04 shubh-acad