yolov7
Multi GPU Training Not Working
Hi, I tried training a custom YOLOv7 model on Windows and on Linux, and both failed. I also tried YOLOv5 and got the same error. Any help would be appreciated :/ I was also going to try Docker, but to be honest I'm still new to it and have no clue what I'm doing there.
GPUs: RTX 2070, RTX 3060 Ti
OS: Windows 10, WSL: Ubuntu
Training Code:
python3 -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --data data/custom.yaml --workers 12 --device 0,1 --sync-bn --batch-size 16 --epochs 2500 --img 1280 --cfg cfg/training/yolov7-tiny.yaml --hyp data/hyp.scratch.custom.yaml --weights yolov7.pt --name yolov7
Error on Windows
Error on WSL Ubuntu:
Did you solve it? I have the same problem and have not been able to fix it.
@Ruyii2 No, not yet. I'll let you know if I figure anything out :/
Hi all, it's not working for me either, but running train.py directly, without -m torch.distributed.launch, does work. I don't know whether -m torch.distributed.launch gives better parallelism, though.
Hi all, @MagicalPotato0001 @egSat I think it's about your PyTorch version. If you use 2.0.0, try renaming the local_rank argument to local-rank in train.py, like this:
parser.add_argument('--sync-bn', action='store_true', help='use SyncBatchNorm, only available in DDP mode')
parser.add_argument('--local-rank', type=int, default=-1, help='DDP parameter, do not modify')
parser.add_argument('--workers', type=int, default=8, help='maximum number of dataloader workers')
I read about this in https://github.com/open-mmlab/mmyolo/pull/796/files and https://github.com/XPixelGroup/BasicSR/issues/626
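For anyone who wants train.py to work with either spelling, here is a minimal sketch, not the repo's actual code. It assumes the launcher either passes the rank as `--local_rank` / `--local-rank` or sets the `LOCAL_RANK` environment variable (as torchrun does). The helper names `build_parser` and `resolve_local_rank` are made up for illustration.

```python
import argparse
import os


def build_parser():
    """Parser fragment that accepts both spellings of the DDP rank flag."""
    parser = argparse.ArgumentParser()
    # Both option strings map to the same destination, args.local_rank.
    parser.add_argument('--local_rank', '--local-rank', dest='local_rank',
                        type=int, default=-1,
                        help='DDP parameter, do not modify')
    return parser


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank from the command line, else from LOCAL_RANK, else -1."""
    environ = os.environ if environ is None else environ
    # parse_known_args ignores the other train.py options in this sketch.
    args, _unknown = build_parser().parse_known_args(argv)
    if args.local_rank != -1:
        return args.local_rank
    # Fallback: torchrun exports LOCAL_RANK instead of passing a flag.
    return int(environ.get('LOCAL_RANK', -1))


if __name__ == '__main__':
    print('local rank:', resolve_local_rank())
```

In the real train.py you would only change the existing `parser.add_argument` line; the environment fallback is optional.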
@lnhutnam what does your requirements.txt file look like?
I have the torch/torchvision versions below and am getting the same error:
torch 1.13.1
torchvision 0.14.1
Hi @nahidalam my env configuration is based on the provided requirements.txt of this repo. It uses pytorch 2.0
I think this issue occurs because of the difference between PyTorch 2.x and 1.x.
If you use PyTorch 1.x, I think it would be good to use 1.8.1 or one of the LTS versions.
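Since the suggestions in this thread depend on which PyTorch major version is installed, it may help to print the versions before comparing notes. This is a rough sketch using only the standard library; `installed_versions` and `major_version` are illustrative names, not part of yolov7 or PyTorch.

```python
from importlib import metadata


def installed_versions(packages=('torch', 'torchvision')):
    """Map each package name to its installed version string, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def major_version(version_string):
    """Return the leading integer of a version such as '1.13.1+cu117'."""
    return int(version_string.split('.')[0])


if __name__ == '__main__':
    for name, version in installed_versions().items():
        print(f'{name}: {version or "not installed"}')
```

Run it inside the same virtual environment you train in, otherwise it reports a different interpreter's packages.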
@lnhutnam do you mind sharing your entire requirements.txt and the steps you used to build the environment for yolov7?
This is how I created my environment, which works fine for single-GPU training:
$ python3.8 -m venv yolov7training
$ source yolov7training/bin/activate
$ pip3.8 install -r requirements.txt
What is your CUDA version? Do I need to have cuda-toolkit installed? I only have the NVIDIA driver:
NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8
$ nvcc --version
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
For anyone looking at this in the future: I solved my issue by making sure CUDA is on the PATH.
export PATH="/usr/local/cuda-11.8/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH"
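To confirm that exports like these took effect in the shell you launch training from, a small standard-library check along these lines may be useful. It is only a sketch: `cuda_toolkit_report` is a made-up helper, and the test for "cuda" in the library path is a loose heuristic, not an official check.

```python
import os
import shutil


def cuda_toolkit_report(environ=None):
    """Summarize whether the CUDA toolkit is visible through environment variables."""
    environ = os.environ if environ is None else environ
    lib_dirs = [p for p in environ.get('LD_LIBRARY_PATH', '').split(os.pathsep) if p]
    return {
        # Full path to nvcc if it can be found on PATH, otherwise None.
        'nvcc': shutil.which('nvcc', path=environ.get('PATH', '')),
        # CUDA_HOME is optional; some build tools read it.
        'CUDA_HOME': environ.get('CUDA_HOME'),
        # Heuristic: does any library directory mention "cuda"?
        'cuda_on_ld_library_path': any('cuda' in p.lower() for p in lib_dirs),
    }


if __name__ == '__main__':
    for key, value in cuda_toolkit_report().items():
        print(f'{key}: {value}')
```

If `nvcc` prints as None here, the shell that starts training does not see the toolkit's bin directory.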
@lnhutnam Once I get back to working with YOLO I will try changing the torch version. I think you are right and it's probably just a version issue. :)
@nahidalam Sorry for the late reply, I was very busy yesterday. Here's my requirements.txt
As for configuring the CUDA path: I followed some instructions from the Internet so that I can switch between CUDA 11.7 and 10.2:
# CUDA loader: point CUDA_HOME, LD_LIBRARY_PATH and PATH at one installed toolkit
function _switch_cuda {
    local v="$1"
    export CUDA_HOME="/usr/local/cuda-$v"
    export LD_LIBRARY_PATH="/usr/local/cuda-$v/lib64:$LD_LIBRARY_PATH"
    export PATH="/usr/local/cuda-$v/bin:$PATH"
    nvcc --version
}
# Installed versions: 11.7, 10.2
_switch_cuda 10.2 # change this to the version you want to load
@MagicalPotato0001 Yeah, we should check the PyTorch version and its related dependencies more carefully.
Still facing the same issue. Any fixes for this yet?
My PyTorch version is 2.3.0, torchvision is 0.18.0, and the CUDA toolkit is 12.1.