
Segmentation fault (core dumped) in Validation Phase Epoch 0 with batch size larger than 6

Open Tria91 opened this issue 8 months ago • 1 comments

Hi, I really like your project and appreciate your previous help with my startup issues. To use more of the capabilities of my graphics card (NVIDIA GeForce RTX 4080 SUPER), I am currently running my Docker image with the following command:

docker run -it --rm --privileged --ipc=host -e DISPLAY=$DISPLAY -e NVIDIA_DRIVER_CAPABILITIES=all --runtime=nvidia --gpus all -v /tmp/.X11-unix:/tmp/.X11-unix yourDockerImage

I plan to train my model with a batch size of at least 64 (128 would be great) and an image size of 1024 px, so I am trying to increase the batch size beyond 6 (the value suggested in your examples). My dataset contains 1500 images (500 each for train/test/validation). However, when I start training, it fails with a segmentation fault at 0% of the validation phase:

Image

I used the following parameters:

python train.py --data dataset1500/data.yaml --batch 16 --epoch 10 --model yolo_nas_l --size 640

Do you have any suggestions for how I can get this to work?

Kind regards

Tria91 • Mar 13 '25 08:03

Hi @Tria91, can you check the output of the following command?

python3 -c "import torch;print(torch.cuda.is_available());print(torch.cuda.get_device_name())"
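
If the one-liner itself errors out, a slightly more defensive version of the same check may be easier to read. This is only a sketch: the helper name `cuda_report` is made up for illustration, and it assumes nothing beyond the standard `torch.cuda` calls. It guards `torch.cuda.get_device_name()`, which raises when no CUDA device is usable, and also reports the CUDA version PyTorch was built against.

```python
import importlib.util


def cuda_report():
    """Collect basic PyTorch/CUDA facts without raising if something is missing.

    `cuda_report` is a hypothetical helper name, not part of PyTorch or this repo.
    """
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Only query device names once CUDA is confirmed to be available.
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

Running this inside the same container as the training script shows whether PyTorch can actually see the GPU there.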

naseemap47 • May 17 '25 06:05