
Segmentation fault (core dumped) when training own data with YOLOv2 darknet

aminullah6264 opened this issue 6 years ago • 22 comments

I have prepared my data according to the instructions given at this link:

https://pjreddie.com/darknet/yolo/

I have downloaded the weights as well. I have a 12 GB Titan X GPU. When I run darknet for training it gives this error. I use the following line for training: ./darknet detector train cfg/voc.data cfg/yolo-voc.cfg darknet19_448.conv.23

Please, how can I solve this problem, and why is it happening?

[screenshot of the error output]

aminullah6264 avatar Nov 07 '17 01:11 aminullah6264

And when I try to run with Tiny YOLO it returns this: [screenshot of the error output]

aminullah6264 avatar Nov 07 '17 01:11 aminullah6264

Did you run the examples? Do they work correctly for you? If the examples are okay, then can you share your cfg/voc.data and cfg/yolo-voc.cfg?

workingforfood avatar Nov 08 '17 06:11 workingforfood

The same thing happens when I run the examples; these screenshots are the same as when I run the examples.

aminullah6264 avatar Nov 08 '17 14:11 aminullah6264

Enable the debug option in the Makefile and compile the source code again. Run darknet in gdb to be able to trace the segmentation fault. The 'run', 'backtrace', and 'where' commands will probably point to the line that raises the fault.
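
A minimal session might look like this, reusing the training command from this thread (a sketch; the stock Makefile exposes a DEBUG flag):

```sh
# Rebuild with debug symbols (assumes the stock Makefile's DEBUG flag)
sed -i 's/^DEBUG=0/DEBUG=1/' Makefile
make clean && make

# Re-run the crashing command under gdb, then inspect the crash site
gdb --args ./darknet detector train cfg/voc.data cfg/yolo-voc.cfg darknet19_448.conv.23
(gdb) run
(gdb) backtrace
```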

rperrones avatar Nov 09 '17 02:11 rperrones

Aminullah6264, show your train list file.

Dmitrivm avatar Nov 17 '17 09:11 Dmitrivm

Did you solve this problem? I also have the same problem in v3. Now I can't train my data, so I hope you can help me solve this problem. Thanks.

lo-pan avatar Mar 29 '18 02:03 lo-pan

It would be great if you showed us your cfg file.

ahsan856jalal avatar Mar 29 '18 07:03 ahsan856jalal

```
[net]
# Testing
batch=24
subdivisions=8
# Training
batch=64
subdivisions=8
width=416 height=416 channels=3 momentum=0.9 decay=0.0005 angle=0 saturation=1.5 exposure=1.5 hue=.1
learning_rate=0.001 burn_in=1000 max_batches=500200 policy=steps steps=400000,450000 scales=.1,.1

[convolutional] batch_normalize=1 filters=32 size=3 stride=1 pad=1 activation=leaky

[maxpool] size=2 stride=2

[convolutional] batch_normalize=1 filters=64 size=3 stride=1 pad=1 activation=leaky

[maxpool] size=2 stride=2

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=64 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=128 size=3 stride=1 pad=1 activation=leaky

[maxpool] size=2 stride=2

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=128 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=3 stride=1 pad=1 activation=leaky

[maxpool] size=2 stride=2

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=256 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=3 stride=1 pad=1 activation=leaky

[maxpool] size=2 stride=2

[convolutional] batch_normalize=1 filters=1024 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=1024 size=3 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=512 size=1 stride=1 pad=1 activation=leaky

[convolutional] batch_normalize=1 filters=1024 size=3 stride=1 pad=1 activation=leaky

#######

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[route] layers=-9

[convolutional] batch_normalize=1 size=1 stride=1 pad=1 filters=64 activation=leaky

[reorg] stride=2

[route] layers=-1,-4

[convolutional] batch_normalize=1 size=3 stride=1 pad=1 filters=1024 activation=leaky

[convolutional] size=1 stride=1 pad=1 filters=40 activation=linear

[region] anchors = 0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828
bias_match=1 classes=3 coords=4 num=5 softmax=1 jitter=.3 rescore=1
object_scale=5 noobject_scale=1 class_scale=1 coord_scale=1
absolute=1 thresh=.6 random=1
```

I am facing the same seg fault issue. This cfg is for 3 classes. I have a GTX 860 with 4 GB. Help, please!

dee6600 avatar May 10 '18 01:05 dee6600

I was facing the same segmentation fault issue. I tried many solutions, but the problem was not solved. Finally, I changed my label coordinates that were zero to a tiny float, and it worked. I think this will help somebody. Please ignore my bad English.

Jerry3062 avatar May 22 '18 02:05 Jerry3062

@Jerry3062 can you explain a bit more clearly what to change? What do you mean by label coordinate, and in which file is it present? Thanks!

saivineethkumar avatar Jun 14 '18 09:06 saivineethkumar

Hi saivineethkumar, in label.txt I found some coordinates that were zero (0.0). Remove those references from the file list. It's been running without any error so far.

arun-kumark avatar Jun 15 '18 12:06 arun-kumark

Change the random flag in the last line of the cfg file to 0. The core is getting dumped because the image is being resized to a very high dimension (608 in your case) after some iterations, which takes too much memory. If you want random dimensions to increase precision, maybe run the model on the CPU instead of the GPU.
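
If it helps, the flag can be flipped without opening an editor (a sketch; assumes random=1 sits on its own line, as in the stock cfg, and adjust the path to your cfg file):

```sh
# Disable multi-scale training so images are not resized up to 608x608
sed -i 's/^random=1/random=0/' cfg/yolo-voc.cfg
```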

singhnarotam1997 avatar Jun 16 '18 13:06 singhnarotam1997

@saivineethkumar The annotation file. Coordinate means (x, y, w, h) or (x1, y1, x2, y2); I forget YOLOv3's format. In my dataset, some x or y had a zero value.
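
To locate such labels, a quick scan like this may help (a sketch assuming YOLO-format label files, one "class x y w h" line each, under a hypothetical labels/ directory):

```sh
# Print every label line whose x, y, w, or h field is exactly zero
awk '$2==0 || $3==0 || $4==0 || $5==0 {print FILENAME ": " $0}' labels/*.txt
```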

Jerry3062 avatar Jun 21 '18 08:06 Jerry3062

Where is label.txt present?

Jibin-John avatar Jun 26 '18 10:06 Jibin-John

Check out my comment, if it helps: https://github.com/pjreddie/darknet/issues/174#issuecomment-445203621

sachindesh avatar Dec 07 '18 11:12 sachindesh

  1. The issue behind 'cannot load images', 'segmentation fault (core dump)', 'cannot fopen', and 'cannot open label file' is that files edited on Windows (or any operating system that uses '\r\n' instead of Unix-style '\n' line endings) are then transferred to Unix boxes (Ubuntu 16 in my case).
  2. I tried the dos2unix and "tr -d '\r' < file > file" tools on Ubuntu on the txt as well as the JPG files, but even that did not work. Solution: all editing/saving of image files, txt files, or any other files, including the marking of objects (with the yolo_mark tool), should be done only on Ubuntu or similar desktops, not on Windows or other non-Unix operating systems (a quick check for leftover '\r' characters is sketched below). Cheers!!
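
As that quick check, files that still carry Windows line endings can be listed like this (a sketch; run from the dataset root):

```sh
# List every .txt file that still contains a carriage return ('\r')
grep -rl $'\r' --include='*.txt' .
```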


This problem was solved after I changed yolov3.weights to yolov3-tiny.weights, and I also changed yolov3.cfg to yolov3-tiny.cfg, because my GPU has only 1 GB of memory but yolov3 needs 4 GB. So if your GPU's memory is lower than 4 GB, you can try yolov3-tiny.
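
For reference, the tiny model is trained the same way (a sketch; the yolov3-tiny.conv.15 pretrained weights are the ones suggested in the AlexeyAB instructions):

```sh
# Train the tiny variant instead, which fits in much less GPU memory
./darknet detector train cfg/voc.data cfg/yolov3-tiny.cfg yolov3-tiny.conv.15
```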

wilkice avatar Jan 24 '19 11:01 wilkice

Hey guys, here I am training on a small dataset with large images (3520×4280) with yolov3-tiny & darknet19_448.conv.23, and even I'm facing the same issue (Segmentation fault, core dumped). I made changes in the configuration file (random 1 to 0, batch & subdivisions). Can somebody help me resolve this?

Mahibro avatar Apr 15 '19 07:04 Mahibro

In my case my training data was the culprit. Make sure your training data is correct. Specifically, I had removed a class after using it on a few of the images, which raised this issue.

Sharev avatar May 09 '19 02:05 Sharev

Use the AlexeyAB repo for better exception handling. Some of your data in the annotations file might be going out of bounds (x, y < 0 or > 1).
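
A quick sanity check along these lines may catch them (a sketch assuming normalized YOLO labels under a hypothetical labels/ directory):

```sh
# Print label lines with any coordinate outside the valid [0, 1] range
awk '$2<0 || $2>1 || $3<0 || $3>1 || $4<0 || $4>1 || $5<0 || $5>1 {print FILENAME ": " $0}' labels/*.txt
```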

Rapternmn avatar Aug 02 '19 10:08 Rapternmn

For future reference...

Corrupted images can also cause a segmentation fault (core dumped) during training (and probably also during detection!).

In my case, after a few iterations (with no clear pattern), training would just halt and output segmentation fault (core dumped). Hope it helps!
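
One way to hunt for such images is to try decoding every file in the training list, e.g. with ImageMagick (a sketch; assumes identify is installed and train.txt lists one image path per line):

```sh
# Flag images that ImageMagick cannot read cleanly
while read -r img; do
  identify "$img" >/dev/null 2>&1 || echo "corrupt or unreadable: $img"
done < train.txt
```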

Best regards, André

afcarvalho1991 avatar Sep 14 '20 17:09 afcarvalho1991

I had segmentation faults, and found that they were caused by objects that are partially or fully outside of the image. Some of them were caught by yolo and listed as "bad-label", but some of them I had to identify myself. After removing them from the dataset, the training succeeds!

pandadom avatar Mar 15 '23 09:03 pandadom

I was having a very similar problem while fine-tuning YOLOv4: I was getting a segmentation fault. The issue was with the data; .ipynb_checkpoints entries had accidentally gotten into train.txt. So I would recommend that you also take a look at the images listed in your train.txt file.
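
A simple filter can strip those entries out (a sketch; it writes a cleaned copy rather than editing in place):

```sh
# Drop .ipynb_checkpoints entries from the training list
grep -v '\.ipynb_checkpoints' train.txt > train_clean.txt
```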

rumanxyz avatar Jul 20 '23 15:07 rumanxyz