Error training ssd-mobilenet from custom dataset

Open e-mily opened this issue 2 years ago • 39 comments

@dusty-nv I followed the tutorial and created train.txt, test.txt, val.txt and trainval.txt in ImageSets/Main. I even switched to having just default.txt in ImageSets/Main, and I'm still getting the following error. Can you help me?

root@aititx22-desktop:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/total-5 --model=models/total-5 --batch-size=2 --workers=1 --epochs=1
2022-02-23 09:22:37 - Using CUDA...
2022-02-23 09:22:37 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='models/total-5', dataset_type='voc', datasets=['data/total-5'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-02-23 09:22:37 - Prepare training datasets.
Traceback (most recent call last):
  File "train_ssd.py", line 214, in <module>
    target_transform=target_transform)
  File "/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 33, in __init__
    raise IOError("missing ImageSet file {:s}".format(image_sets_file))
TypeError: unsupported format string passed to PosixPath.__format__
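
The TypeError in this traceback appears to hide the real problem. The line that fails is the one raising `IOError("missing ImageSet file ...")`, so the loader had already decided that the ImageSets file was missing; the message then failed to build because a `pathlib.Path` does not accept the `{:s}` format spec. A minimal sketch of that behaviour, assuming `image_sets_file` is a `Path` (as the `PosixPath` in the message suggests); the path and the helper name below are hypothetical:

```python
from pathlib import Path

# Hypothetical path standing in for the ImageSets file the dataset loader looks for.
image_sets_file = Path("data/total-5/ImageSets/Main/trainval.txt")

def format_missing_file_error(path):
    """Return (message, formatting_error) for a missing ImageSets file."""
    try:
        # Same pattern as in the traceback: "{:s}" applied directly to a Path.
        # Path does not accept the "s" format spec, so this raises TypeError.
        return "missing ImageSet file {:s}".format(path), None
    except TypeError as exc:
        # Converting the Path to str first yields the intended message.
        return "missing ImageSet file {:s}".format(str(path)), str(exc)

message, formatting_error = format_missing_file_error(image_sets_file)
print(message)
print("formatting failed with:", formatting_error)
```

Read this way, the TypeError most likely means that the expected ImageSets file was not found under the dataset path passed to `--data`.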

e-mily avatar Feb 23 '22 09:02 e-mily

Hmm... hi @e-mily, can you share the output of ls /jetson-inference/python/training/detection/ssd/data/total-5/ImageSets/Main with me?

dusty-nv avatar Feb 23 '22 18:02 dusty-nv

2

e-mily avatar Feb 24 '22 06:02 e-mily

2

Sorry @dusty-nv, I'm able to train now. The problem was that I had put my dataset in the wrong folder.

e-mily avatar Feb 24 '22 07:02 e-mily

I have other questions to ask:

  1. how do i do image augmentation using the tutorial?
  2. is there a way to deal with underfitting and overfitting?
  3. how can i change the number of layers being trained by detectnet?
  4. if I only have ImageSets/Main/default.txt, how does the code divide the dataset into train, test, and validation? Is there a certain percentage to it? (I was only able to train with ImageSets/Main/default.txt)

e-mily avatar Feb 24 '22 07:02 e-mily

  1. how do i do image augmentation using the tutorial?

Image augmentation is already done automatically by the TrainAugmentation transforms: https://github.com/dusty-nv/pytorch-ssd/blob/3f9ba554e33260c8c493a927d7c4fdaa3f388e72/vision/ssd/data_preprocessing.py#L4

So if you want, you can add to them there.

3. how can i change the number of layers being trained by detectnet?

You would need to change the SSD network definitions under https://github.com/dusty-nv/pytorch-ssd/tree/3f9ba554e33260c8c493a927d7c4fdaa3f388e72/vision/ssd (I have not attempted this)

4. if I only have ImageSets/Main/default.txt, how does the code divide the dataset into train, test, and validation? Is there a certain percentage to it? (I was only able to train with ImageSets/Main/default.txt)

default.txt uses the same dataset across train and test, so it doesn't split it. If you want it split, you should have different trainval.txt and test.txt files under ImageSets/Main
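
A minimal sketch of producing such a split, assuming the usual Pascal VOC layout in which each line of an ImageSets file is an image ID (the annotation file name without its extension). The 80/20 ratio, the fixed seed, and the function name `write_split` are illustrative choices, not part of the tutorial:

```python
import random
from pathlib import Path

def write_split(dataset_dir, test_fraction=0.2, seed=0):
    """Write trainval.txt and test.txt under ImageSets/Main from Annotations/*.xml."""
    dataset_dir = Path(dataset_dir)
    # One image ID per annotation file, e.g. Annotations/img_1.xml -> "img_1".
    ids = sorted(p.stem for p in (dataset_dir / "Annotations").glob("*.xml"))
    random.Random(seed).shuffle(ids)

    n_test = int(len(ids) * test_fraction)
    test_ids, trainval_ids = ids[:n_test], ids[n_test:]

    out_dir = dataset_dir / "ImageSets" / "Main"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Written as UTF-8 without a byte-order mark, one ID per line.
    (out_dir / "trainval.txt").write_text(
        "".join(i + "\n" for i in trainval_ids), encoding="utf-8")
    (out_dir / "test.txt").write_text(
        "".join(i + "\n" for i in test_ids), encoding="utf-8")
    return trainval_ids, test_ids
```

Calling `write_split("data/total-5")` would then leave two disjoint ID lists in `data/total-5/ImageSets/Main`.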

dusty-nv avatar Feb 24 '22 17:02 dusty-nv

Thank you @dusty-nv. That was really helpful.

But then when I tried to put them into trainval.txt, test.txt, val.txt etc I received the error as stated above.

e-mily avatar Feb 24 '22 18:02 e-mily

When I ran the live stream after building the model, I realized my camera feed is flipped. Is there any way to flip it back? I'm using a Jetson TX2.

e-mily avatar Feb 24 '22 18:02 e-mily

But then when I tried to put them into trainval.txt, test.txt, val.txt etc I received the error as stated above.

So do you have the file: total-5/ImageSets/Main/trainval.txt and total-5/ImageSets/Main/test.txt ? Does your user have permissions to read them?

They are looked for in the code here: https://github.com/dusty-nv/pytorch-ssd/blob/3f9ba554e33260c8c493a927d7c4fdaa3f388e72/vision/datasets/voc_dataset.py#L22

When I ran the live stream after building the model, I realized my camera feed is flipped. Is there any way to flip it back? I'm using a Jetson TX2.

Yes, try running it with --input-flip=rotate-180

For more info, see here: https://github.com/dusty-nv/jetson-inference/blob/master/docs/aux-streaming.md#input-options

dusty-nv avatar Feb 24 '22 21:02 dusty-nv

So do you have the file: total-5/ImageSets/Main/trainval.txt and total-5/ImageSets/Main/test.txt ? Does your user have permissions to read them?

I did, but it gives TypeError: unsupported format string passed to PosixPath.__format__
But if I change it to total-5/ImageSets/Main/default.txt then it works! How do I know if my user has permission to read them?
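
One way to check both existence and read permission, sketched in Python under the assumption that the dataset follows the ImageSets/Main layout discussed above. The helper name `check_image_sets` is hypothetical:

```python
import os
from pathlib import Path

def check_image_sets(dataset_dir, names=("trainval.txt", "test.txt")):
    """Return {file name: (exists, readable)} for files under ImageSets/Main."""
    main_dir = Path(dataset_dir) / "ImageSets" / "Main"
    report = {}
    for name in names:
        path = main_dir / name
        # os.access checks whether the current user may read the file.
        report[name] = (path.is_file(), os.access(path, os.R_OK))
    return report

if __name__ == "__main__":
    # Dataset path taken from the training command earlier in the thread.
    for name, (exists, readable) in check_image_sets("data/total-5").items():
        print("{}: exists={} readable={}".format(name, exists, readable))
```

A file reported as `exists=False` points to a wrong dataset path or file name rather than to a permissions problem.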

e-mily avatar Feb 27 '22 01:02 e-mily

@dusty-nv I realized the models are write-protected. how do i remove that so that i can delete it? because i want to change the parameters and train the model again.

Btw I was able to train with trainval.txt and val.txt! Thank you!

e-mily avatar Feb 27 '22 06:02 e-mily

[screenshot: error] I get this error when I try to train the same model with an increased epoch value.

e-mily avatar Feb 27 '22 08:02 e-mily

Even if I decrease to workers=0 I still get the same error. I also tried to set up swap memory (I don't know if I did it correctly; I don't really understand what I'm looking at). I have an SD card attached to the Jetson TX2. Will it help?

e-mily avatar Feb 28 '22 08:02 e-mily

I realized the models are write-protected. how do i remove that so that i can delete it?

You can use a command like sudo chown -R <your-user> <path-to-model-dir>

Even if I decrease to workers=0 I still get the same error. I also tried to set up swap memory (I don't know if I did it correctly; I don't really understand what I'm looking at)

The "Killed" message you are getting normally means the board has run out of memory. I recommend running with --batch-size=1 and --workers=0 to decrease the memory usage. Also, here are the instructions for mounting swap, disabling ZRAM, and disabling the desktop GUI:

  • https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#mounting-swap

dusty-nv avatar Feb 28 '22 14:02 dusty-nv

Thank you @dusty-nv! I was training my model with an increasing number of epochs, and I found that with more epochs, when I test my model on test images, I don't see any bounding boxes at all. I don't see any confidence level displayed in the terminal either. What do I do?

e-mily avatar Mar 01 '22 04:03 e-mily

[image: traffic] Like this one. I'm supposed to have 3 attributes but it can only detect 1. I don't know why the bounding box is so small.

This is the command I ran:

detectnet --model=models/5-imagesa/ssd-mobilenet.onnx --labels=models/5-images/labels.txt --input-blob=input_0 --output-cvg=scores --output-bbox=boxes "/jetson-inference/data/imagess/traffic_*.jpeg" /jetson-inference/data/imagess/test2/traffic_%i.jpeg

e-mily avatar Mar 01 '22 07:03 e-mily

It's either that or I'm not getting any results at all with increasing epochs.

e-mily avatar Mar 01 '22 10:03 e-mily

Can you try deleting the *.engine file from your model's folder and try running detectnet program again?
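
A sketch of that cleanup in Python, assuming the cached engine files sit directly in the model folder and end in `.engine` as described above. The helper name `remove_engine_files` is hypothetical:

```python
from pathlib import Path

def remove_engine_files(model_dir):
    """Delete cached *.engine files in model_dir and return their names, sorted."""
    removed = []
    for engine_path in sorted(Path(model_dir).glob("*.engine")):
        engine_path.unlink()
        removed.append(engine_path.name)
    return removed
```

For the model folder used in the detectnet command earlier in the thread, this would be called as `remove_engine_files("models/5-imagesa")`. The ONNX model and labels.txt are left untouched.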

How many epochs did you train it for? Normally at least 30 is needed for good results. You can run the pytorch-ssd code on a Linux/Ubuntu PC for faster training (you will need to install PyTorch on it and such)

Also, you can use the run_ssd_example.py script to test one of your PyTorch .pth model checkpoints before it gets exported to ONNX. This will help you to confirm if the model is in fact trained to your liking first.

dusty-nv avatar Mar 01 '22 19:03 dusty-nv

Can you give me the full command to run run_ssd_example.py? I tried from 5 epochs, increasing up to 50. It only shows accuracy for 5 and 10 epochs. After that it seems it couldn't detect anything, as it wasn't showing any accuracy figure.

e-mily avatar Mar 02 '22 16:03 e-mily

Can you give me the full command to run run_ssd_example.py?

python3 run_ssd_example.py mb1-ssd <path-to-pth-checkpoint> <path-to-labels.txt> <path-to-test-image>

dusty-nv avatar Mar 02 '22 18:03 dusty-nv

python3 run_ssd_example.py mb1-ssd <path-to-pth-checkpoint> <path-to-labels.txt> <path-to-test-image>

root@aititx22-desktop:/jetson-inference/python/training/detection/ssd# python3 run_ssd_example.py mb1-ssd models/20-imagesa/mb1-ssd-Epoch-9-Loss-7.462369181893089.pth models/20-imagesa/labels.txt /jetson-inference/data/imagess/test/traffic_%i.jpeg

Traceback (most recent call last):
  File "run_ssd_example.py", line 50, in <module>
    image = cv2.cvtColor(orig_image, cv2.COLOR_BGR2RGB)
cv2.error: OpenCV(4.5.0) /opt/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'

I tried like that but i got this error...
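
The `!_src.empty()` assertion means OpenCV was handed an empty image, which typically happens when the path does not point to a readable image file. Going by the usage given above, run_ssd_example.py takes a single test image, so a pattern such as `traffic_%i.jpeg` is treated as a literal file name. A minimal sketch, assuming that usage, that expands a wildcard and builds one command per image; the helper name `build_commands` is hypothetical and the paths are the ones from this thread:

```python
import glob
import subprocess
import sys

def build_commands(checkpoint, labels, image_pattern):
    """Return one run_ssd_example.py command line per image matching the pattern."""
    commands = []
    for image_path in sorted(glob.glob(image_pattern)):
        commands.append([sys.executable, "run_ssd_example.py", "mb1-ssd",
                         checkpoint, labels, image_path])
    return commands

if __name__ == "__main__":
    commands = build_commands(
        "models/20-imagesa/mb1-ssd-Epoch-9-Loss-7.462369181893089.pth",
        "models/20-imagesa/labels.txt",
        "/jetson-inference/data/imagess/test/traffic_*.jpeg",
    )
    for command in commands:
        # Runs the script once per image; nothing runs if no image matches.
        subprocess.run(command, check=True)
```

Since the script reports its result as run_ssd_example_output.jpg, each run presumably overwrites the previous output, so copy that file between runs if every result should be kept.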

e-mily avatar Mar 03 '22 07:03 e-mily

So I guess the correct command is:

root@aititx22-desktop:/jetson-inference/python/training/detection/ssd# python3 run_ssd_example.py mb1-ssd models/20-imagesa/mb1-ssd-Epoch-9-Loss-7.462369181893089.pth models/20-imagesa/labels.txt /jetson-inference/data/imagess/traffic_8.jpeg

Inference time: 2.8669397830963135
Found 0 objects. The output image is run_ssd_example_output.jpg

What do I do? I followed every step...

I'll try with more epochs. Just curious, shouldn't it be able to detect something even with very low accuracy?

e-mily avatar Mar 03 '22 07:03 e-mily

root@aititx22-desktop:/jetson-inference/python/training/detection/ssd# python3 run_ssd_example.py mb1-ssd models/20-imagesa/mb1-ssd-Epoch-99-Loss-4.31419215780316.pth models/20-imagesa/labels.txt /jetson-inference/data/imagess/traffic_8.jpeg

Inference time: 4.292574882507324
Found 0 objects. The output image is run_ssd_example_output.jpg

still zero objects found after running for 100 epochs...

what did i do wrong?

e-mily avatar Mar 03 '22 16:03 e-mily

How many images are in your dataset? Are the objects easily discernible? Are they small? It seems like the objects you are training it on may be difficult for it to recognize.

dusty-nv avatar Mar 03 '22 17:03 dusty-nv

How many images are in your dataset? Are the objects easily discernible? Are they small? It seems like the objects you are training it on may be difficult for it to recognize.

I'm training on 20 images for 3 annotations. The objects are not small. I'm training on them from different distances. I'm aware you need at least 100 images per annotation to train, but I don't have that many images per annotation.

Is there a way to increase the dataset through image augmentation??

I want to analyze the accuracy with increasing images per annotation and increasing epochs, but I can't get any accuracy out...

e-mily avatar Mar 03 '22 23:03 e-mily

I'm training on 20 images for 3 annotations. The objects are not small. I'm training on them from different distances. I'm aware you need at least 100 images per annotation to train, but I don't have that many images per annotation.

OK yes, you are going to need more images in your dataset. What are your 3 object classes? If they are all road signs, that you want to tell apart just by their different text, that may be more challenging for the DNN and you may need even more images in your dataset.

Is there a way to increase the dataset through image augmentation??

The train_ssd.py script already does image augmentation.

dusty-nv avatar Mar 04 '22 17:03 dusty-nv

I see. I'll try again with more images.

Instead of camera stream or test images, can i use video to test the accuracy of my model with detectnet?

If so, what is the command for that?

e-mily avatar Mar 06 '22 19:03 e-mily

Hi @e-mily, detectnet/detectnet.py doesn't have built-in accuracy measurement, because it has no knowledge of the ground-truth data. It is meant for inferencing only. It's the PyTorch side that has knowledge of the dataset and ground truth.

dusty-nv avatar Mar 07 '22 16:03 dusty-nv

Thank you @dusty-nv. I have another issue. I created a new dataset to increase the number of images and labels. When I try to run train_ssd.py it gives TypeError: unsupported format string passed to PosixPath.__format__ error.

I retried with the old dataset and it works! But I want to use the new dataset.

When I compare the old and new datasets they look the same to me, so I don't really know what the real issue is. What do you think?

e-mily avatar Mar 18 '22 09:03 e-mily

https://drive.google.com/drive/folders/1--DIZr1JPnETLCfGm6gnYrfAuQXxAdRn?usp=sharing

This is the link to my dataset. it would be a great help if you can check it out.

I tried using --debug-steps=1 and I also commented out that part of voc_dataset.py, but I'm not sure how to commit the change in the container.

e-mily avatar Mar 18 '22 15:03 e-mily

And also i still can't seem to divide them into trainval.txt and test.txt

e-mily avatar Mar 18 '22 15:03 e-mily

When I try to run train_ssd.py it gives TypeError: unsupported format string passed to PosixPath.__format__ error.

Can you provide the full error/exception output from the console, so I can see where in the code it is happening at?

dusty-nv avatar Mar 18 '22 17:03 dusty-nv

I tried using --debug-steps=1 and I also commented out that part of voc_dataset.py, but I'm not sure how to commit the change in the container.

You would want to edit this inside the container using the nano editor, or just run it without container by installing from source. Or I guess you could mount the jetson-inference/pytorch-ssd source code into the container, that would work too.

dusty-nv avatar Mar 18 '22 17:03 dusty-nv

thank you @dusty-nv turns out it was from my dataset. I want to ask how do i train for different models?

e-mily avatar Mar 21 '22 17:03 e-mily

The ssd-mobilenet-v1 is the only network architecture from pytorch-ssd that I have tested & verified is working through the whole pipeline, including the ONNX export from PyTorch and import into TensorRT and runtime pre/post-processing with jetson-inference

dusty-nv avatar Mar 21 '22 17:03 dusty-nv

@dusty-nv I am at the same spot as the post that opened this thread: I have the line 214 error, and I checked my directory and I do have read and write permission on the 4 files in it. There were so many other issues listed that I am not sure what solved the problem. Can you tell me what I should try next?

chromaowl avatar Jul 20 '22 18:07 chromaowl

to @dusty-nv - redid the entire process with a simpler set of objects; just 3 styles of batteries with 3 of each in many positions. When I run the train_ssd.py I still get stuck at line 214. I am sure I am missing something simple. Thanks, Stephen

chromaowl avatar Jul 20 '22 22:07 chromaowl

@chromaowl can you provide the terminal log of the error you are getting?

Are you sure you're providing the correct path to your dataset when you launch train_ssd.py?

dusty-nv avatar Jul 21 '22 01:07 dusty-nv

root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/batteries --model-dir=models/batteries --batch-size=4 --epochs=2 --workers=1
2022-07-21 15:54:03 - Using CUDA...
2022-07-21 15:54:03 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/batteries', dataset_type='voc', datasets=['data/batteries'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=2, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-07-21 15:54:03 - Prepare training datasets.
Traceback (most recent call last):
  File "train_ssd.py", line 214, in <module>
    target_transform=target_transform)
  File "/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 47, in __init__
    for line in infile:
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd#
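
Byte 0xef at position 0 is consistent with a UTF-8 byte-order mark (EF BB BF) at the very start of the ImageSets file, which some text editors add when saving. This is an inference from the traceback, not something confirmed in this thread. A minimal sketch that detects and strips such a mark; the helper name `strip_utf8_bom` and the example file path are assumptions:

```python
import codecs
from pathlib import Path

def strip_utf8_bom(path):
    """Remove a leading UTF-8 byte-order mark from the file; return True if one was removed."""
    path = Path(path)
    data = path.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        # Rewrite the file without the three BOM bytes.
        path.write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False

if __name__ == "__main__":
    # Hypothetical file name; use whichever ImageSets file the loader is reading.
    target = Path("data/batteries/ImageSets/Main/trainval.txt")
    if target.is_file():
        print("BOM removed" if strip_utf8_bom(target) else "no BOM found")
```

If the file still fails to decode afterwards, the image IDs themselves may contain non-ASCII characters, in which case renaming the images and annotations to plain ASCII names would be the next thing to try.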

chromaowl avatar Jul 21 '22 16:07 chromaowl

This is the path to my data:

root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd# cd data/batteries
root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd/data/batteries# ls -l
total 16
drwxr-xr-x 2 root root 4096 Jul 20 21:02 Annotations
drwxr-xr-x 3 root root 4096 Jul 20 20:09 ImageSets
drwxr-xr-x 2 root root 4096 Jul 20 21:02 JPEGImages
-rw-rw-r-- 1 1000 1000   17 Jul 20 21:22 labels.txt
root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd/data/batteries# ^C
root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd/data/batteries#

chromaowl avatar Jul 21 '22 16:07 chromaowl