
Question about re-training ssd

Open namaws opened this issue 1 year ago • 10 comments

Hello,

I followed the steps to re-train the SSD model with my own dataset and also checked the video tutorial. Do I need to set the number of epochs? I also wonder why it took me almost three hours to re-train the model with just 200 pictures, while in the video it seems pretty fast.

Thank you

namaws avatar Aug 05 '22 18:08 namaws

Hi @namaws, the default number of training epochs with train_ssd.py is 30 if you don't specify it (with the --epochs command-line option). You may need more epochs, however, if your model isn't accurate enough, and/or you may need to add more images to your dataset.
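For example, a run with an explicit epoch count might look like this (the dataset and model paths here are placeholders; the flags are the same ones shown in the training log later in this thread):

  python3 train_ssd.py --dataset-type=voc --data=data/my-dataset --model-dir=models/my-model --epochs=60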

By the way, in the YouTube video I fast-forwarded through the training so the video wouldn't be too long.

dusty-nv avatar Aug 05 '22 19:08 dusty-nv

Hello @dusty-nv,

Thank you! I'll stop the training and add more images. Is there a minimum number of images you would recommend having for that?

namaws avatar Aug 05 '22 19:08 namaws

IIRC I recommend 100 images per class minimum

dusty-nv avatar Aug 05 '22 19:08 dusty-nv

Hello @dusty-nv,

Is it normal for it to take two hours just to finish epoch 0? (I have epochs set to 5.) Thank you

namaws avatar Aug 08 '22 01:08 namaws

It shouldn't typically take that long, no. My guess is that your board is low on memory and is swapping out. Did you follow these steps? (See the sketch after the links.)

  • https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#mounting-swap
  • https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#disabling-the-desktop-gui
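For reference, the swap setup from the first link is roughly along these lines (check the linked page for the exact steps for your JetPack version; /mnt/4GB.swap is the path the docs use):

  sudo systemctl disable nvzramconfig    # disable zram so the file-backed swap is used
  sudo fallocate -l 4G /mnt/4GB.swap     # allocate a 4GB swap file
  sudo mkswap /mnt/4GB.swap              # format it as swap
  sudo swapon /mnt/4GB.swap              # enable it for the current boot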

dusty-nv avatar Aug 08 '22 11:08 dusty-nv

Hello @dusty-nv,

I mounted 4GB of swap space before I trained a few days ago. Do I need to do it every time before I train? If I am using the Jetson Nano 2GB Developer Kit, can the swap space also be 4GB? Or is it possible that I didn't really use the GPU on the Jetson Nano for training? Thank you

namaws avatar Aug 08 '22 13:08 namaws

I mounted 4GB of swap space before I trained a few days ago. Do I need to do it every time before I train?

If you edited /etc/fstab as shown in the documentation, then it will be mounted automatically when the system boots.
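A sketch of that /etc/fstab entry, assuming the swap file lives at /mnt/4GB.swap as above (double-check the linked docs for the exact line):

  /mnt/4GB.swap  none  swap  sw  0  0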

If I am using the Jetson Nano 2GB Developer Kit, can the swap space also be 4GB?

You can mount more if you want... I would keep an eye on the memory/swap usage with sudo tegrastats to see if you are running low.
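For example:

  sudo tegrastats    # prints RAM/SWAP usage (among other stats) about once per second

If the SWAP figure it reports sits near its limit while RAM is full, the board is thrashing and training will crawl (the exact field layout varies by JetPack version, but the RAM and SWAP columns are the ones to watch).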

Or is it possible that I didn't really use the GPU on the Jetson Nano for training?

It will automatically be used - my guess is that the board is just low on memory and swapping out a lot.
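If you want to double-check, a generic PyTorch check works (this is standard PyTorch, not something specific to train_ssd.py):

  python3 -c "import torch; print(torch.cuda.is_available())"    # True means PyTorch can see the GPU

train_ssd.py also prints "Using CUDA..." at startup when the GPU is in use, as in the log further down this thread.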

dusty-nv avatar Aug 08 '22 16:08 dusty-nv

Hello @dusty-nv,

So if I want to shorten the training time, would it probably be better to use another board?

namaws avatar Aug 08 '22 16:08 namaws

Hello @dusty-nv, my training gets stuck and stops executing after a few steps on an NVIDIA Jetson Nano 4GB model.

  1. Memory swap is already set up
  2. The desktop GUI is also disabled, but the code still gets stuck

Total images: <100. Three classes used: grab, process, place.

Command used:

root@mlworkx-desktop:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/process --model-dir=models/process --batch-size=2 --workers=1 --epochs=1

Thanks in advance

root@mlworkx-desktop:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/process --model-dir=models/process --batch-size=2 --workers=1 --epochs=1
2022-09-11 20:29:02 - Using CUDA...
2022-09-11 20:29:02 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='models/process', dataset_type='voc', datasets=['data/process'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-09-11 20:29:02 - Prepare training datasets.
warning - image 20220911-192612 has no box/labels annotations, ignoring from dataset
warning - image 20220911-192618 has no box/labels annotations, ignoring from dataset
2022-09-11 20:29:03 - VOC Labels read from file: ('BACKGROUND', 'grab', 'process', 'place', '')
2022-09-11 20:29:03 - Stored labels into file models/process/labels.txt.
2022-09-11 20:29:03 - Train dataset size: 95
2022-09-11 20:29:03 - Prepare Validation datasets.
2022-09-11 20:29:03 - VOC Labels read from file: ('BACKGROUND', 'grab', 'process', 'place', '')
2022-09-11 20:29:03 - Validation dataset size: 95
2022-09-11 20:29:03 - Build network.
2022-09-11 20:29:03 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2022-09-11 20:29:03 - Took 0.51 seconds to load the model.
2022-09-11 20:29:15 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2022-09-11 20:29:15 - Uses CosineAnnealingLR scheduler.
2022-09-11 20:29:15 - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:134: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
2022-09-11 20:30:01 - Epoch: 0, Step: 10/48, Avg Loss: 11.5137, Avg Regression Loss 3.6990, Avg Classification Loss: 7.8148
2022-09-11 20:30:05 - Epoch: 0, Step: 20/48, Avg Loss: 8.8349, Avg Regression Loss 4.0577, Avg Classification Loss: 4.7772
2022-09-11 20:30:09 - Epoch: 0, Step: 30/48, Avg Loss: 8.6587, Avg Regression Loss 2.7192, Avg Classification Loss: 5.9396
2022-09-11 20:30:13 - Epoch: 0, Step: 40/48, Avg Loss: 6.7744, Avg Regression Loss 2.5685, Avg Classification Loss: 4.2059

mlworkxAI avatar Sep 11 '22 20:09 mlworkxAI

Hi @namaws, can you try running with --batch-size=1 instead? You can also keep an eye on the memory usage with tegrastats.
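For reference, that would be the same command as in the log above with only the batch size changed:

  python3 train_ssd.py --dataset-type=voc --data=data/process --model-dir=models/process --batch-size=1 --workers=1 --epochs=1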

dusty-nv avatar Sep 12 '22 13:09 dusty-nv