jetson-inference
Question about re-training ssd
Hello,
I followed the steps to re-train the SSD model with my own dataset and also watched the video tutorial. Do I need to set the number of epochs? I also wonder why it took me almost three hours to re-train the model with just 200 pictures, when in the video it seems pretty fast.
Thank you
Hi @namaws, the default number of training epochs with train_ssd.py is 30 if you don't specify it (with the --epochs command-line option). You may need more, however, if your model isn't accurate enough, and/or you may need to add more images to your dataset.
In the YouTube video I fast-forwarded, btw, so the video was not too long.
Hello @dusty-nv,
Thank you! I'll stop the training and add more images. Is there a minimum number of images you would recommend for that?
IIRC I recommend 100 images per class minimum
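To check a dataset against that guideline, the per-class image counts can be tallied from the annotation files. The following is a minimal sketch, assuming Pascal VOC style XML annotations (one `<object>` element with a `<name>` child per labeled box); the function names and the threshold parameter are hypothetical and not part of jetson-inference.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def count_images_per_class(annotations_dir):
    """Count how many images contain each class in a folder of VOC XML files.

    Returns (counts, unlabeled), where counts maps class name to the number
    of images containing at least one box of that class, and unlabeled lists
    the annotation files that contain no labeled objects at all.
    """
    counts = Counter()
    unlabeled = []
    for xml_path in sorted(Path(annotations_dir).glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        # Collect the distinct class names that appear in this image.
        names = {
            obj.findtext("name", default="").strip()
            for obj in root.iter("object")
        }
        names.discard("")
        if not names:
            unlabeled.append(xml_path.stem)
        counts.update(names)
    return counts, unlabeled


def classes_below_minimum(counts, minimum=100):
    """Return the class names with fewer than `minimum` images, sorted."""
    return sorted(name for name, n in counts.items() if n < minimum)
```

For a dataset passed as --data=data/process, the annotations would typically be under an Annotations subfolder (an assumption about the dataset layout), so `count_images_per_class("data/process/Annotations")` would return the tallies plus the images that have no boxes.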
Hello @dusty-nv,
Is it normal for it to take two hours just to finish epoch 0? (I set the number of epochs to 5.) Thank you
It shouldn't typically take that long, no. My guess is that your board is low on memory and is swapping out. Did you follow these steps?
- https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#mounting-swap
- https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-transfer-learning.md#disabling-the-desktop-gui
Hello @dusty-nv,
I mounted 4GB of swap space before I trained a few days ago. Do I need to do it every time before I train? If I am using the Jetson Nano 2GB Developer Kit, can the swap space also be 4GB? Or is it possible that I didn't really use the GPU in the Jetson Nano for training? Thank you
I mounted 4GB of swap space before I trained a few days ago. Do I need to do it every time before I train?
If you edited /etc/fstab as shown in the documentation, then it will be mounted automatically when the system boots.
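One way to confirm that a swap entry is actually present in /etc/fstab is to look for lines whose third field (the filesystem type) is `swap`. This is a small sketch under that assumption; the function name and the sample entry in the comment are hypothetical, and the exact line in your file may differ from the documentation's.

```python
def fstab_swap_entries(fstab_text):
    """Return the device/file paths of all swap entries in fstab-style text.

    Each non-comment fstab line has whitespace-separated fields:
    device, mount point, filesystem type, options, dump, pass.
    """
    entries = []
    for line in fstab_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "swap":
            entries.append(fields[0])
    return entries


def read_fstab_swap_entries(path="/etc/fstab"):
    """Read the given fstab file and return its swap entries."""
    with open(path) as f:
        return fstab_swap_entries(f.read())

# Hypothetical example of a swap-file entry:
#   /mnt/4GB.swap  none  swap  sw  0  0
```

If `read_fstab_swap_entries()` returns an empty list, nothing will be mounted as swap at boot and the swap file has to be enabled manually after each restart.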
If I am using the Jetson Nano 2GB Developer Kit, can the swap space also be 4GB?
You can mount more if you want... I would keep an eye on the memory/swap usage with sudo tegrastats to see if you are running low.
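As a complement to tegrastats, memory and swap usage can also be read from the standard Linux /proc/meminfo file, for example from a script running alongside training. This is a minimal sketch, not part of jetson-inference; the function names are hypothetical, and it relies only on the `MemAvailable`, `SwapTotal` and `SwapFree` fields that /proc/meminfo reports in kB.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def swap_used_mb(values):
    """Swap currently in use, in MB, from parsed meminfo values."""
    return (values.get("SwapTotal", 0) - values.get("SwapFree", 0)) / 1024


def read_memory_status(path="/proc/meminfo"):
    """Return available RAM and used swap (both in MB) for this machine."""
    with open(path) as f:
        values = parse_meminfo(f.read())
    return {
        "available_mb": values.get("MemAvailable", 0) / 1024,
        "swap_used_mb": swap_used_mb(values),
    }
```

Calling `read_memory_status()` periodically during training shows whether available RAM is close to zero while used swap keeps growing, which would be consistent with the board swapping heavily.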
Or is it possible that I didn't really use the GPU in the Jetson Nano for training?
It will automatically be used. My guess is that the board is just low on memory and swapping out a lot.
Hello @dusty-nv,
So probably if I want to shorten the training time, it is better to use another board?
Hello @dusty-nv, my training gets stuck after a few steps and does not progress any further. I am using an Nvidia Jetson Nano 4GB.
- swap memory is already mounted
- the desktop GUI is also disabled, but training still gets stuck
The dataset has fewer than 100 images in total and three classes (grab, process, place). The command I used and its output are below. Thanks in advance.
root@mlworkx-desktop:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/process --model-dir=models/process --batch-size=2 --workers=1 --epochs=1
2022-09-11 20:29:02 - Using CUDA...
2022-09-11 20:29:02 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='models/process', dataset_type='voc', datasets=['data/process'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-09-11 20:29:02 - Prepare training datasets.
warning - image 20220911-192612 has no box/labels annotations, ignoring from dataset
warning - image 20220911-192618 has no box/labels annotations, ignoring from dataset
2022-09-11 20:29:03 - VOC Labels read from file: ('BACKGROUND', 'grab', 'process', 'place', '')
2022-09-11 20:29:03 - Stored labels into file models/process/labels.txt.
2022-09-11 20:29:03 - Train dataset size: 95
2022-09-11 20:29:03 - Prepare Validation datasets.
2022-09-11 20:29:03 - VOC Labels read from file: ('BACKGROUND', 'grab', 'process', 'place', '')
2022-09-11 20:29:03 - Validation dataset size: 95
2022-09-11 20:29:03 - Build network.
2022-09-11 20:29:03 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2022-09-11 20:29:03 - Took 0.51 seconds to load the model.
2022-09-11 20:29:15 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2022-09-11 20:29:15 - Uses CosineAnnealingLR scheduler.
2022-09-11 20:29:15 - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:134: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
2022-09-11 20:30:01 - Epoch: 0, Step: 10/48, Avg Loss: 11.5137, Avg Regression Loss 3.6990, Avg Classification Loss: 7.8148
2022-09-11 20:30:05 - Epoch: 0, Step: 20/48, Avg Loss: 8.8349, Avg Regression Loss 4.0577, Avg Classification Loss: 4.7772
2022-09-11 20:30:09 - Epoch: 0, Step: 30/48, Avg Loss: 8.6587, Avg Regression Loss 2.7192, Avg Classification Loss: 5.9396
2022-09-11 20:30:13 - Epoch: 0, Step: 40/48, Avg Loss: 6.7744, Avg Regression Loss 2.5685, Avg Classification Loss: 4.2059
Hi @namaws, can you try running with --batch-size=1 instead? You can also keep an eye on the memory usage with tegrastats.
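For reference, the step counts in the log follow from the dataset size and batch size: 95 training images at --batch-size=2 give 48 steps per epoch, which matches the "Step: 10/48" lines, and --batch-size=1 would give 95. The sketch below reproduces that arithmetic and derives a rough training-time estimate. It assumes the last partial batch is kept and uses the roughly 0.4 seconds per step implied by the log timestamps (about 4 seconds per 10 steps); the function names are hypothetical, and validation and checkpoint saving are not included in the estimate.

```python
import math


def steps_per_epoch(dataset_size, batch_size):
    """Number of training steps per epoch, keeping the last partial batch."""
    return math.ceil(dataset_size / batch_size)


def estimate_training_seconds(dataset_size, batch_size, epochs, seconds_per_step):
    """Rough training time, excluding validation and checkpoint saving."""
    return steps_per_epoch(dataset_size, batch_size) * epochs * seconds_per_step


# Values taken from the log above: 95 images, batch size 2, 1 epoch.
print(steps_per_epoch(95, 2))
print(estimate_training_seconds(95, 2, epochs=1, seconds_per_step=0.4))
```

At that pace one epoch of training steps should take on the order of 20 seconds, so an epoch that runs for hours or appears stuck points to something other than the step count, such as the memory pressure discussed above.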