
When I run train.py for about 40 iterations, it hits an error and the program breaks

Open Bigwode opened this issue 6 years ago • 12 comments

    Traceback (most recent call last):
      File "train.py", line 255, in <module>
        train()
      File "train.py", line 165, in train
        images, targets = next(batch_iterator)
      File "/home/chenzw/anaconda3/envs/tensor3/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 271, in __next__
        raise StopIteration
    StopIteration
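The exception itself is just standard Python iterator behavior: an iterator over the DataLoader is exhausted after one pass over the dataset, and the next call to next() raises StopIteration. A minimal sketch of what train.py's manual next() call runs into, with a plain list standing in for the DataLoader:

```python
# Minimal sketch: a plain list stands in for the torch DataLoader.
# After one full pass, next() raises StopIteration instead of wrapping around.
data_loader = [("images_0", "targets_0"), ("images_1", "targets_1")]

batch_iterator = iter(data_loader)
next(batch_iterator)  # first batch
next(batch_iterator)  # second (last) batch

try:
    next(batch_iterator)  # one pass is done -> StopIteration
except StopIteration:
    print("StopIteration: the iterator is exhausted")
```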

Bigwode avatar Apr 01 '18 01:04 Bigwode

Wow, another error:

    Traceback (most recent call last):
      File "train.py", line 255, in <module>
        train()
      File "train.py", line 165, in train
        images, targets = next(batch_iterator)
      File "/home/chenzw/anaconda3/envs/tensor/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 271, in __next__
        raise StopIteration
    StopIteration

Bigwode avatar Apr 01 '18 01:04 Bigwode

I have the same problem

ghost avatar Apr 07 '18 09:04 ghost

    iter 510 || Loss: 8.8698 ||
    Traceback (most recent call last):
      File "train.py", line 257, in <module>
        train()
      File "train.py", line 184, in train
        'append', epoch_size)
      File "train.py", line 122, in update_vis_plot
        update=True
      File "/home/zhenghe/.local/lib/python3.5/site-packages/visdom/__init__.py", line 206, in result
        return fn(*args, **kwargs)
      File "/home/zhenghe/.local/lib/python3.5/site-packages/visdom/__init__.py", line 769, in line
        update=update, name=name)
      File "/home/zhenghe/.local/lib/python3.5/site-packages/visdom/__init__.py", line 206, in result
        return fn(*args, **kwargs)
      File "/home/zhenghe/.local/lib/python3.5/site-packages/visdom/__init__.py", line 613, in scatter
        assert win is not None
    AssertionError

zhhezhhe avatar Apr 10 '18 12:04 zhhezhhe

Because once the iterator has run through one whole epoch, it cannot automatically start iterating again from the beginning.
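One way to get the looping behavior we want is to re-create the iterator whenever it is exhausted. A minimal sketch, with a plain list standing in for the DataLoader:

```python
# Minimal sketch: restart the iterator when one epoch's worth of batches
# is used up. `data_loader` is a plain list standing in for a DataLoader.
data_loader = [("images_0", "targets_0"), ("images_1", "targets_1")]

batch_iterator = iter(data_loader)
batches_seen = []
for iteration in range(5):  # more iterations than one epoch holds
    try:
        images, targets = next(batch_iterator)
    except StopIteration:
        # one epoch finished: start over from the beginning of the dataset
        batch_iterator = iter(data_loader)
        images, targets = next(batch_iterator)
    batches_seen.append(images)

print(batches_seen)
# → ['images_0', 'images_1', 'images_0', 'images_1', 'images_0']
```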

This problem is solved here.

ShoufaChen avatar Apr 21 '18 04:04 ShoufaChen

The reason is as ShoufaChen explained.
I noticed there is another way to handle this problem. I train this network on the VOC dataset, and reading config.py in the data folder, training is expected to run for 120000 iterations. One pass over the entire dataset takes iter_datasets = len(dataset) / batch_size iterations, so to reach our goal of 120000 iterations we need to repeat for epoch_size = 120000 / iter_datasets epochs. To simplify the design, I changed the code like this:

iter_datasets = len(dataset) // args.batch_size
epoch_size = cfg['max_iter'] // iter_datasets
for epoch in range(0, epoch_size):
    for i_batch, (images, targets) in enumerate(data_loader):
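As a worked example of the arithmetic above (the concrete dataset size is an assumption, roughly VOC07+12 trainval; your numbers will differ):

```python
# Worked example of the epoch_size arithmetic above.
# The concrete numbers are assumptions for illustration only.
max_iter = 120000      # cfg['max_iter'] in config.py
dataset_len = 16551    # assumed VOC07+12 trainval size
batch_size = 32

iter_datasets = dataset_len // batch_size   # batches in one epoch
epoch_size = max_iter // iter_datasets      # epochs needed to reach max_iter

print(iter_datasets, epoch_size)  # → 517 232
```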

Note that this will break your visdom output; I simply turned visdom off. Another thing to notice is that you should change your code to print the information (loss, accuracy, time, ...) you are interested in. As for visdom, you also need a global viz statement at the beginning of the train() function so that your plotting code can use viz when you pass the flag --visdom True.

Hope this helps!

zhuyu72: you can copy train.py to another file (say your_train.py) and change some of the code like this:

    iter_datasets = len(dataset) // args.batch_size
    epoch_size = cfg['max_iter'] // iter_datasets

    for iteration in range(0, epoch_size):
        for i_batch, (images, targets) in enumerate(data_loader):
            # update the epoch plot once at the start of each new epoch
            # (the original check `iteration % epoch_size == 0` can never be
            # true here, since iteration stays below epoch_size)
            if args.visdom and iteration != 0 and i_batch == 0:
                update_vis_plot(epoch, loc_loss, conf_loss, epoch_plot, None,
                                'append', epoch_size)
                # reset epoch loss counters
                loc_loss = 0
                conf_loss = 0

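If you keep this nested-loop version, other code in train.py (learning-rate decay, logging) was keyed to a single running iteration count. One way to recover it, sketched here with assumed small numbers rather than the repo's code, is to derive a global step from the epoch and batch indices:

```python
# Sketch: recover a single global step from the nested epoch/batch loops,
# so code that was keyed to `iteration` (lr decay, logging) still works.
iter_datasets = 4   # assumed batches per epoch, for illustration
epoch_size = 3      # assumed number of epochs, for illustration

steps = []
for epoch in range(epoch_size):
    for i_batch in range(iter_datasets):
        global_step = epoch * iter_datasets + i_batch
        steps.append(global_step)

print(steps[-1])  # → 11  (epoch_size * iter_datasets - 1)
```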
HosinPrime avatar Apr 26 '18 03:04 HosinPrime

Where should I place the code? @HosinPrime

zhuyu72 avatar May 13 '18 08:05 zhuyu72

The change I made solved this problem. In the loop for iteration in range(args.start_iter, cfg['max_iter']): use try...except to catch the StopIteration raised by next(), and reload the data in the except branch.

visor2020 avatar Jun 05 '18 12:06 visor2020

I made the following revision:

  1. epoch_size = len(data_loader) instead of len(dataset) // args.batch_size
  2. add a try clause when loading training data:

    try:
        images, targets = next(batch_iterator)
    except StopIteration:
        batch_iterator = iter(data_loader)
        images, targets = next(batch_iterator)
    except Exception as e:
        print("Loading data Exception:", e)
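The first point matters when the dataset size is not a multiple of the batch size: with drop_last=False (the DataLoader default), a final partial batch is yielded, so len(data_loader) is the ceiling of the division rather than the floor. A sketch of the arithmetic, without torch and with an assumed dataset size:

```python
import math

# With drop_last=False a DataLoader yields a final partial batch, so
# len(data_loader) == ceil(len(dataset) / batch_size) -- one more than
# the floor division used elsewhere in this thread.
dataset_len = 16551   # assumed dataset size, for illustration
batch_size = 32

floor_batches = dataset_len // batch_size            # drops the partial batch
full_batches = math.ceil(dataset_len / batch_size)   # counts the partial batch

print(floor_batches, full_batches)  # → 517 518
```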

blueardour avatar Jun 20 '18 01:06 blueardour

@HosinPrime did you run train.py? Why does my loss always stay between 0.5 and 0.8?

    epoch:1/766 loss:0.0582 spend time:118.05
    epoch:2/766 loss:0.0560 spend time:238.11
    epoch:3/766 loss:0.0545 spend time:356.86
    epoch:4/766 loss:0.0486 spend time:476.96
    epoch:5/766 loss:0.0508 spend time:596.04
    epoch:6/766 loss:0.0613 spend time:716.75
    epoch:7/766 loss:0.0544 spend time:836.07
    epoch:8/766 loss:0.0470 spend time:956.83
    epoch:9/766 loss:0.0610 spend time:1076.16
    epoch:10/766 loss:0.0708 spend time:1196.66
    epoch:11/766 loss:0.0707 spend time:1317.04
    epoch:12/766 loss:0.0698 spend time:1436.10
    epoch:13/766 loss:0.0508 spend time:1556.46
    epoch:14/766 loss:0.0824 spend time:1675.38
    epoch:15/766 loss:0.0627 spend time:1795.84
    epoch:16/766 loss:0.0748 spend time:1914.88
    epoch:17/766 loss:0.0557 spend time:2035.31
    epoch:18/766 loss:0.0653 spend time:2154.38
    epoch:19/766 loss:0.0684 spend time:2274.93
    epoch:20/766 loss:0.0701 spend time:2394.18
    epoch:21/766 loss:0.0529 spend time:2514.80
    epoch:22/766 loss:0.0538 spend time:2634.17
    epoch:23/766 loss:0.0456 spend time:2754.80
    epoch:24/766 loss:0.0572 spend time:2875.39
    epoch:25/766 loss:0.0626 spend time:2994.57
    epoch:26/766 loss:0.0653 spend time:3115.05
    epoch:27/766 loss:0.0566 spend time:3234.57
    epoch:28/766 loss:0.0501 spend time:3355.59
    epoch:29/766 loss:0.0527 spend time:3475.41

hust-kevin avatar Aug 03 '18 06:08 hust-kevin

@HosinPrime Hi, I trained after changing the batch_size, without modifying the network, but the resulting training loss is very bad. What could be the reason?

ghost avatar Oct 20 '20 14:10 ghost

> @HosinPrime Hi, I trained after changing the batch_size, without modifying the network, but the resulting training loss is very bad. What could be the reason?

The batch_size may be too small, which hurts the training results.

chnzhero avatar Oct 24 '20 01:10 chnzhero

> The batch_size may be too small, which hurts the training results.

I changed batch_size to 16 for training, but the loss still does not decrease.

ghost avatar Oct 28 '20 00:10 ghost