
Bad performance on MPI-Sintel

Open xylf opened this issue 5 years ago • 18 comments

Hi, I have used your pretrained model to finetune on MPI-Sintel. The EPE on test set was 6.2. Have you tried it?

xylf avatar Jan 05 '19 08:01 xylf

To fine-tune on the MPI-Sintel dataset you have to change the dataset options. You can find the respective settings described in:

[1] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume." CVPR 2018, [arXiv:1709.02371](https://arxiv.org/abs/1709.02371)

and set them to:

ds_opts = deepcopy(_DEFAULT_DS_TUNE_OPTIONS)
ds_opts['in_memory'] = False
ds_opts['aug_type'] = 'heavy'
ds_opts['flipud'] = 0                  # Only apply horizontal flipping for data augmentation, see [1]
ds_opts['translate'] = (0, 0)          # Only apply horizontal flipping for data augmentation, see [1]
ds_opts['scale'] = (0, 0)              # Only apply horizontal flipping for data augmentation, see [1]
ds_opts['batch_size'] = batch_size * len(gpu_devices)
ds_opts['crop_preproc'] = (384, 768)   # Crop to the size described in [1]
ds_opts['batch_size'] = 4              # overrides the line above; the effective batch size is 4

and

# Robust loss as described in [1] doesn't work, so try the following:
nn_opts['loss_fn'] = 'loss_multiscale'
nn_opts['q'] = 0.4        # see [1]
nn_opts['epsilon'] = 0.01 # see [1]
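
For context, the q and epsilon values above correspond to the robust fine-tuning loss from [1], roughly (|w_pred − w_GT|_1 + epsilon)^q accumulated over pixels and pyramid levels. A rough NumPy illustration of the per-level term (my own sketch, not code from this repo):

import numpy as np

def robust_level_term(flow_pred, flow_gt, q=0.4, epsilon=0.01):
    """One pyramid level of the robust loss from [1]: mean over pixels of (|pred - gt|_1 + eps)**q."""
    l1 = np.sum(np.abs(flow_pred - flow_gt), axis=-1)  # L1 norm over the two flow channels
    return np.mean((l1 + epsilon) ** q)

# Toy usage with one (H, W, 2) flow field per level
pred = np.random.randn(4, 4, 2).astype(np.float32)
gt = np.zeros((4, 4, 2), dtype=np.float32)
print(robust_level_term(pred, gt))

The full loss in [1] weights these terms per level and adds weight decay; if I read the repo code correctly, loss_multiscale uses the plain multiscale loss from pretraining instead, so q and epsilon only take effect when loss_robust is selected.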

By fine-tuning on clean and final and evaluating on the training data I got:

  • clean 1.4 EPE
  • final 1.88 EPE

However, the results on the test data are noticeably worse than the originally reported ones:

  • clean 5.13 (Place 83) in contrast to 4.37 of the original
  • final 6.50 (Place 77) in contrast to 5.04 of the original

I have used the lg-6-2 Net. Could this be an issue of over-fitting? I would appreciate any help to get better results on the test data.

tsenst avatar Jan 17 '19 13:01 tsenst

I think the difference above can be explained by two things:

1. You should take care with the choice of validation set, see https://github.com/lmb-freiburg/flownet2/issues?utf8=%E2%9C%93&q=320 (a rough sketch of a sequence-based split follows below).

2. The data augmentations used in this code differ slightly from the original FlowNet paper, see https://github.com/philferriere/tfoptflow/issues/10. When training on FlyingChairs, you should add them.
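
To make point 1 concrete, here is a minimal sketch (my own, not tfoptflow code) of holding out whole Sintel sequences for validation instead of random frames, so train and validation frames don't come from the same scenes; the sequence names and path layout are just hypothetical:

VAL_SEQUENCES = {'ambush_2', 'cave_4', 'market_6'}  # hypothetical hold-out choice

def split_by_sequence(frame_paths):
    """Split .../<pass>/<sequence>/frame_xxxx.png paths into train/val lists by sequence name."""
    train, val = [], []
    for path in frame_paths:
        seq = path.replace('\\', '/').split('/')[-2]
        (val if seq in VAL_SEQUENCES else train).append(path)
    return train, val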

jsczzzk avatar Jan 18 '19 02:01 jsczzzk

Thanks, I will give it a try, but you mentioned FlowNet2. I want to replicate the PWC-Net results.

tsenst avatar Jan 18 '19 08:01 tsenst

Did you replicate the results successfully?

jsczzzk avatar Feb 13 '19 05:02 jsczzzk

Do you mean for FlowNet2 or PWC-Net?

tsenst avatar Feb 13 '19 09:02 tsenst

pwc-net

jsczzzk avatar Feb 13 '19 14:02 jsczzzk

Apart from the results reported above, I haven't done any further experiments.

tsenst avatar Feb 13 '19 16:02 tsenst

Thank you so much!

jsczzzk avatar Feb 16 '19 10:02 jsczzzk

@tsenst Hi, I also have this problem. Did you find the reason and the corresponding solution?

xianshunw avatar Apr 26 '19 01:04 xianshunw

@tsenst Hi, when I fine-tune the model on MPI-Sintel with your options, the loss and EPE are all 'nan'. Did you meet this problem?

HeliosZhao avatar Sep 16 '19 10:09 HeliosZhao

@tsenst Hi, I also have this problem. Did you find the reason and the corresponding solution?

Hi~ Have you solved the problem?

Blcony avatar Sep 16 '19 14:09 Blcony

@tsenst Hi, I also have this problem. Did you find the reason and the corresponding solution?

Hi~ Have you solved the problem?

No solution, probably because of the data augmentation.
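
If the augmentation or the data pipeline is the suspect, one quick sanity check (my own sketch, not repo code) is to scan the ground-truth .flo files for NaNs or absurd magnitudes before training:

import glob
import numpy as np

def read_flo(path):
    """Minimal Middlebury .flo reader: float32 magic 202021.25, int32 width, int32 height, then H*W*2 float32."""
    with open(path, 'rb') as f:
        magic = np.fromfile(f, np.float32, count=1)[0]
        assert magic == 202021.25, 'invalid .flo file'
        w = int(np.fromfile(f, np.int32, count=1)[0])
        h = int(np.fromfile(f, np.int32, count=1)[0])
        data = np.fromfile(f, np.float32, count=2 * w * h)
    return data.reshape(h, w, 2)

# Path pattern assumes the standard MPI-Sintel training layout
for path in glob.glob('/path/to/MPI-Sintel/training/flow/*/*.flo'):
    flow = read_flo(path)
    if not np.isfinite(flow).all() or np.abs(flow).max() > 1000:
        print(path, np.abs(flow).max())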

xianshunw avatar Sep 16 '19 16:09 xianshunw

@xianshunw @Blcony Hi, I tried to fine-tune / train on MPI-Sintel, but the loss and epe are all 'nan', like this:

2019-09-17 00:36:04 Iter 1000 [Train]: loss=nan, epe=nan, lr=0.000100, samples/sec=6.4, sec/step=0.628, eta=17 days, 10:29:35
2019-09-17 00:36:14 Iter 1000 [Val]: loss=nan, epe=nan

The fine-tuning code is:


from __future__ import absolute_import, division, print_function
import sys
from copy import deepcopy

from dataset_base import _DEFAULT_DS_TUNE_OPTIONS
from dataset_flyingchairs import FlyingChairsDataset
from dataset_flyingthings3d import FlyingThings3DHalfResDataset
from dataset_mixer import MixedDataset
from model_pwcnet import ModelPWCNet, _DEFAULT_PWCNET_FINETUNE_OPTIONS
from dataset_mpisintel import MPISintelDataset

# TODO: You MUST set dataset_root to the correct path on your machine!

_DATASET_ROOT = '/home/zyy/opticalflow/data/'
_MPI_ROOT = _DATASET_ROOT + 'MPI-Sintel'

gpu_devices = ['/device:GPU:0']
controller = '/device:GPU:0'

# TODO: You MUST adjust this setting below based on the amount of memory on your GPU(s)
# Batch size
batch_size = 8

# TODO: You MUST set the batch size based on the capabilities of your GPU(s)
#  Load train dataset
ds_opts = deepcopy(_DEFAULT_DS_TUNE_OPTIONS)
ds_opts['in_memory'] = False                          # Too many samples to keep in memory at once, so don't preload them
ds_opts['aug_type'] = 'heavy'                         # Apply all supported augmentations
ds_opts['batch_size'] = batch_size * len(gpu_devices) # Use a multiple of 8 per GPU; here, 8 on a single GPU
ds_opts['crop_preproc'] = (384, 768)                   # Crop to the size described in [1] (smaller alternative: (256, 448))
ds_opts['train_mode'] = 'fine-tune'
#ds_opts['crop_preproc'] = None

ds_opts['type'] = 'final'
ds_opts['flipud'] = 0              # Only apply horizontal flipping for data augmentation, see [1]
ds_opts['translate'] = (0,0)   # Only apply horizontal flipping for data augmentation, see [1]
ds_opts['scale'] = (0,0)         # Only apply horizontal flipping for data augmentation, see [1]

ds = MPISintelDataset(mode='train_with_val', ds_root=_MPI_ROOT, options=ds_opts)

# Display dataset configuration
ds.print_config()

# Start from the default options
nn_opts = deepcopy(_DEFAULT_PWCNET_FINETUNE_OPTIONS)
nn_opts['verbose'] = True
nn_opts['ckpt_path'] = './models/pwcnet-sm-6-2-multisteps-chairsthingsmix/pwcnet.ckpt-592000'
nn_opts['ckpt_dir'] = './pwcnet-sm-6-2-cyclic-mpisintel_finetuned/MPI-Sintel_onlyfinal'
nn_opts['batch_size'] = ds_opts['batch_size']
nn_opts['x_shape'] = [2, ds_opts['crop_preproc'][0], ds_opts['crop_preproc'][1], 3]
nn_opts['y_shape'] = [ds_opts['crop_preproc'][0], ds_opts['crop_preproc'][1], 2]
nn_opts['use_tf_data'] = True # Use tf.data reader
nn_opts['gpu_devices'] = gpu_devices
nn_opts['controller'] = controller
nn_opts['train_mode'] = 'fine-tune'
# Use the PWC-Net-small model in quarter-resolution mode
nn_opts['use_dense_cx'] = False
nn_opts['use_res_cx'] = False
nn_opts['pyr_lvls'] = 6
nn_opts['flow_pred_lvl'] = 2
# Robust loss as described in [1] doesn't work, so try the following:
nn_opts['loss_fn'] = 'loss_multiscale'  # alternative: 'loss_robust'
nn_opts['q'] = 0.4        # see [1] (alternative: 1.)
nn_opts['epsilon'] = 0.01 # see [1] (alternative: 0.)

# Set the learning rate schedule. This schedule is for a single GPU using a batch size of 8.
# Below, we adjust the schedule to the batch size and the number of GPUs.
nn_opts['lr_policy'] = 'multisteps'
nn_opts['init_lr'] = 1e-05
nn_opts['lr_boundaries'] = [80000, 120000, 160000, 200000]
nn_opts['lr_values'] = [1e-05, 5e-06, 2.5e-06, 1.25e-06, 6.25e-07]
nn_opts['max_steps'] = 200000

# Below, we adjust max_steps and the cyclic LR step size to the batch size and the number of GPUs.
nn_opts['max_steps'] = int(nn_opts['max_steps'] * 8 / ds_opts['batch_size'])
nn_opts['cyclic_lr_stepsize'] = int(nn_opts['cyclic_lr_stepsize'] * 8 / ds_opts['batch_size'])

# Instantiate the model and display the model configuration
nn = ModelPWCNet(mode='train_with_val', options=nn_opts, dataset=ds)
nn.print_config()

# Train the model
nn.train()

Have you ever met this problem?

HeliosZhao avatar Sep 17 '19 00:09 HeliosZhao

@xianshunw @Blcony Hi, I try to fine-tune or train on MPI-Sintel, but the loss and epe are all 'nan'. Have you ever met this problem?
Hi~ I haven't tried to fine-tune on MPI-Sintel, but maybe this link (https://github.com/philferriere/tfoptflow/issues/7) is helpful for you. Maybe you can try it.

Blcony avatar Sep 17 '19 00:09 Blcony

@Blcony by any chance did you manage to implement this (#7) solution? Could you post the code here? I think it should be added between lines 549 and 553 of model_pwcnet.

Thanks, Stefano
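
I don't know exactly what the #7 change looks like, but a generic gradient-clipping insert at the compute/apply-gradients step of a TF1 graph (a sketch only, with a toy loss standing in for the real one) would be roughly:

import tensorflow as tf  # TF 1.x

# Toy stand-in graph; in model_pwcnet the loss and optimizer already exist at that point.
x = tf.placeholder(tf.float32, [None, 2])
w = tf.Variable(tf.ones([2, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

opt = tf.train.AdamOptimizer(1e-5)
grads_and_vars = opt.compute_gradients(loss)
# Clip each gradient's norm before applying, which often prevents nan blow-ups
clipped = [(tf.clip_by_norm(g, 1.0), v) for g, v in grads_and_vars if g is not None]
train_op = opt.apply_gradients(clipped)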

jeffbaena avatar Sep 17 '19 06:09 jeffbaena

@xianshunw @Blcony Hi, I try to fine-tune or train on MPI-Sintel, but the loss and epe are all 'nan'. Have you ever met this problem?

Hi~ I haven't tried to fine-tune on MPI-Sintel, but maybe this link (#7) is helpful for you. Maybe you can try it.

Well, maybe that issue does not solve my problem. I encounter this problem as early as iteration 200, like this:


Start finetuning...
2019-09-17 22:40:18 Iter 100 [Train]: loss=3.39, epe=4.67, lr=0.000010, samples/sec=3.7, sec/step=1.081, eta=5 days, 0:05:35
2019-09-17 22:41:33 Iter 200 [Train]: loss=nan, epe=nan, lr=0.000010, samples/sec=5.6, sec/step=0.710, eta=3 days, 6:48:41
2019-09-17 22:42:37 Iter 300 [Train]: loss=nan, epe=nan, lr=0.000010, samples/sec=6.8, sec/step=0.590, eta=2 days, 17:29:28
2019-09-17 22:43:51 Iter 400 [Train]: loss=nan, epe=nan, lr=0.000010, samples/sec=5.7, sec/step=0.703, eta=3 days, 5:59:56

Thank you very much. Maybe I need to open a new issue.

HeliosZhao avatar Sep 17 '19 15:09 HeliosZhao

@xianshunw @Blcony Hi, I try to fine-tune or train on MPI-Sintel, but the loss and epe are all 'nan'. Have you ever met this problem?

Hi, I have met the same situation. Moreover, this nan stuff does not only appear during fine-tuning, but also during pretraining with Chairs_Things_mix. Did you find the solution?

lelelexxx avatar Apr 04 '20 15:04 lelelexxx

When I trained the model with an RTX 3090 + TF 1.15, I got nan at the first steps (global step 1, 2, etc.). I found that stock TF 1.x does not support the RTX 3090: TF 1.15.x uses CUDA 10.0, and this configuration reports no errors but results in nan loss (even NaN values in the feature maps from the feature_estimator layer). I fixed this by reinstalling TF 1.15 via NVIDIA's maintained build, see https://github.com/nvidia/tensorflow.
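
If you want to verify whether the TF/CUDA/GPU combination itself is the culprit before retraining, a quick standalone check (my own sketch) is to run a small convolution on the GPU and look for non-finite values:

import numpy as np
import tensorflow as tf  # TF 1.x

# Tiny conv forward pass on the GPU; if this already produces NaN/Inf,
# the problem is the TF/CUDA/driver combination, not the training code.
with tf.device('/device:GPU:0'):
    x = tf.constant(np.random.randn(1, 64, 64, 3).astype(np.float32))
    y = tf.layers.conv2d(x, filters=16, kernel_size=3, padding='same')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y)
    print('finite:', np.isfinite(out).all(), 'max abs:', np.abs(out).max())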

yaanggny avatar May 15 '22 01:05 yaanggny