Training Fails on Validation Sanity Check

Open madhawav opened this issue 5 years ago • 1 comment

Hi, when I try to train on a new dataset, training fails during the validation sanity check with the following error:

[PYTHON_ENV_PATH]/neuraltexture/bin/python -u [PROJECT_ROOT]/code/train_neural_texture.py
[PYTHON_ENV_PATH]/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
[PYTHON_ENV_PATH]/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
[PYTHON_ENV_PATH]/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
[PYTHON_ENV_PATH]/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
[PYTHON_ENV_PATH]/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
[PYTHON_ENV_PATH]/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Use pytorch 1.4.0
Load config: configs/neural_texture/config_default.yaml
INFO:lightning:GPU available: True, used: True
INFO:lightning:CUDA_VISIBLE_DEVICES: [0]
[PYTHON_ENV_PATH]/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:23: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
  warnings.warn(*args, **kwargs)
[PYTHON_ENV_PATH]/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:23: RuntimeWarning: You have defined a `test_dataloader()` and have defined a `test_step()`, you may also want to define `test_epoch_end()` for accumulating stats.
  warnings.warn(*args, **kwargs)
Validation sanity check: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "[PROJECT_ROOT]/neuraltexture/code/train_neural_texture.py", line 47, in <module>
    trainer.fit(system)
  File "[PYTHON_ENV_PATH]/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 765, in fit
    self.single_gpu_train(model)
  File "[PYTHON_ENV_PATH]/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 492, in single_gpu_train
    self.run_pretrain_routine(model)
  File "[PYTHON_ENV_PATH]/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 896, in run_pretrain_routine
    eval_results = self._evaluate(model,
  File "[PYTHON_ENV_PATH]/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 322, in _evaluate
    eval_results = model.validation_end(outputs)
  File "[PROJECT_ROOT]/neuraltexture/code/systems/s_core.py", line 33, in validation_end
    for key in outputs[0].keys():
IndexError: list index out of range

Process finished with exit code 1

Additional Information

My dataset: As a sanity check, I use the test images you provide as my dataset. The "datasets" directory contains a folder called "all" with two sub-directories, "train" and "test", and I copied all of the provided test images into both of them.
The working directory is "[PROJECT_ROOT]/code".
My Operating System is Ubuntu 16.04.
PyTorch Lightning 0.7.5 is installed.
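
One quick way to double-check the folder layout described above is to count the image files in each split folder. The following is a minimal sketch and not part of the repository: the function name `count_split_images`, the list of image extensions, and the split names (including `val`, which is only a guess at what the data loader might look for) are all assumptions.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the dataset.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def count_split_images(dataset_root, splits=("train", "val", "test")):
    """Return {split: image count}, with None for a split folder that is missing."""
    root = Path(dataset_root)
    counts = {}
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            counts[split] = None  # folder does not exist
            continue
        # Count image files anywhere below the split folder.
        counts[split] = sum(
            1
            for p in split_dir.rglob("*")
            if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


if __name__ == "__main__":
    # Path taken from dataset.path in the config; run from the "code" directory.
    for split, n in count_split_images("../datasets/all").items():
        print(f"{split}: {'missing' if n is None else n}")
```

A split reported as missing or as 0 would be consistent with a validation loader that yields no batches, which is what the empty `outputs` list in the traceback suggests.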

My "config_default.yaml" is shown below:

version_name: neuraltexture_all_2d_single
device: cuda
n_workers: 8
n_gpus: 1
dim: 2
noise:
  octaves: 8
logger:
  log_files_every_n_iter: 1000
  log_scalars_every_n_iter: 100
  log_validation_every_n_epochs: 1
image:
  image_res: &image_res 128 # (height, width)
texture:
  e: &texture_e 64 # encoding size
dataset:
  name: datasets.images
  path: '../datasets/all'
  use_single: -1 # -1 = all, 0,1,2 for single
system:
  block_main:
    model_texture_encoder:
      model_params:
        name: models.neural_texture.encoder
        type: 'ResNet'
        shape_in:  [[3, *image_res, *image_res]]
        bottleneck_size: 8
    model_texture_mlp:
      model_params:
        name: models.neural_texture.mlp
        type: 'MLP'
        n_max_features: 128
        n_blocks: 4
        dropout_ratio: 0.0
        non_linearity: 'relu'
        bias: True
        encoding: *texture_e
    optimizer_params:
      name: 'adam'
      lr: 0.0001
      weight_decay: 0.0001
    scheduler_params:
      name: 'none'
    loss_params:
      style_weight: 1.0
      style_type: 'mse'
train:
  epochs: 3
  bs: 16
  accumulate_grad_batches: 1
  seed: 41127

Your help is much appreciated.

Jul 10 '20 19:07 madhawav

I had the same issue and tweaked the code in validation_end (systems/s_core.py) a bit to:

if len(outputs) > 0:
    for key in outputs[0].keys():
        logs[key] = torch.stack([x[key] for x in outputs]).mean()
else:
    logs['val_loss'] = torch.tensor(0.)

This is very ad hoc. I think the code needs a 'val' folder in addition to the 'train' and 'test' folders.
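
The guard above can also be written as a small standalone helper. This is a torch-free sketch under stated assumptions: `aggregate_outputs` is a hypothetical name, the per-batch values are plain floats rather than tensors, and returning an empty dict for an empty validation run (instead of a placeholder `val_loss` of 0) is just one possible design choice, not the repository's behavior.

```python
from statistics import mean


def aggregate_outputs(outputs, default=None):
    """Average each key over a list of per-batch dicts.

    If no validation batches were produced, return a copy of `default`
    (or an empty dict) instead of indexing outputs[0], which raises
    IndexError on an empty list.
    """
    if not outputs:
        return dict(default) if default is not None else {}
    return {key: mean(x[key] for x in outputs) for key in outputs[0]}


# Normal case: two validation batches.
print(aggregate_outputs([{"val_loss": 1.0}, {"val_loss": 3.0}]))
# Empty case: the situation that triggered the IndexError in the traceback.
print(aggregate_outputs([], default={"val_loss": 0.0}))
```

Note that a placeholder loss of 0 can look like a perfect validation score to anything that monitors `val_loss`, so returning no value for an empty run may be the safer default.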

Dec 12 '22 19:12 PierrickCh
