
Running training on toy dataset fails

prateekdhawalia opened this issue 3 years ago

Hello, I tried running training on the toy dataset using the default hydra script, and it fails when the loss is set to pca_singleview/pca_multiview, with the following stack trace. Kindly help in resolving this.

scripts/train_hydra.py:22: UserWarning: The version_base parameter is not specified. Please specify a compatability version level, or None. Will assume defaults for version 1.1
  @hydra.main(config_path="configs", config_name="config")
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(

Our Hydra config file:

training parameters

train_batch_size: 16
val_batch_size: 16
test_batch_size: 16
train_prob: 0.8
val_prob: 0.1
train_frames: 1
num_gpus: 0
num_workers: 4
early_stop_patience: 3
unfreezing_epoch: 25
dropout_rate: 0.1
min_epochs: 100
max_epochs: 500
log_every_n_steps: 1
check_val_every_n_epoch: 10
gpu_id: 0
unlabeled_sequence_length: 16
rng_seed_data_pt: 42
rng_seed_data_dali: 43
rng_seed_model_pt: 44
limit_train_batches: 10
multiple_trainloader_mode: max_size_cycle
profiler: simple
accumulate_grad_batches: 2
lr_scheduler: multisteplr
lr_scheduler_params: {'multisteplr': {'milestones': [100, 200, 300], 'gamma': 0.5}}


losses parameters

pca_multiview: {'log_weight': 7.0, 'components_to_keep': 3, 'empirical_epsilon_percentile': 1.0, 'empirical_epsilon_multiplier': 1.0, 'epsilon': None, 'error_metric': 'reprojection_error'}
pca_singleview: {'log_weight': 7.25, 'components_to_keep': 0.99, 'empirical_epsilon_percentile': 1.0, 'empirical_epsilon_multiplier': 1.0, 'epsilon': None, 'error_metric': 'reprojection_error'}
temporal: {'log_weight': 7.5, 'epsilon': [12.9, 11.3, 10.5, 12.0, 5.0, 7.3, 0.7, 61.8, 11.2, 9.9, 9.7, 10.1, 4.8, 4.9, 1.0, 19.2, 6.8]}
unimodal_mse: {'log_weight': 6.5, 'prob_threshold': 0.0}
unimodal_kl: {'log_weight': 6.5, 'prob_threshold': 0.0}


data parameters

image_orig_dims: {'width': 396, 'height': 406}
image_resize_dims: {'width': 256, 'height': 256}
data_dir: toy_datasets/toymouseRunningData
video_dir: unlabeled_videos
csv_file: CollectedData_.csv
header_rows: [1, 2]
downsample_factor: 2
num_keypoints: 17
mirrored_column_matches: [[0, 1, 2, 3, 4, 5, 6], [8, 9, 10, 11, 12, 13, 14]]
columns_for_singleview_pca: [0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14]


model parameters

losses_to_use: ['pca_singleview']
learn_weights: False
resnet_version: 50
model_type: heatmap
heatmap_loss_type: mse
model_name: my_base_toy_model


callbacks parameters

anneal_weight: {'attr_name': 'total_unsupervised_importance', 'init_val': 0.0, 'increase_factor': 0.01, 'final_val': 1.0, 'freeze_until_epoch': 0}
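
(For reference, the anneal_weight block above describes a ramp of total_unsupervised_importance over epochs. A rough sketch of how I read that schedule follows; this is my own reading of the config, not necessarily the exact lightning-pose implementation.)

```python
# Sketch of the anneal_weight schedule as I read it from the config above
# (assumed linear ramp; may differ from the actual callback code).
def total_unsupervised_importance(
    epoch: int,
    init_val: float = 0.0,
    increase_factor: float = 0.01,
    final_val: float = 1.0,
    freeze_until_epoch: int = 0,
) -> float:
    if epoch < freeze_until_epoch:
        return init_val
    ramped = init_val + increase_factor * (epoch - freeze_until_epoch)
    return min(ramped, final_val)

# 0.0 at epoch 0, 0.5 at epoch 50, capped at 1.0 from epoch 100 onward
print([total_unsupervised_importance(e) for e in (0, 50, 100, 200)])
```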

/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2895.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Number of labeled images in the full dataset (train+val+test): 90
Size of -- train set: 72, val set: 9, test set: 9
Warning: the argument {farg[0]} shadows a Pipeline constructor argument of the same name.
[/opt/dali/dali/operators/reader/loader/video_loader.h:178] file_list_include_preceding_frame is set to False (or not set at all). In future releases, the default behavior would be changed to True.
[/opt/dali/dali/operators/reader/nvdecoder/nvdecoder.cc:80] Warning: Decoding on a default stream. Performance may be affected.
Results of running PCA (pca_singleview) on keypoints:
Kept 13/28 components, and found:
Explained variance ratio: [0.315 0.242 0.209 0.073 0.048 0.034 0.021 0.015 0.01 0.007 0.007 0.005 0.004 0.003 0.002 0.001 0.001 0.001 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
Variance explained by 13 components: 0.991
/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/losses/losses.py:326: UserWarning: Using empirical epsilon=0.194 * multiplier=1.000 -> total=0.194 for pca_singleview loss
  warnings.warn(
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py:22: LightningDeprecationWarning: pytorch_lightning.core.lightning.LightningModule has been deprecated in v1.7 and will be removed in v1.9. Use the equivalent class from the pytorch_lightning.core.module.LightningModule class instead.
  rank_zero_deprecation(

Initializing a SemiSupervisedHeatmapTracker instance.
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torchvision/models/_utils.py:135: UserWarning: Using 'weights' as positional parameter(s) is deprecated since 0.13 and will be removed in 0.15. Please use keyword parameter(s) instead.
  warnings.warn(
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing weights=ResNet50_Weights.IMAGENET1K_V1. You can also use weights=ResNet50_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:446: LightningDeprecationWarning: Setting Trainer(gpus=[0]) is deprecated in v1.7 and will be removed in v2.0. Please use Trainer(accelerator='gpu', devices=[0]) instead.
  rank_zero_deprecation(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:285: LightningDeprecationWarning: The Callback.on_epoch_start hook was deprecated in v1.6 and will be removed in v1.8. Please use Callback.on_<train/validation/test>_epoch_start instead.
  rank_zero_deprecation(
Missing logger folder: tb_logs/my_base_toy_model
Number of labeled images in the full dataset (train+val+test): 90
Size of -- train set: 72, val set: 9, test set: 9
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name               | Type               | Params
0 | backbone           | Sequential         | 23.5 M
1 | loss_factory       | LossFactory        | 0
2 | upsampling_layers  | Sequential         | 81.0 K
3 | rmse_loss          | RegressionRMSELoss | 0
4 | loss_factory_unsup | LossFactory        | 0

134 K     Trainable params
23.5 M    Non-trainable params
23.6 M    Total params
94.356    Total estimated model params size (MB)

/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:219: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 6 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:   0%| | 0/10 [00:00<?, ?it/s]
/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/data/dali.py:103: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return torch.tensor(

Error executing job with overrides: []
Traceback (most recent call last):
  File "scripts/train_hydra.py", line 110, in train
    trainer.fit(model=model, datamodule=data_module)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
    results = self._run_stage()
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
    return self._run_train()
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
    self.fit_loop.run()
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 240, in _run_optimization
    closure()
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 141, in closure
    self._backward_fn(step_output.closure_loss)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 304, in backward_fn
    self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1706, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 191, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, optimizer_idx, *args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
    model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1418, in backward
    loss.backward(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 14]], which is output 0 of LinalgVectorNormBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0:   0%|
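
For what it's worth, the RuntimeError above is autograd's generic complaint that a tensor saved for the backward pass was modified in place between the forward pass and backward(). Below is a tiny standalone snippet that reproduces the same class of failure (not the lightning-pose code, just the pattern), along with the anomaly-detection flag the hint mentions; with that flag on, the traceback also points at the forward op whose saved tensor was clobbered.

```python
import torch

# makes the failing forward op show up in the traceback, as the hint suggests
torch.autograd.set_detect_anomaly(True)

x = torch.randn(16, 14, requires_grad=True)
norms = torch.linalg.vector_norm(x, dim=1)  # output is saved for backward (LinalgVectorNormBackward0)
loss = norms.sum()
norms.add_(1.0)   # in-place edit bumps the saved tensor's version counter
loss.backward()   # RuntimeError: ... modified by an inplace operation
```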

prateekdhawalia · Aug 18 '22 19:08

@prateekdhawalia --- it's definitely a new one; I'm on paternity leave, so @themattinthehatt will assist you soon. Please do try to check out the develop branch and let us know if you run into the same error.

danbider · Aug 18 '22 19:08

@danbider I tried the develop branch and it works fine. Got this issue in the main branch. Thanks for the help.

prateekdhawalia · Aug 19 '22 07:08

@themattinthehatt when you have a chance, could you please merge develop --> main?

danbider · Aug 19 '22 14:08

@danbider @themattinthehatt Will this framework work on images where the object occupies only 20 to 30% of the image area? Basically, detecting keypoints on small objects within a larger image. If yes, kindly suggest an approach.

prateekdhawalia · Aug 25 '22 13:08

@danbider I'll run the develop branch through the testing framework then merge into main; will update you all when this is complete.

@prateekdhawalia the framework should work fine if the object is smaller - are you dealing with a freely moving animal in an arena? If not (i.e. the animal is stationary), I'd suggest cropping around the animal first before training the models.
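
To make the cropping suggestion concrete, here's a rough, untested sketch of what I mean; it assumes you know a fixed bounding box for each session and have a DLC-style labels CSV. The file names and box coordinates below are just placeholders, not anything lightning-pose provides.

```python
# Untested sketch: crop labeled frames and shift the keypoint labels to match.
# "my_labels.csv", the crop box, and the directory layout are placeholders.
from pathlib import Path

import pandas as pd
from PIL import Image

x0, y0, x1, y1 = 50, 80, 306, 336  # hypothetical fixed crop box (pixels)

# DLC-style csv: two header rows (bodyparts, coords), image paths in the first column
labels = pd.read_csv("my_labels.csv", header=[1, 2], index_col=0)

out_dir = Path("cropped_frames")
out_dir.mkdir(exist_ok=True)

for img_path in labels.index:
    # image paths in the csv are assumed relative to the current working directory
    Image.open(img_path).crop((x0, y0, x1, y1)).save(out_dir / Path(img_path).name)

# shift x/y coordinates into the cropped frame
for col in labels.columns:
    if col[-1] == "x":
        labels[col] -= x0
    elif col[-1] == "y":
        labels[col] -= y0

labels.to_csv(out_dir / "my_labels_cropped.csv")
```

You'd then point data_dir/csv_file at the cropped copies and update image_orig_dims in the config accordingly.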

themattinthehatt · Aug 25 '22 17:08

@prateekdhawalia I've now merged develop into main; please raise another issue if you run into more troubles.

themattinthehatt · Aug 25 '22 20:08

@themattinthehatt Thanks for the response. My use case involves a freely moving object. Also, I tried both the SemiSupervisedHeatMap and SemiSupervisedRegression models. The heatmap model did not give good performance on unlabeled video, but that may be because of the small amount of labeled data (260 images). The regression model fails during the predict step because predict_step() is not implemented for it. Is this a bug, or was it left out intentionally? Kindly suggest whether the heatmap or regression model should be used.

prateekdhawalia · Aug 26 '22 10:08

Hi @prateekdhawalia, sorry to hear you didn't see good performance on your unlabeled video. 260 labeled images should be a reasonable amount - how many labeled keypoints do you have per frame?

Apologies for the lack of predict_step() for the regression model - that was not intentional, we just haven't updated that model yet. I just raised an issue to that effect and will fix it asap. In general though we've found much better performance with the heatmap models.
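
For reference, a predict_step in Lightning is just a thin wrapper around the forward pass, so there's nothing deep missing. Here's a minimal self-contained sketch of the pattern (a toy module, not the actual lightning-pose regression tracker; the attribute names are made up):

```python
import torch
import pytorch_lightning as pl


class KeypointRegressor(pl.LightningModule):
    """Toy stand-in for a regression-based tracker (not the lightning-pose class)."""

    def __init__(self, num_keypoints: int = 17):
        super().__init__()
        # trivial backbone just to make the example runnable
        self.backbone = torch.nn.Sequential(
            torch.nn.Flatten(), torch.nn.LazyLinear(2 * num_keypoints)
        )

    def forward(self, images):
        return self.backbone(images)  # (batch, 2 * num_keypoints)

    def predict_step(self, batch, batch_idx, dataloader_idx=0):
        images = batch["images"] if isinstance(batch, dict) else batch
        keypoints = self(images)
        return keypoints.reshape(keypoints.shape[0], -1, 2)  # (batch, num_keypoints, 2)
```

Trainer.predict(model, dataloaders=...) then calls predict_step once per batch and collects the returned tensors.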

themattinthehatt · Aug 29 '22 23:08

Hi @themattinthehatt, I have 5 labeled keypoints per frame. Thanks for the info that heatmaps are more accurate. Also, I have noticed that when I use DLC image augmentation with the image rotation set above 10, the code throws the error below.

Error executing job with overrides: []
Traceback (most recent call last):
  File "scripts/train_hydra.py", line 175, in train
    trainer.fit(model=model, datamodule=data_module)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
    results = self._run_stage()
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
    return self._run_train()
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
    self.fit_loop.run()
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1552, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1673, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/optim/adam.py", line 118, in step
    loss = closure()
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure
    closure_result = closure()
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 132, in closure
    step_output = self._step_fn()
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 407, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1706, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 358, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/base.py", line 347, in training_step
    loss = self.evaluate_labeled(train_batch, "train")
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/base.py", line 321, in evaluate_labeled
    data_dict = self.get_loss_inputs_labeled(batch_dict=batch_dict)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/heatmap_tracker.py", line 233, in get_loss_inputs_labeled
    predicted_keypoints, confidence = self.run_subpixelmaxima(predicted_heatmaps)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/heatmap_tracker.py", line 143, in run_subpixelmaxima
    confidences = evaluate_heatmaps_at_location(heatmaps=softmaxes, locs=preds)
  File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/data/utils.py", line 333, in evaluate_heatmaps_at_location
    heatmaps_padded[i, j, k_offset, l_offset].squeeze(-1).squeeze(-1)
IndexError: index -9223372036854775808 is out of bounds for dimension 2 with size 388

Kindly check if there is a bug and help in correcting this.

prateekdhawalia · Sep 01 '22 18:09

@prateekdhawalia I opened a new issue for this, will look into it today https://github.com/danbider/lightning-pose/issues/59#issue-1360165519
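
One quick observation in the meantime: index -9223372036854775808 is torch.iinfo(torch.int64).min, which is what you typically get when a NaN floating-point coordinate is cast to a long. So my guess (only a guess until I dig in) is that large rotations push some keypoints outside the frame, the subpixel argmax comes back as NaN, and the integer indexing in evaluate_heatmaps_at_location falls over. A quick illustration of that cast:

```python
import torch

# a keypoint pushed outside the frame can end up as NaN after augmentation
preds = torch.tensor([float("nan"), 120.0])
print(preds.long())  # -> tensor([-9223372036854775808, 120]) on a typical build
```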

themattinthehatt · Sep 02 '22 13:09