Running training on toy dataset fails
Hello, I tried running training on the toy dataset using the default hydra script, and it fails when the loss is set to pca_singleview/pca_multiview, with the stack trace below. Kindly help in resolving this.
scripts/train_hydra.py:22: UserWarning: The version_base parameter is not specified. Please specify a compatability version level, or None. Will assume defaults for version 1.1
@hydra.main(config_path="configs", config_name="config")
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default. See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(

Our Hydra config file:
training parameters
train_batch_size: 16
val_batch_size: 16
test_batch_size: 16
train_prob: 0.8
val_prob: 0.1
train_frames: 1
num_gpus: 0
num_workers: 4
early_stop_patience: 3
unfreezing_epoch: 25
dropout_rate: 0.1
min_epochs: 100
max_epochs: 500
log_every_n_steps: 1
check_val_every_n_epoch: 10
gpu_id: 0
unlabeled_sequence_length: 16
rng_seed_data_pt: 42
rng_seed_data_dali: 43
rng_seed_model_pt: 44
limit_train_batches: 10
multiple_trainloader_mode: max_size_cycle
profiler: simple
accumulate_grad_batches: 2
lr_scheduler: multisteplr
lr_scheduler_params: {'multisteplr': {'milestones': [100, 200, 300], 'gamma': 0.5}}
losses parameters
pca_multiview: {'log_weight': 7.0, 'components_to_keep': 3, 'empirical_epsilon_percentile': 1.0, 'empirical_epsilon_multiplier': 1.0, 'epsilon': None, 'error_metric': 'reprojection_error'}
pca_singleview: {'log_weight': 7.25, 'components_to_keep': 0.99, 'empirical_epsilon_percentile': 1.0, 'empirical_epsilon_multiplier': 1.0, 'epsilon': None, 'error_metric': 'reprojection_error'}
temporal: {'log_weight': 7.5, 'epsilon': [12.9, 11.3, 10.5, 12.0, 5.0, 7.3, 0.7, 61.8, 11.2, 9.9, 9.7, 10.1, 4.8, 4.9, 1.0, 19.2, 6.8]}
unimodal_mse: {'log_weight': 6.5, 'prob_threshold': 0.0}
unimodal_kl: {'log_weight': 6.5, 'prob_threshold': 0.0}
data parameters
image_orig_dims: {'width': 396, 'height': 406}
image_resize_dims: {'width': 256, 'height': 256}
data_dir: toy_datasets/toymouseRunningData
video_dir: unlabeled_videos
csv_file: CollectedData_.csv
header_rows: [1, 2]
downsample_factor: 2
num_keypoints: 17
mirrored_column_matches: [[0, 1, 2, 3, 4, 5, 6], [8, 9, 10, 11, 12, 13, 14]]
columns_for_singleview_pca: [0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14]
model parameters
losses_to_use: ['pca_singleview']
learn_weights: False
resnet_version: 50
model_type: heatmap
heatmap_loss_type: mse
model_name: my_base_toy_model
callbacks parameters
anneal_weight: {'attr_name': 'total_unsupervised_importance', 'init_val': 0.0, 'increase_factor': 0.01, 'final_val': 1.0, 'freeze_until_epoch': 0}
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2895.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Number of labeled images in the full dataset (train+val+test): 90
Size of -- train set: 72, val set: 9, test set: 9
Warning: the argument {farg[0]} shadows a Pipeline constructor argument of the same name.
[/opt/dali/dali/operators/reader/loader/video_loader.h:178] file_list_include_preceding_frame is set to False (or not set at all). In future releases, the default behavior would be changed to True.
[/opt/dali/dali/operators/reader/nvdecoder/nvdecoder.cc:80] Warning: Decoding on a default stream. Performance may be affected.
Results of running PCA (pca_singleview) on keypoints:
Kept 13/28 components, and found:
Explained variance ratio: [0.315 0.242 0.209 0.073 0.048 0.034 0.021 0.015 0.01 0.007 0.007 0.005
0.004 0.003 0.002 0.001 0.001 0.001 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. ]
Variance explained by 13 components: 0.991
/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/losses/losses.py:326: UserWarning: Using empirical epsilon=0.194 * multiplier=1.000 -> total=0.194 for pca_singleview loss
warnings.warn(
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py:22: LightningDeprecationWarning: pytorch_lightning.core.lightning.LightningModule has been deprecated in v1.7 and will be removed in v1.9. Use the equivalent class from the pytorch_lightning.core.module.LightningModule class instead.
rank_zero_deprecation(
Initializing a SemiSupervisedHeatmapTracker instance.
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torchvision/models/_utils.py:135: UserWarning: Using 'weights' as positional parameter(s) is deprecated since 0.13 and will be removed in 0.15. Please use keyword parameter(s) instead.
warnings.warn(
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing weights=ResNet50_Weights.IMAGENET1K_V1. You can also use weights=ResNet50_Weights.DEFAULT to get the most up-to-date weights.
warnings.warn(msg)
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:446: LightningDeprecationWarning: Setting Trainer(gpus=[0]) is deprecated in v1.7 and will be removed in v2.0. Please use Trainer(accelerator='gpu', devices=[0]) instead.
rank_zero_deprecation(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:285: LightningDeprecationWarning: The Callback.on_epoch_start hook was deprecated in v1.6 and will be removed in v1.8. Please use Callback.on_<train/validation/test>_epoch_start instead.
rank_zero_deprecation(
Missing logger folder: tb_logs/my_base_toy_model
Number of labeled images in the full dataset (train+val+test): 90
Size of -- train set: 72, val set: 9, test set: 9
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
0 | backbone | Sequential | 23.5 M
1 | loss_factory | LossFactory | 0
2 | upsampling_layers | Sequential | 81.0 K
3 | rmse_loss | RegressionRMSELoss | 0
4 | loss_factory_unsup | LossFactory | 0
134 K Trainable params
23.5 M Non-trainable params
23.6 M Total params
94.356 Total estimated model params size (MB)
/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:219: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 6 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/10 [00:00<?, ?it/s]/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/data/dali.py:103: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
return torch.tensor(
Error executing job with overrides: []
Traceback (most recent call last):
File "scripts/train_hydra.py", line 110, in train
trainer.fit(model=model, datamodule=data_module)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
results = self._run_stage()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
return self._run_train()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
self.fit_loop.run()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
batch_output = self.batch_loop.run(kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 240, in _run_optimization
closure()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in call
self._result = self.closure(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 141, in closure
self._backward_fn(step_output.closure_loss)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 304, in backward_fn
self.trainer._call_strategy_hook("backward", loss, optimizer, opt_idx)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1706, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 191, in backward
self.precision_plugin.backward(self.lightning_module, closure_loss, optimizer, optimizer_idx, *args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 80, in backward
model.backward(closure_loss, optimizer, optimizer_idx, *args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1418, in backward
loss.backward(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 14]], which is output 0 of LinalgVectorNormBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
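For reference, this class of RuntimeError means a tensor that autograd saved for the backward pass was later edited in place, and the anomaly-detection hint can be followed by calling torch.autograd.set_detect_anomaly(True) before trainer.fit to pinpoint the offending operation. Below is a minimal, hypothetical sketch of the same failure mode; it is not the lightning-pose code path, only an illustration of how an in-place edit on the output of a norm op triggers the identical error.

import torch

# Hypothetical repro of the same error class (not lightning-pose code):
# the output of a vector-norm op is saved for its backward pass, so editing it
# in place bumps its version counter and backward() refuses to run.
x = torch.randn(16, 14, requires_grad=True)
norms = torch.linalg.norm(x, dim=1)   # grad_fn = LinalgVectorNormBackward0
norms.clamp_(min=1e-6)                # in-place edit of a tensor needed for gradients
norms.sum().backward()                # RuntimeError: ... modified by an inplace operation

# To locate the offending op inside a full training run, enable anomaly detection
# once before calling trainer.fit(...):
# torch.autograd.set_detect_anomaly(True)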
@prateekdhawalia --- it's definitely a new one; I'm on paternity leave, so @themattinthehatt will assist you soon.
Please do try checking out the develop branch and let us know if you run into the same error.
@danbider I tried the develop branch and it works fine. I got this issue on the main branch. Thanks for the help.
@themattinthehatt when you have a chance, could you please merge develop --> main?
@danbider @themattinthehatt Will this framework work on images where the object occupies only 20 to 30% of the image area? Basically, detecting keypoints on a small object in a larger image. If yes, kindly suggest an approach.
@danbider I'll run the develop branch through the testing framework then merge into main; will update you all when this is complete.
@prateekdhawalia the framework should work fine if the object is smaller - are you dealing with a freely moving animal in an arena? If not (i.e. the animal is stationary), I'd suggest cropping around the animal first before training the models.
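To illustrate the cropping suggestion for the stationary case, here is a rough, hypothetical sketch: crop every labeled frame to a fixed box and shift the keypoint labels by the crop offset before training. The file names, crop box, and DLC-style CSV layout below are assumptions, not part of lightning-pose.

from PIL import Image
import pandas as pd

# Hypothetical pre-processing sketch: crop a fixed box around a stationary animal
# and shift DLC-style labels by the crop offset. Paths, box, and CSV layout are assumed.
x0, y0 = 100, 50                      # top-left corner of the crop box (pixels)
x1, y1 = x0 + 256, y0 + 256           # bottom-right corner

labels = pd.read_csv("CollectedData_.csv", header=[1, 2], index_col=0)
for img_path in labels.index:
    img = Image.open(img_path)
    img.crop((x0, y0, x1, y1)).save(img_path)  # a real pipeline would write to a new folder

# x columns shift by x0 and y columns by y0 (columns alternate x/y per keypoint)
shift = [x0 if coord == "x" else y0 for (_, coord) in labels.columns]
labels = labels - shift
labels.to_csv("CollectedData_cropped.csv")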
@prateekdhawalia I've now merged develop into main; please raise another issue if you run into more troubles.
@themattinthehatt Thanks for the response. My use case involves a freely moving object. Also, I tried both the SemiSupervisedHeatMap and SemiSupervisedRegression models. The heatmap model did not give good performance on the unlabeled video, but that may be because of the very small amount of labeled data (260 images). The regression model fails during the predict step as there is no implementation of predict_step() for it. Is this a bug, or was it done intentionally for a reason? Kindly suggest whether the heatmap or regression model should be used.
Hi @prateekdhawalia, sorry to hear you didn't see good performance on your unlabeled video. 260 labeled images should be a reasonable amount - how many labeled keypoints do you have per frame?
Apologies for the lack of predict_step() for the regression model - that was not intentional; we just haven't updated that model yet. I just raised an issue to that effect and will fix it ASAP. In general, though, we've found much better performance with the heatmap models.
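For anyone hitting the missing hook in the meantime, predict_step is just a standard LightningModule method, so a temporary workaround is to override it on the model before calling trainer.predict. The snippet below is a generic, self-contained sketch of the hook's signature with a pass-through implementation; it is not the lightning-pose regression model, and the forward pass shown is a stand-in.

import torch
import pytorch_lightning as pl

# Generic sketch of the predict_step hook (a stand-in model, not lightning-pose's regression tracker).
class TinyRegressor(pl.LightningModule):
    def __init__(self, num_keypoints: int = 5):
        super().__init__()
        # trivial backbone: global-average-pool the image, then map to (x, y) per keypoint
        self.head = torch.nn.Sequential(
            torch.nn.AdaptiveAvgPool2d(1),
            torch.nn.Flatten(),
            torch.nn.Linear(3, 2 * num_keypoints),
        )

    def forward(self, images):
        return self.head(images)  # shape (batch, 2 * num_keypoints)

    def predict_step(self, batch, batch_idx, dataloader_idx=0):
        # trainer.predict() calls this hook for every batch; returning the forward
        # output is enough for a model with no custom prediction logic.
        return self(batch)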
Hi @themattinthehatt, I have 5 labeled keypoints per frame. Thanks for the info that heatmaps are more accurate. Also, I have noticed that when I use DLC image augmentation and the image rotation augmentation is above 10, the code throws the error below.
Error executing job with overrides: []
Traceback (most recent call last):
File "scripts/train_hydra.py", line 175, in train
trainer.fit(model=model, datamodule=data_module)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1168, in _run
results = self._run_stage()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1254, in _run_stage
return self._run_train()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in _run_train
self.fit_loop.run()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 270, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 203, in advance
batch_output = self.batch_loop.run(kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 87, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 201, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 248, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 358, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1552, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1673, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 216, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 153, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/optim/optimizer.py", line 113, in wrapper
return func(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/torch/optim/adam.py", line 118, in step
loss = closure()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 138, in _wrap_closure
closure_result = closure()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 146, in __call__
self._result = self.closure(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 132, in closure
step_output = self._step_fn()
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 407, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1706, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 358, in training_step
return self.model.training_step(*args, **kwargs)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/base.py", line 347, in training_step
loss = self.evaluate_labeled(train_batch, "train")
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/base.py", line 321, in evaluate_labeled
data_dict = self.get_loss_inputs_labeled(batch_dict=batch_dict)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/heatmap_tracker.py", line 233, in get_loss_inputs_labeled
predicted_keypoints, confidence = self.run_subpixelmaxima(predicted_heatmaps)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/models/heatmap_tracker.py", line 143, in run_subpixelmaxima
confidences = evaluate_heatmaps_at_location(heatmaps=softmaxes, locs=preds)
File "/anaconda/envs/lightning-pose/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
File "/home/walthamadmin/notebooks/projects/lightning-pose/lightning_pose/data/utils.py", line 333, in evaluate_heatmaps_at_location
heatmaps_padded[i, j, k_offset, l_offset].squeeze(-1).squeeze(-1)
IndexError: index -9223372036854775808 is out of bounds for dimension 2 with size 388
Kindly check if there is a bug and help in correcting it.
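One possibly useful observation: -9223372036854775808 is INT64_MIN, which is what a NaN coordinate typically collapses to when cast to an integer index, so the error would be consistent with keypoints being pushed outside the image (and becoming NaN) by the larger rotations. A hypothetical illustration, not the actual lightning-pose code path:

import torch

# A NaN keypoint coordinate cast to int64 typically becomes INT64_MIN, which then
# fails the bounds check when used as a heatmap index (hypothetical illustration).
pred = torch.tensor([float("nan"), 12.3])  # e.g. a keypoint rotated out of frame
idx = pred.long()
print(idx)  # tensor([-9223372036854775808, 12]) on most platforms

heatmap = torch.zeros(1, 1, 388, 388)
value = heatmap[0, 0, idx[0], idx[1]]  # IndexError: index -9223372036854775808 is out of bounds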
@prateekdhawalia I opened a new issue for this and will look into it today: https://github.com/danbider/lightning-pose/issues/59#issue-1360165519