Vector Model training crashes
Describe the bug
Training of the vector model crashes. I think this is the same error as in #82; however, caching the features prior to training no longer resolves the issue. Additionally, an issue in torch 1.9.0 causes the error to be reported not as a shape mismatch in the linear layer's forward pass but as a shape mismatch for the gradient during backward. The root cause (as stated in #82) is probably the validity condition of the agents feature, which produces an invalid feature when there are no agents in the scene.
Caching results in: `Completed dataset caching! Failed features and targets: 41 out of 1533645.`
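To illustrate the suspected failure mode, here is a minimal standalone sketch (not devkit code; the layer sizes are taken from the error message below, and all names are made up): a scene with zero agents yields an empty agent tensor whose feature width does not match what the model's linear layer expects.
```
import torch
import torch.nn as nn

# Model built for 128-dim agent features (illustrative size from the error message).
agent_encoder = nn.Linear(128, 64)

# Empty scene: zero agents, but the flattened feature width comes out as 134.
empty_agents = torch.zeros(0, 134)

try:
    out = agent_encoder(empty_agents)
    out.sum().backward()
except RuntimeError as e:
    # On recent torch / CPU the mismatch is reported in the forward pass.
    # The report suggests that on torch 1.9.0 + CUDA it only surfaced in
    # backward, as "MmBackward returned an invalid gradient".
    print(e)
```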
Setup
- devkit-0.6 (environment was recreated and features re-cached)
- NVIDIA A100 GPU
Steps To Reproduce
Steps to reproduce the behavior:
1. Features are cached for the entire mini dataset:
```
python ~/nuplan/nuplan-devkit/nuplan/planning/script/run_training.py \
+training=training_vector_model \
py_func=cache \
cache.cache_path=/path/to/cache \
data_augmentation="[]"
```
2. Training the vector model on the cached data results in the error:
```
python ~/nuplan/nuplan-devkit/nuplan/planning/script/run_training.py \
+training=training_vector_model \
py_func=train \
cache.cache_path=/path/to/cache \
scenario_filter.limit_total_scenarios=100000 \
data_loader.params.batch_size=16 \
data_loader.params.num_workers=8 \
scenario_builder=nuplan_mini \
scenario_builder.data_root=/path/to/nuplan-v1.0/mini \
experiment_name=baseline_lgcn
```
Stack Trace
```
Traceback (most recent call last):
File "/home/aah1si/nuplan/nuplan-devkit/nuplan/planning/script/run_training.py", line 61, in main
engine.trainer.fit(model=engine.model, datamodule=engine.datamodule)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
self._run(model)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
self.dispatch()
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
self.accelerator.start_training(self)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
return self.run_train()
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
self.train_loop.run_training_epoch()
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
model_ref.optimizer_step(
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 325, in optimizer_step
make_optimizer_step = self.precision_plugin.pre_optimizer_step(
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 93, in pre_optimizer_step
result = lambda_closure()
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 732, in train_step_and_backward_closure
result = self.training_step_and_backward(
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 836, in training_step_and_backward
self.backward(result, optimizer, opt_idx)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 869, in backward
result.closure_loss = self.trainer.accelerator.backward(
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 308, in backward
output = self.precision_plugin.backward(
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 62, in backward
closure_loss = super().backward(model, closure_loss, optimizer, opt_idx, should_accumulate, *args, **kwargs)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 79, in backward
model.backward(closure_loss, optimizer, opt_idx)
File "/home/aah1si/.conda/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1275, in backward
loss.backward(*args, **kwargs)
File "/home/aah1si/.local/lib/python3.9/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/aah1si/.local/lib/python3.9/site-packages/torch/autograd/__init__.py", line 147, in backward
Variable._execution_engine.run_backward(
RuntimeError: Function MmBackward returned an invalid gradient at index 0 - got [0, 134] but expected shape compatible with [0, 128]
```
Hi @mh0797, apologies for the delayed response, and thank you for the investigation. We are looking into this and making sure empty agent features are being handled correctly.
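To illustrate what handling empty agent features could look like, here is a minimal sketch (illustrative only, not the actual devkit code; `AGENT_FEATURE_DIM` and `build_agents_tensor` are made-up names): the feature builder keeps the last dimension stable even when a scene contains no agents, so downstream linear layers always see the input width they were built for.
```
import torch

AGENT_FEATURE_DIM = 8  # illustrative per-agent feature width, not the devkit value

def build_agents_tensor(agent_states):
    """Return an agents tensor with a consistent last dimension, even for empty scenes."""
    if len(agent_states) == 0:
        # Zero rows, but the expected feature width, so downstream layers
        # still receive the input size they were built for.
        return torch.zeros(0, AGENT_FEATURE_DIM)
    return torch.stack(
        [torch.as_tensor(s, dtype=torch.float32) for s in agent_states]
    )
```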
Hi @mh0797, sorry for the late reply. Could you try again with the latest version? Could you also give us some information about your environment? Thanks
The error occurred with the current devkit-0.6 version. I recreated the environment after updating to this version.
I encountered the same error.
Hi @mh0797,
Are you still facing the same issue since the v1.0 release?
Hi @patk-motional, sorry for taking so long: I had to set up a fresh environment, cache a new dataset, and run a training, which is very time-consuming. I was able to train the model for an entire epoch on the nuplan-mini dataset without error, so I guess we can close this issue. For anybody interested, here is the training command:
```
python ~/nuplan/nuplan-devkit/nuplan/planning/script/run_training.py \
+training=training_vector_model \
py_func=train \
cache.cache_path=/path/to/cache \
data_loader.params.batch_size=2 \
lightning.trainer.params.max_epochs=10 \
optimizer.lr=5e-5 \
experiment_name=vector_model \
lightning.trainer.params.max_time=null \
scenario_filter.remove_invalid_goals=true
```