Error while loading data for transfuser
Hello, while attempting to run transfuser training (using ./run_transfuser_training.sh), I am getting the following error:
Traceback (most recent call last):
File "/mnt/disks/data/sim2real/CVPR-challenge/navsim/navsim/navsim/planning/script/run_training.py", line 119, in main
trainer.fit(
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
results = self._run_stage()
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
self.fit_loop.run()
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
self.advance()
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
self.advance(data_fetcher)
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 212, in advance
batch, _, __ = next(data_fetcher)
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
batch = super().__next__()
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
batch = next(self.iterator)
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
out = next(self._iterator)
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 78, in __next__
out[i] = next(self.iterators[i])
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 635, in __next__
data = self._next_data()
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 679, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
return self.collate_fn(data)
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 267, in default_collate
return collate(batch, collate_fn_map=default_collate_fn_map)
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in collate
return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in <listcomp>
return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in collate
return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in <dictcomp>
return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 119, in collate
return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 164, in collate_tensor_fn
return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [3, 256, 1024] at entry 0 and [1, 256, 1024] at entry 2
Previously, I was getting a different error, "RuntimeError: Trying to resize storage that is not resizable", when the number of workers was 4 with a prefetch factor of 2, so I changed them to 0 and None. Is there a fix you could suggest for this?
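For reference, here is a minimal standalone sketch that reproduces the collate failure above. The "camera_feature" key and the tensor shapes are hypothetical stand-ins for the cached features; it only illustrates why default_collate fails when one sample in the batch has a different channel count.

```python
# Minimal reproduction: default_collate stacks per-key tensors with torch.stack,
# which requires identical shapes across the batch. One sample with a single
# channel (entry 2) is enough to trigger the RuntimeError seen above.
import torch
from torch.utils.data import DataLoader, Dataset


class DummyFeatureDataset(Dataset):
    def __init__(self):
        self.samples = [
            {"camera_feature": torch.zeros(3, 256, 1024)},
            {"camera_feature": torch.zeros(3, 256, 1024)},
            {"camera_feature": torch.zeros(1, 256, 1024)},  # mismatched sample
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


loader = DataLoader(DummyFeatureDataset(), batch_size=3)
next(iter(loader))
# RuntimeError: stack expects each tensor to be equal size,
# but got [3, 256, 1024] at entry 0 and [1, 256, 1024] at entry 2
```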
Hi @bazyagami,
Did you make any modifications to the transfuser code, the training script or any other part of navsim before starting the training? Can you please share which versions of pytorch and pytorch-lightning you are using? Did you generate a training cache before starting the training?
No modifications were made except for the change to the number of workers and prefetch factor as mentioned above.
torch - 2.0.1
pytorch-lightning - 2.2.1
Yes, a training cache has been generated and it is saved inside the directory "training_cache". The only thing to note is that some sub-directories contain "ego_status_features" and some contain "transfuser_features", like this:
It seems like you used the same cache directory to also train the ego_status_mlp_agent. Could you try to generate a new training cache into a separate directory only for the transfuser model before running the training?
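One quick way to check whether a cache directory mixes features from different agents is to scan the per-token sub-directories and count the distinct layouts they contain. This is only a sketch: the <log>/<token> nesting is an assumption based on the description above, so adjust the globbing to your actual cache structure.

```python
# Sketch: group cached tokens by the set of entries they contain, to spot a
# cache that mixes ego_status_features and transfuser_features.
# Assumes a <cache_root>/<log>/<token>/ layout; adjust the glob if needed.
from collections import Counter
from pathlib import Path

cache_root = Path("training_cache")  # cache directory mentioned above

layouts = Counter()
for token_dir in cache_root.glob("*/*"):
    if token_dir.is_dir():
        entries = tuple(sorted(p.name for p in token_dir.iterdir()))
        layouts[entries] += 1

for entries, count in sorted(layouts.items(), key=lambda kv: -kv[1]):
    print(f"{count:6d} tokens contain: {list(entries)}")
# A clean transfuser-only cache should report a single layout for all tokens.
```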
Hey, I have done this and the error still persists. Could it be due to the way the features are handled in the transfuser_features.py file, by any chance?
Hi @bazyagami, I am unfortunately still unable to reproduce the issue. Can you please clarify the following questions:
- Which version of navsim are you using? Please provide the commit hash of the version used for training (`git rev-parse HEAD`).
- Please provide some details on the system you are using (OS, number of GPUs, etc.). Are you running the training on GPU?
- Which split are you using for training? Could you try to reproduce the error on the mini split?
- Does the error occur right at the beginning of the training or at some intermediate step?
- Does the error only occur for the transfuser model or also for other agents (e.g., ego-mlp-agent)?
8af06bd77c58396c675cc5f7e3255efa6f3ac7cf - here's the commit hash. I am using the trainval split with the navtrain scene filter. I will get back after running with mini! The error occurs right away when the first epoch (epoch 0) starts. I am training on an instance with 4x T4s. I previously ran the ego-mlp-agent model and it works fine; the error occurs only with the transfuser model.
Edit 1: it works with the mini split.
Hey, just following up on this, any updates?
Hi @bazyagami,
Unfortunately, I could still not reproduce the issue despite trying several setups.
Have you tried setting the batch-size to 1 for debugging?
Additionally, you could try to manually load the .gz files and inspect them for anomalies.
The error trace suggests that this could also be an issue with torch. You might want to consider raising an issue there.
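For the manual .gz inspection, a sketch along these lines could help locate the offending sample. It assumes the cache stores gzip-compressed pickled feature dictionaries; if your navsim version serializes the cache differently, swap out the loading code accordingly.

```python
# Sketch: walk the training cache, load every .gz feature file and flag any
# tensor whose shape differs from the first sample seen for that file name.
# Assumes gzip-compressed pickle payloads; adapt the loader if the cache
# format differs in your navsim version.
import gzip
import pickle
from pathlib import Path

import torch

cache_root = Path("training_cache")  # hypothetical path to the transfuser cache

reference_shapes = {}  # (file name, feature key) -> first shape seen
for gz_file in sorted(cache_root.rglob("*.gz")):
    with gzip.open(gz_file, "rb") as f:
        payload = pickle.load(f)
    if not isinstance(payload, dict):
        continue
    for key, value in payload.items():
        if not isinstance(value, torch.Tensor):
            continue
        ref = reference_shapes.setdefault((gz_file.name, key), tuple(value.shape))
        if tuple(value.shape) != ref:
            print(f"Shape mismatch in {gz_file}: {key} has shape "
                  f"{tuple(value.shape)}, expected {ref}")
```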