
Error while loading data for transfuser

Open bazyagami opened this issue 9 months ago • 8 comments

Hello, while attempting to run transfuser training (using ./run_transfuser_training.sh), I am getting the following error:

Traceback (most recent call last):
  File "/mnt/disks/data/sim2real/CVPR-challenge/navsim/navsim/navsim/planning/script/run_training.py", line 119, in main
    trainer.fit(
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 212, in advance
    batch, _, __ = next(data_fetcher)
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/pytorch_lightning/utilities/combined_loader.py", line 78, in __next__
    out[i] = next(self.iterators[i])
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 635, in __next__
    data = self._next_data()
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 679, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
    return self.collate_fn(data)
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 267, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in collate
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 142, in <listcomp>
    return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed]  # Backwards compatibility.
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in collate
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 127, in <dictcomp>
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 119, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/opt/conda/envs/navsim/lib/python3.9/site-packages/torch/utils/data/_utils/collate.py", line 164, in collate_tensor_fn
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [3, 256, 1024] at entry 0 and [1, 256, 1024] at entry 2

Before this, I was getting a different error (RuntimeError: Trying to resize storage that is not resizable) when the number of workers was 4 with a prefetch factor of 2, so I changed them to 0 and None. Is there a fix you could suggest for this?

bazyagami avatar May 08 '24 07:05 bazyagami
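
For anyone debugging a similar collate failure, the sketch below is one possible way to locate the offending sample and key. It is not how navsim wires its DataLoader; it assumes each dataset item is a tuple of feature/target dictionaries (which matches the list-then-dict collate path in the trace above) and simply prints every tensor shape before delegating to torch's stock collate.

```python
# Hypothetical debugging helper, not part of navsim: wrap torch's default_collate
# so per-sample tensor shapes are printed before torch.stack can fail.
import torch
from torch.utils.data import DataLoader, default_collate


def debug_collate(batch):
    # Assumption: each item is a tuple of dicts (e.g. features, targets),
    # matching the list -> dict collate path in the traceback above.
    for i, sample in enumerate(batch):
        for part in sample:
            if isinstance(part, dict):
                for key, value in part.items():
                    if torch.is_tensor(value):
                        print(f"sample {i}, key '{key}': {tuple(value.shape)}")
    return default_collate(batch)


# Usage sketch: pass it to the training DataLoader.
# loader = DataLoader(dataset, batch_size=4, num_workers=0, collate_fn=debug_collate)
```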

Hi @bazyagami,

Did you make any modifications to the transfuser code, the training script or any other part of navsim before starting the training? Can you please share which versions of pytorch and pytorch-lightning you are using? Did you generate a training cache before starting the training?

mh0797 avatar May 08 '24 07:05 mh0797

No modifications were made except for the change to the number of workers and prefetch factor mentioned above. Versions: torch 2.0.1, pytorch-lightning 2.2.1. Yes, a training cache has been generated and it is saved inside the directory "training_cache". The only thing to note is that some sub-directories have "ego_status_features" and some have "transfuser_features", like this: [screenshot: github-issue-navsim]

bazyagami avatar May 08 '24 13:05 bazyagami

It seems like you used the same cache directory to also train the ego_status_mlp_agent. Could you try to generate a new training cache into a separate directory only for the transfuser model before running the training?

mh0797 avatar May 09 '24 08:05 mh0797
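
As a side note, one quick way to check whether an existing cache mixes features from different agents is to list which .gz files each token directory contains. The sketch below is hypothetical and assumes a layout of <cache_dir>/<log>/<token>/<feature>.gz; adjust the glob pattern if your cache is organized differently.

```python
# Hypothetical cache audit, assuming one sub-directory per token that holds one
# .gz file per feature/target builder (e.g. "transfuser_features.gz").
from collections import Counter
from pathlib import Path

cache_dir = Path("training_cache")  # adjust to your cache location
layouts = Counter()

for token_dir in cache_dir.glob("*/*"):
    if token_dir.is_dir():
        layouts[tuple(sorted(p.name for p in token_dir.glob("*.gz")))] += 1

# Each distinct tuple is one cache "layout"; seeing both an ego-status-only layout
# and a transfuser layout would confirm that two cachings were mixed in one directory.
for layout, count in layouts.most_common():
    print(count, layout)
```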

Hey, I have done this and the error still persists. Is it because of the way the features are being handled in the transfuser_features.py file, by any chance?

bazyagami avatar May 10 '24 00:05 bazyagami

Hi @bazyagami, I am unfortunately still unable to reproduce the issue. Can you please clarify the following questions:

  • Which version of navsim are you using? Please provide the commit hash of the version used for training (git rev-parse HEAD).
  • Please provide some details on the system you are using (OS, number of GPUs, etc.). Are you running the training on GPU?
  • Which split are you using for training? Could you try to reproduce the error on the mini split?
  • Does the error occur right at the beginning of the training or at some intermediate step?
  • Does the error only occur for the transfuser model or also for other agents (e.g., the ego-mlp-agent)?

mh0797 avatar May 10 '24 07:05 mh0797

8af06bd77c58396c675cc5f7e3255efa6f3ac7cf - here is the commit hash. I am using the trainval split with the navtrain scene filter; I will get back after running with mini! The error occurs right away when the first epoch (epoch 0) starts. I am training on an instance with 4x T4 GPUs. I previously ran the ego-mlp-agent model and it works fine; the error occurs only with the transfuser model.

Edit 1: it works with the mini split.

bazyagami avatar May 10 '24 14:05 bazyagami

Hey, just following up on this, any updates?

bazyagami avatar May 14 '24 12:05 bazyagami

Hi @bazyagami, unfortunately I still could not reproduce the issue despite trying several setups. Have you tried setting the batch size to 1 for debugging? You could also try to manually load the .gz files and inspect them for anomalies. The error trace suggests that this could also be an issue with torch, so you might want to consider raising an issue there.

mh0797 avatar May 14 '24 19:05 mh0797
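
Following up on the suggestion to inspect the cached .gz files, the sketch below is one possible way to do it, assuming the cache stores gzip-compressed pickled feature dictionaries (the actual serialization in navsim may differ). Any key that shows up with more than one tensor shape is a candidate for the torch.stack failure above.

```python
# Hypothetical inspection script, assuming gzip-compressed pickle payloads;
# swap the loading code for whatever serialization your cache actually uses.
import gzip
import pickle
from collections import defaultdict
from pathlib import Path

import torch

cache_dir = Path("training_cache")  # adjust to your cache location
shapes = defaultdict(set)

for gz_file in cache_dir.rglob("*.gz"):
    with gzip.open(gz_file, "rb") as f:
        data = pickle.load(f)
    if isinstance(data, dict):
        for key, value in data.items():
            if torch.is_tensor(value):
                shapes[key].add(tuple(value.shape))

# Report keys whose cached tensors disagree in shape across samples.
for key, observed in shapes.items():
    if len(observed) > 1:
        print(f"inconsistent shapes for '{key}': {sorted(observed)}")
```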