
Minimum VRAM requirements

Open · keishihara opened this issue 2 years ago · 4 comments

Hi Chen, thank you so much for sharing your amazing work!

I tried and was able to run your pretrained agent with the weight checkpoints in the weights folder. Now I have decided to reproduce your work from scratch following the training steps described in TRAINING.md. At the moment I am at the step of training the privileged motion planner, and I am getting a CUDA out of memory error with your dev_test version of the dataset, which I think is small enough to start with. Apparently, when I run python -m lav.train_bev, this line of code consumes a lot of GPU memory and causes the above error. By the way, my Ubuntu machine has two somewhat older Titan X cards with 12 GB of VRAM each.

I am wondering what graphics card specs are required to reproduce this work from scratch. Is my PC not sufficient for this, or could you tell me about your machine's specifications?

keishihara commented on May 25 '22

This might depend on your PyTorch/CUDA/cuDNN version. With the one I use, this line is necessary to prevent a CUDA error. If it works fine without it in your setup, you could comment that line out; I don't think it will affect the predictions. See: https://pytorch.org/docs/1.7.1/_modules/torch/nn/modules/rnn.html#RNNBase.flatten_parameters
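For context, here is a minimal sketch (not the LAV code itself, with arbitrary layer sizes) of how flatten_parameters() is used on a standalone PyTorch GRU; it just compacts the recurrent weights into one contiguous block for cuDNN:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Arbitrary sizes, purely for illustration.
gru = nn.GRU(input_size=64, hidden_size=128, num_layers=2, batch_first=True).to(device)

# Compacts the GRU weights into a single contiguous chunk of memory so cuDNN
# can run them efficiently; on some PyTorch/cuDNN combinations skipping this
# leads to warnings or CUDA errors, on others it is a harmless no-op.
gru.flatten_parameters()

x = torch.randn(8, 10, 64, device=device)
out, h = gru(x)
```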

I use 4 Titan Pascal GPUs.

dotchen commented on May 25 '22

Thank you for the quick reply! OK, I was able to run that script just now by setting the batch size to 128 instead of the default 512. So it appears the problem was actually a hardware limitation, since your 4 Titan Pascal GPUs have over 40 GB of VRAM in total.
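In case it helps anyone else hitting the same error, here is a minimal, self-contained sketch of the kind of change that fixed it for me (the dataset and loader are placeholders, not LAV's actual training code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; activation memory during training grows roughly
# linearly with batch size, so reducing it is the usual first fix for OOM.
dataset = TensorDataset(torch.randn(1024, 3, 64, 64), torch.randint(0, 10, (1024,)))

# loader = DataLoader(dataset, batch_size=512, shuffle=True)  # default: OOM on a 12 GB card
loader = DataLoader(dataset, batch_size=128, shuffle=True)    # reduced: fits in memory
```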

I have another question regarding your Dockerfile. As I understand it, this Docker image is only for running evaluations of trained agents, not for training them, right? If you have one for training, could you share that too? Or did you train on the host machine rather than in a container?

keishihara commented on May 25 '22

Hello,

We did not use Docker for training but used a conda env to manage the dependencies. Let me know if you have any issues with the dependencies and I am happy to take a look.

dotchen commented on Jun 16 '22

Hi, thank you for your message.

I managed to create a Docker image for training and to maintain the dependencies, and all of the modules provided in the lav folder appear to be working fine. However, while the individual training logs in wandb looked fine, when I use a segmentation model that I trained myself to perform point painting or to train the full model, the segmentation performance drops quite a lot, probably due to the seg_model.eval() call, like here:

https://github.com/dotchen/LAV/blob/dc9b4cfca39abd50c7438e8749d49f6ac0fe5e4e/lav/data_paint.py#L57

I wonder if you experienced the same issue when you trained the models provided in the weights folder.

This seems to be related to how BatchNorm / Dropout switch behavior between training and evaluation mode, but I haven't been able to figure it out yet. Do you have any idea about this?
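For concreteness, here is a small self-contained sketch (unrelated to the LAV code) of the behavior I mean: in train() mode BatchNorm normalizes with the current batch statistics, while in eval() mode it uses the running estimates accumulated during training, so if the running stats don't match the data the two modes can give very different outputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(8)  # running_mean/running_var start at 0/1 and only update in train() mode

# Inputs whose statistics are far from the default running estimates.
x = 5.0 + 2.0 * torch.randn(16, 8, 32, 32)

bn.train()
y_train = bn(x)  # normalized with this batch's mean/var (running stats get a small update)

bn.eval()
y_eval = bn(x)   # normalized with the barely-updated running_mean/running_var

print(y_train.mean().item(), y_eval.mean().item())  # noticeably different after one step
```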

keishihara commented on Jun 23 '22