
Evaluation consumes all memory and leads to failure

Open woolpeeker opened this issue 2 years ago • 5 comments

I started a CARLA Docker image and copied the LAV code into the container. RouteScenario_1 ran for around 2 hours and then exited with the message "RuntimeError: Timeout: Agent took too long to setup". I checked the program's memory usage, and it consumes all 64 GB of RAM.

The evaluation command:

ROUTES=leaderboard/data/routes_testing.xml ./leaderboard/scripts/run_evaluation.sh

CARLA runs in headless mode:

SDL_VIDEODRIVER=offscreen SDL_HINT_CUDA_DEVICE=0 ./CarlaUE4.sh -ResX=800 -ResY=600 -nosound -windowed -opengl

The -vulkan flag causes an immediate exit. The LAV README says CARLA should be run with the -vulkan flag. Could that be the problem, or do you have any other possible clues in mind?

woolpeeker avatar Apr 07 '22 15:04 woolpeeker

Have you pinned down that the issue is actually OOM? This feels like it has something to do with OpenGL, but I am not sure. To help more I need more info: what is your max RAM, and does the CARLA server throw any errors?
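To check it yourself, a small monitoring script along these lines could help log per-process memory over time (a sketch only; it assumes psutil is installed, and the "CarlaUE4" / "leaderboard_evaluator" process-name substrings are guesses that may need adjusting):

import time
import psutil

# Substrings to watch for in process names or command lines; adjust to your setup.
WATCH = ("CarlaUE4", "leaderboard_evaluator", "run_evaluation")

def log_rss():
    for p in psutil.process_iter(["pid", "name", "cmdline", "memory_info"]):
        cmd = " ".join(p.info["cmdline"] or [])
        mi = p.info["memory_info"]
        if mi is None:
            continue
        if any(k in (p.info["name"] or "") or k in cmd for k in WATCH):
            print(f"{p.info['pid']:>7}  {mi.rss / 2**30:6.1f} GiB  {p.info['name']}")

while True:  # log every 30 seconds until interrupted
    log_rss()
    print("---")
    time.sleep(30)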

dotchen avatar Apr 07 '22 20:04 dotchen

There is no OOM error reported. The system is Ubuntu 18.04 and the evaluation is running in a Docker container. My computer has 64 GB of RAM; the GPU is a 2080 Ti with 12 GB.

The evaluation of route 0 lasted around 2 hours (I don't remember the precise time). The last reported sim_time was around 500 seconds. The evaluation program reported the following:

Carla Could not set up the required agent:
> Timeout: Agent took too long to setup
Watchdog exception - Timeout of 59.0 seconds occurred

Then the evaluation program tries to evaluate the remaining routes and immediately reports the same error for each one.

When this was reported, I found the computer was very slow and memory usage had reached 64 GB. The CARLA process consumes around 20 GB, and the rest is used by the evaluation process. The system began to use swap, so I stopped the program.

The Carla server does not throw any errors.
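(As a side note, a quick client check like the one below, using the standard CARLA Python API with localhost and the default port 2000 assumed, is one way to verify the server is still responsive while the agent setup hangs:)

import carla

client = carla.Client('localhost', 2000)   # default host/port assumed
client.set_timeout(10.0)
print(client.get_server_version())         # raises RuntimeError if the server is unreachable
print(client.get_world().get_map().name)   # also touches the currently loaded map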

The evaluation time for route 0 exceeds 90 minutes. Do you know whether that is normal?

woolpeeker avatar Apr 08 '22 02:04 woolpeeker

I changed the eval command to ROUTES=assets/routes_lav_valid.xml ./leaderboard/scripts/run_evaluation.sh and it successfully finished the route 0 test, which took about 1 hour.

The memory usage is also huge, reaching a maximum of 37 GB. If I count CARLA's memory usage as well, it is 57 GB, which is close to the limit of my computer.

woolpeeker avatar Apr 08 '22 05:04 woolpeeker

Hmm, I don't really have anything in mind that might help you. Since you mentioned you use Docker, maybe try our Docker image recipe and see if it makes a difference. I have uploaded it to the repo: https://github.com/dotchen/LAV/blob/main/Dockerfile

dotchen avatar Apr 09 '22 03:04 dotchen

When I run train_full.py, the following CUDA OOM error is reported:

lat_features = features.expand(N, *features.size()).permute(1, 0, 2, 3, 4).contiguous()[typs]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.78 GiB (GPU 0; 7.79 GiB total capacity; 1.48 GiB already allocated; 2.78 GiB free; 1.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.

My GPU has 8 GB. How could I fix it? I have already reduced the batch size and the number of workers, and tried to clear the GPU cache, but it still throws this error.
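For what it's worth, the allocation in the traceback comes from the .contiguous() call: expand() only builds a view, but .contiguous() materializes the full N x N feature tensor before [typs] discards most of it. One possible workaround is to select the needed rows first and only then broadcast (a sketch only, assuming features has shape (N, C, H, W) and typs masks or indexes its first dimension; later code may still require an explicit contiguous copy):

# Hypothetical rewrite of the line from the traceback, not the authors' code.
# Original:
#   lat_features = features.expand(N, *features.size()).permute(1, 0, 2, 3, 4).contiguous()[typs]
# Select the needed rows first, then expand as a view, so nothing is copied yet:
lat_features = features[typs].unsqueeze(1).expand(-1, N, *features.shape[1:])
# If downstream code needs a contiguous tensor, this only copies the selected
# rows instead of the full N x N block:
# lat_features = lat_features.contiguous()

Setting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 before launching, as the error message itself suggests, may also help if fragmentation rather than total memory is the problem.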

viola521 avatar Jul 07 '23 17:07 viola521