
RuntimeError: CUDA out of memory.

Open uysalfurkan opened this issue 2 years ago • 28 comments

Hi, I am working in Kaggle with a custom dataset. I got a RuntimeError and I have no idea how to solve it. Can you help me?

RuntimeError: CUDA out of memory. Tried to allocate 2.44 GiB (GPU 0; 15.90 GiB total capacity; 9.95 GiB already allocated; 1.70 GiB free; 13.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
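For reference, the PYTORCH_CUDA_ALLOC_CONF hint at the end of the error can be applied like this; a minimal sketch, assuming it is set before the first CUDA allocation, and it only mitigates fragmentation rather than guaranteeing a fix:

```python
import os

# Must be set before the first CUDA allocation (e.g. at the very top of train.py
# or the notebook). 128 MB is an arbitrary starting value, not a recommendation
# from this repository.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```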

uysalfurkan avatar Nov 17 '22 13:11 uysalfurkan

@uysalfurkan Can you try reducing the batch size? I think that should solve the issue.

sovit-123 avatar Nov 17 '22 14:11 sovit-123

@sovit-123 Firstly, thank you for your answer. Yes, it helps, but I need to compare YOLOv5 and Faster R-CNN in my task.

In YOLOv5, training with batch_size=64 gave me the best results, so I need to complete this process with batch_size=64. Is that possible?

uysalfurkan avatar Nov 17 '22 14:11 uysalfurkan

@uysalfurkan Try a smaller batch size, in my opinion. Faster RCNN models are generally larger than YOLOv5 models. From my experience, Faster RCNN models give good results with batch size 4 as well. Please try that. And if you like the library, I would surely love to get some feedback from you.

sovit-123 avatar Nov 17 '22 14:11 sovit-123

Thanks for your advice. Yes, I like it; you have prepared a very nice and clear library. I clearly understand the code even though I am new to object detection.

I need to add some new lines to your source code to obtain a more detailed results (CSV) file that contains performance metrics for both validation and training. I will contact you if I face a problem.

Thank you!

uysalfurkan avatar Nov 17 '22 14:11 uysalfurkan

Thank you for the feedback. @uysalfurkan A CSV file with the mAP metric is already saved. Let me know what other things you want to track in the CSV file, and I will add them as part of the library as well.

sovit-123 avatar Nov 17 '22 15:11 sovit-123

My purpose is to visualize the metrics (mAP [0.5 and 0.5_0.9], recall, precision, loss [object and box]) of both validation and training per epoch on the same plot. To do that, I need a CSV file which contains these metrics and the epoch number.

Additionally, it would be perfect to have a result_info.txt file that is generated at the end of the training process and contains hyperparameters such as learning rate, optimizer name, and batch size, as well as model information such as the pre-trained model and backbone version.

It would be great if you could add all of these things as part of the library. I am trying to add them myself, but I'm having trouble figuring out in which file the performance results are generated and what the variable names are.

Thank you!

uysalfurkan avatar Nov 17 '22 16:11 uysalfurkan

@uysalfurkan Hi. Some of the things, like an opt.yaml file containing all the hyperparameters and the model name, are already saved to the results directory. Some of the other things, like the validation losses, are a bit difficult to add because the PyTorch Faster RCNN models don't output any loss values in eval() mode. A lot of things apart from the validation loss are already saved, and I will try to add the rest, but it may take some time as I am the only person working on this project.

In the meantime, I can add the mAP values and all the training losses to the CSV file.

sovit-123 avatar Nov 18 '22 00:11 sovit-123

@uysalfurkan I have also updated the WandB logging to plot everything per epoch instead of per iteration, which is a bit easier to interpret.

sovit-123 avatar Nov 18 '22 02:11 sovit-123

@sovit-123 Hi, thank you for your interest. I am waiting for the updates. I have attached an example of the CSV file I need to get: results_csv_example.csv

By the way, can I modify the train.py file from the notebook? For example, to change the learning rate value or to add a new plot function with the confidence score to the annotation.py file.

This is easy when I work locally, but how can I do it in the Kaggle environment? (For example, we replace the YAML file with %%writefile.)

uysalfurkan avatar Nov 18 '22 09:11 uysalfurkan

@uysalfurkan The CSV file update has been made. As of now it is slightly different from what you are asking for: it has the four losses, mAP @ 0.5, and mAP @ 0.50:0.95.

And yes, you can use the %%writefile method to overwrite the files with your own modifications. It will work.
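For illustration, a minimal sketch of the pattern in a Kaggle notebook cell; the file name and key below are placeholders, not this repository's actual config:

```python
%%writefile my_overrides.yaml
# Everything below the %%writefile magic replaces the file on disk, so paste the
# complete edited contents of the file you want to overwrite here (for example a
# dataset YAML or a modified train.py). The file name and key are hypothetical.
LEARNING_RATE: 0.001
```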

sovit-123 avatar Nov 18 '22 09:11 sovit-123

@sovit-123 I ran a training, but the generated CSV file has just 8 columns, which are: epoch | map | map_05 | train loss | train cls loss | train box reg loss | train obj loss | train rpn loss

I need to get all of the columns in the CSV that I attached above. I need to see the train and validation performance on the same plot to analyze the epochs and overfitting. Did you make another update for that?

Or which .py file should I focus on to make these updates myself?

Thank you!

uysalfurkan avatar Nov 18 '22 09:11 uysalfurkan

@uysalfurkan I am not analyzing the validation values yet. It will take some time. It requires modifying the validation function inside the engine.py script, because the Faster RCNN models do not output any validation loss values in model.eval() mode. They do so only in model.train() mode.
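For reference, one common workaround (a rough sketch, not the repository's engine.py) is to keep the model in train() mode during validation so that it returns the loss dict, while disabling gradients:

```python
import torch

@torch.no_grad()
def validation_loss(model, data_loader, device):
    # train() mode is needed because torchvision detection models only return
    # the loss dict when given targets in training mode. Their backbones use
    # FrozenBatchNorm2d, so this usually has no side effects, but verify for
    # your own model.
    model.train()
    totals = {}
    for images, targets in data_loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)  # e.g. loss_classifier, loss_box_reg, ...
        for name, value in loss_dict.items():
            totals[name] = totals.get(name, 0.0) + value.item()
    return {name: value / len(data_loader) for name, value in totals.items()}
```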

sovit-123 avatar Nov 18 '22 10:11 sovit-123

@sovit-123 OK, I'm waiting. I would be very happy if you let me know when the modification is finished. Also, it would be great if the train mAP values were in the file as well as the validation ones.

uysalfurkan avatar Nov 18 '22 10:11 uysalfurkan

@uysalfurkan mAP is already a validation metric. In object detection, we calculate mAP on the validation dataset only, which is the case with this code base as well. I hope this helps.

sovit-123 avatar Nov 18 '22 11:11 sovit-123

@sovit-123 Hi, I'm a little confused after this conversation. How can I tell whether there is overfitting or not from your results CSV?

uysalfurkan avatar Nov 21 '22 08:11 uysalfurkan

@uysalfurkan In object detection, you can tell that a model is overfitting when the mAP starts decreasing instead of increasing. mAP is always calculated on the validation dataset.
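As an illustration, a small sketch for inspecting this from the results CSV (the column names are the ones listed earlier in this thread; the file path is an assumption): mAP flattening or falling while the training loss keeps dropping is a typical overfitting signal.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Path is an assumption -- point this at the results CSV produced by your training run.
df = pd.read_csv("outputs/results.csv")

fig, ax_map = plt.subplots()
ax_map.plot(df["epoch"], df["map"], label="mAP@0.50:0.95")
ax_map.plot(df["epoch"], df["map_05"], label="mAP@0.50")
ax_map.set_xlabel("epoch")
ax_map.set_ylabel("mAP")

ax_loss = ax_map.twinx()  # second y-axis so loss and mAP share the same plot
ax_loss.plot(df["epoch"], df["train loss"], color="gray", label="train loss")
ax_loss.set_ylabel("loss")

fig.legend(loc="upper center")
plt.show()
```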

sovit-123 avatar Nov 21 '22 12:11 sovit-123

Hi @sovit-123

I got the log below when I ran train.py with create_fasterrcnn_model as the model.

UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. cpuset_checked))

Traceback (most recent call last):
  File "train.py", line 505, in <module>
    main(args)
  File "train.py", line 258, in main
    build_model = create_model[args['model']]
KeyError: 'create_fasterrcnn_model'

uysalfurkan avatar Dec 04 '22 16:12 uysalfurkan

Hello @uysalfurkan The above warning says that there are 2 cores in the CPU but you are trying to use 4 workers. I think this is a common warning on Kaggle. For now, you may ignore it, or pass --workers 2 to the training command if you want to avoid the warning.

Regarding the KeyError: you need to pass a valid model name key to the --model flag in the train.py command. It looks like you have passed create_fasterrcnn_model as the key, which is not valid. By default the key is fasterrcnn_resnet50_fpn_v2. You may also pass a model name key like this: python train.py --model fasterrcnn_resnet50_fpn <rest of the command>. You can find all the model name keys that you can pass here: https://github.com/sovit-123/fasterrcnn-pytorch-training-pipeline#A-List-of-All-Model-Flags-to-Use-With-the-Training-Script

sovit-123 avatar Dec 05 '22 02:12 sovit-123

@sovit-123 Hi, I want to get mAP_0.5:0.90 rather than mAP_0.5:0.95. How can I change the code?

uysalfurkan avatar Dec 30 '22 11:12 uysalfurkan

@uysalfurkan Hello, that would require changing the pycocotools code, but at the moment I cannot say for sure where exactly the code needs to change.
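For anyone who wants to attempt it, here is a rough sketch (not this repository's code) of one way to restrict the IoU thresholds with pycocotools' COCOeval; the helper name and usage are assumptions:

```python
import numpy as np
from pycocotools.cocoeval import COCOeval

def evaluate_map_05_090(coco_gt, coco_dt):
    """Compute AP averaged over IoU 0.50-0.90 instead of the default 0.50-0.95."""
    coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
    # Default thresholds are np.linspace(0.5, 0.95, 10); override them here.
    coco_eval.params.iouThrs = np.linspace(0.5, 0.90, 9, endpoint=True)
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()      # prints the COCO-style metrics table
    return coco_eval.stats[0]  # AP averaged over the custom IoU range
```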

sovit-123 avatar Dec 30 '22 13:12 sovit-123

Hi again,

After starting to run the train.py command, I got the error below:

OSError: /opt/conda/lib/python3.7/site-packages/nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtHSHMatmulAlgoInit, version libcublasLt.so.11

How can I fix this?

uysalfurkan avatar Dec 30 '22 20:12 uysalfurkan

@uysalfurkan Looks like a CUDA issue. Which GPU do you have? Is it an RTX or GTX GPU?

sovit-123 avatar Dec 31 '22 01:12 sovit-123

I am using Kaggle GPUs.

  • NVIDIA TESLA P100 GPU
  • TESLA T4 x2

uysalfurkan avatar Jan 02 '23 06:01 uysalfurkan

@uysalfurkan Ok, I understand the issue now. It looks like pip install -r requirements.txt is installing torch 1.13.1, which has issues with CUDA on Kaggle. This is because of this line in the file: torch>=1.12.0, !=1.13.0. For now, PyTorch 1.12.0 works best. I will update the requirements file to torch==1.12.0 by the end of the day. You may also manually install it in the Kaggle environment and it will work fine.
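For example, as a Kaggle notebook cell (pairing torchvision 0.13.0 with torch 1.12.0 is my assumption here; check the official compatibility table for your environment):

```python
# Pin the working PyTorch release before installing the rest of the requirements.
!pip install torch==1.12.0 torchvision==0.13.0
```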

sovit-123 avatar Jan 02 '23 10:01 sovit-123

Hi @sovit-123 thanks for your fast replies.

I got the error below and could not figure it out. I set the epoch number to 100, but at epoch 55 the process fails with: RuntimeError: DataLoader worker (pid 18111) is killed by signal: Killed.

uysalfurkan avatar Jan 11 '23 11:01 uysalfurkan

@uysalfurkan Are you running on Kaggle?

sovit-123 avatar Jan 11 '23 13:01 sovit-123

@sovit-123 Yes I am running on Kaggle

uysalfurkan avatar Jan 11 '23 14:01 uysalfurkan

@uysalfurkan I was also facing this issue yesterday, but I had never seen it before. I still need to debug it. Can you try --workers 2 and train again?

sovit-123 avatar Jan 11 '23 14:01 sovit-123