
RuntimeError: CUDA out of memory // Requirements on graphics card?

Open · HartmannSa opened this issue 4 years ago · 10 comments

Hi,

while executing python -m cosypose.scripts.run_cosypose_eval --config tless-siso I receive the following error message:

RuntimeError: CUDA out of memory. Tried to allocate 1.35 GiB (GPU 0; 5.93 GiB total capacity; 1.47 GiB already allocated; 866.50 MiB free; 36.31 MiB cached)

According to my internet research, reducing the batch size is recommended. However, I don't know where to set it, and to my understanding the batch size shouldn't matter for this command, since I am using an already pre-trained network?!

Could the cause of the error be that there are certain hardware requirements for reproducing the results? I am using Ubuntu 18.04.5 LTS and an NVIDIA GeForce GTX 1060 6GB (and the nvidia-driver-450).

Here is a larger part of my terminal output:

1:06:35.398140 - Scene: [6]
1:06:35.398203 - Views: [359]
1:06:35.398260 - Group: [2732]
1:06:35.398285 - Image has 5 gt detections. (not used)
1:06:35.701966 - Pose prediction on 4 detections (n_iterations=1): 0:00:00.063503
1:06:35.954221 - Pose prediction on 4 detections (n_iterations=4): 0:00:00.250793
1:06:35.720832 - --------------------------------------------------------------------------------
100%|███████████████████████████████████████████████████████████| 10080/10080 [1:06:24<00:00, 2.53it/s]
1:06:47.763242 - Done with predictions
100%|█████████████████████████████████████████████████████████████| 10080/10080 [39:28<00:00, 4.26it/s]
1:46:18.765271 - Skipped: pix2pose_detections/coarse/iteration=1 (N=50023)
1:46:18.765351 - Skipped: pix2pose_detections/refiner/iteration=1 (N=50023)
1:46:18.765377 - Skipped: pix2pose_detections/refiner/iteration=2 (N=50023)
1:46:18.765398 - Skipped: pix2pose_detections/refiner/iteration=3 (N=50023)
1:46:18.765419 - Evaluation : pix2pose_detections/refiner/iteration=4 (N=50023)
0%| | 0/10080 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/rosmatch/anaconda3/envs/cosypose/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/rosmatch/anaconda3/envs/cosypose/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rosmatch/cosypose/cosypose/scripts/run_cosypose_eval.py", line 491, in <module>
    main()
  File "/home/rosmatch/cosypose/cosypose/scripts/run_cosypose_eval.py", line 433, in main
    eval_metrics[preds_k], eval_dfs[preds_k] = eval_runner.evaluate(preds)
  File "/home/rosmatch/cosypose/cosypose/evaluation/eval_runner/pose_eval.py", line 67, in evaluate
    meter.add(obj_predictions, obj_data_gt.to(device))
  File "/home/rosmatch/cosypose/cosypose/evaluation/meters/pose_meters.py", line 172, in add
    cand_infos['label'].values)
  File "/home/rosmatch/cosypose/cosypose/evaluation/meters/pose_meters.py", line 101, in compute_errors_batch
    errors.append(self.compute_errors(TXO_pred, TXO_gt, labels_))
  File "/home/rosmatch/cosypose/cosypose/evaluation/meters/pose_meters.py", line 70, in compute_errors
    dists = dists_add_symmetric(TXO_pred, TXO_gt, points)
  File "/home/rosmatch/cosypose/cosypose/lib3d/distances.py", line 16, in dists_add_symmetric
    dists_norm_squared = (dists ** 2).sum(dim=-1)
RuntimeError: CUDA out of memory. Tried to allocate 1.35 GiB (GPU 0; 5.93 GiB total capacity; 1.47 GiB already allocated; 866.50 MiB free; 36.31 MiB cached)
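The traceback shows the allocation failing inside dists_add_symmetric, which materializes a large intermediate distance tensor for all candidate matches at once. Independent of cosypose's actual code, one generic way to cap such peaks is to process the batch axis in chunks; a minimal sketch of the idea using NumPy for illustration (the function name and shapes are hypothetical, not cosypose's API):

```python
import numpy as np

def mean_point_dists_chunked(pts_pred, pts_gt, chunk_size=1024):
    """Mean per-object point distance, computed chunk by chunk along the
    batch axis so the (chunk, n_points, 3) intermediate stays small."""
    n = pts_pred.shape[0]
    out = np.empty(n)
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        diff = pts_pred[start:end] - pts_gt[start:end]  # (chunk, P, 3)
        # same math as (dists ** 2).sum(dim=-1), then sqrt and mean
        out[start:end] = np.sqrt((diff ** 2).sum(axis=-1)).mean(axis=-1)
    return out

rng = np.random.default_rng(0)
pred = rng.normal(size=(50, 8, 3))
gt = rng.normal(size=(50, 8, 3))
full = np.sqrt(((pred - gt) ** 2).sum(axis=-1)).mean(axis=-1)
assert np.allclose(mean_point_dists_chunked(pred, gt, chunk_size=16), full)
```

The chunked result is numerically identical to the all-at-once version; only peak memory changes, which is why shrinking batch sizes and worker counts (as suggested below) can fix the OOM without affecting accuracy.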

HartmannSa avatar Nov 26 '20 14:11 HartmannSa

Hello, I have the same issue, did you fix it? Thanks a lot for your answer.

salimkhazem avatar Mar 24 '21 11:03 salimkhazem

Same here

JohannesAma avatar Apr 29 '21 10:04 JohannesAma

Reducing the batch size may be done by changing batch_size in run_pose_training.py.

yupei-git avatar May 06 '21 17:05 yupei-git

I solved this with the following changes:

  bullet_batch_renderer.py -> workers 8 to 1
  multiview_predictor.py -> batch size (nsym) 64 to 1
  run_bop_inference.py -> workers 8 to 1

JohannesAma avatar May 06 '21 17:05 JohannesAma

Same here. Is there any other suggestion? Unfortunately, Johannes's solution didn't work for me. @JohannesAma did it really work for you for the siso tless case?

AlexandraPapadaki avatar Jun 12 '21 22:06 AlexandraPapadaki

> Same here. Is there any other suggestion? Unfortunately, Johannes's solution didn't work for me. @JohannesAma did it really work for you for the siso tless case?

My NVIDIA card has 8 GB of memory; maybe yours is smaller and you have to reduce the batch size and workers in some more modules that are used in the siso tless case.

JohannesAma avatar Jun 16 '21 12:06 JohannesAma

I have the same problem and the suggested solution didn't work. Is there any other solution? Thanks in advance.

smoothumut avatar Feb 14 '22 12:02 smoothumut

I'm sorry, I don't know about another solution. Workers and batch size are the parameters that define the load on the graphics card. Maybe you have to set them even smaller.

JohannesAma avatar Feb 14 '22 13:02 JohannesAma

The main reason for this problem is that the dataset being evaluated is too large while the GPU running the program has less than 8 GB of memory. The root cause is this line of code in run_cosypose_eval.py (line 443): eval_metrics[preds_k], eval_dfs[preds_k] = eval_runner.evaluate(preds)

Possible Solution:

  1. Go to the "local_data" folder and delete some data. Then run pre-training; usually the results will be fine, and afterwards run the evaluation again.
  2. Stop using the GPU and move all data and models to the CPU (requires a lot of code changes and debugging).
  3. Modify the model to use AMP (automatic mixed precision). However, the workload is large, and it is easy to break the whole program if you are not careful.
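Whichever of the options above you take, a common defensive pattern is to catch the out-of-memory error and retry with a smaller chunk size (or fall back to the CPU). A toy Python sketch of that retry logic, with a stand-in compute function instead of real CUDA calls (all names here are illustrative):

```python
def run_with_fallback(compute, data, chunk_sizes=(4096, 1024, 256)):
    """Retry `compute` with progressively smaller chunk sizes, the way
    GPU code often catches an OOM error and falls back to smaller batches.
    (PyTorch raises RuntimeError for CUDA OOM; MemoryError stands in here.)"""
    last_err = None
    for chunk in chunk_sizes:
        try:
            return compute(data, chunk)
        except MemoryError as err:
            last_err = err  # remember the failure and try a smaller chunk
    raise last_err

def fake_compute(data, chunk):
    # Stand-in: pretend anything above 512 items at once exhausts memory.
    if chunk > 512:
        raise MemoryError("out of memory")
    return sum(data)

result = run_with_fallback(fake_compute, [1, 2, 3])
assert result == 6
```

The first two chunk sizes "fail" and the third succeeds; with real CUDA code you would additionally call torch.cuda.empty_cache() between retries to release the cached allocations.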

In fact, the point of running the evaluation is not only to verify that the published results are correct; the same model can be used to evaluate other datasets. The main part to modify is LOCAL_DATA_DIR.
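For reference, LOCAL_DATA_DIR is a path constant in cosypose's config module, and dataset directories are resolved relative to it. Pointing the evaluation at your own data would look roughly like this; the path below is a placeholder and the exact config layout may differ from your checkout:

```python
from pathlib import Path

# Illustrative only: in cosypose, LOCAL_DATA_DIR lives in cosypose/config.py.
# Replace the placeholder with the directory that holds your datasets.
LOCAL_DATA_DIR = Path('/path/to/your/local_data')  # hypothetical path

# Dataset locations are then derived from it, e.g. (name assumed):
BOP_DS_DIR = LOCAL_DATA_DIR / 'bop_datasets'
```

After changing the base path, re-run the evaluation script so it picks up the datasets from the new location.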

nturaymond avatar Mar 31 '22 07:03 nturaymond