segment-anything

CUDA memory error when batching

Open jefromson opened this issue 1 year ago • 10 comments

Using the batching method suggested in the prediction notebook, I am getting a memory error. Is this the correct method, or is there another way images should be batched?

Example to reproduce:

import torch
import torchvision

print("PyTorch version:", torch.__version__)
print("Torchvision version:", torchvision.__version__)
print("CUDA is available:", torch.cuda.is_available())

from segment_anything import sam_model_registry
import numpy as np

model_checkpoint = '/path/to/sam_vit_b_01ec64.pth'
sam = sam_model_registry["vit_b"](checkpoint=model_checkpoint).to(device='cuda')

images = np.zeros((10, 256, 256, 3), dtype='uint8')

batched_input = []
for i in range(images.shape[0]):
    batched_input.append(
        {
            'image': torch.as_tensor(images[i], device=sam.device).permute(2, 0, 1).contiguous(),
            'original_size': images[i].shape[:2],
        }
    )

batched_output = sam(batched_input, multimask_output=False)

Traceback:

PyTorch version: 1.13.1.post200
Torchvision version: 0.14.1a0+59d9189
CUDA is available: True
Traceback (most recent call last):
  File "/home/user/.config/JetBrains/PyCharmCE2023.1/scratches/20230415_SAM_batched.py", line 25, in <module>
    batched_output = sam(batched_input, multimask_output=False)
  File "/home/user/miniconda3/envs/sam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/miniconda3/envs/sam/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/git/segment-anything/segment_anything/modeling/sam.py", line 98, in forward
    image_embeddings = self.image_encoder(input_images)
  File "/home/user/miniconda3/envs/sam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/git/segment-anything/segment_anything/modeling/image_encoder.py", line 112, in forward
    x = blk(x)
  File "/home/user/miniconda3/envs/sam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/git/segment-anything/segment_anything/modeling/image_encoder.py", line 174, in forward
    x = self.attn(x)
  File "/home/user/miniconda3/envs/sam/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/git/segment-anything/segment_anything/modeling/image_encoder.py", line 234, in forward
    attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W))
  File "/home/user/git/segment-anything/segment_anything/modeling/image_encoder.py", line 358, in add_decomposed_rel_pos
    attn.view(B, q_h, q_w, k_h, k_w) + rel_h[:, :, :, :, None] + rel_w[:, :, :, None, :]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.50 GiB (GPU 0; 10.91 GiB total capacity; 9.14 GiB already allocated; 69.44 MiB free; 9.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Process finished with exit code 1

jefromson avatar Apr 16 '23 04:04 jefromson

Maybe you need a better GPU, or reduce your batch size. torch.utils.data.Dataset is an easier way to make a batched dataset.
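A minimal sketch of that idea (illustrative only, not from the repo; it assumes the same input-dict format used in the original post and a sam model already loaded on CUDA):

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SamImageDataset(Dataset):
    """Wraps an array of HxWx3 uint8 images as SAM batched-input dicts."""
    def __init__(self, images):
        self.images = images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        return {
            'image': torch.as_tensor(image).permute(2, 0, 1).contiguous(),
            'original_size': image.shape[:2],
        }

images = np.zeros((10, 256, 256, 3), dtype='uint8')
# collate_fn=list keeps each batch as a list of dicts, which is the format sam() expects.
loader = DataLoader(SamImageDataset(images), batch_size=2, collate_fn=list)

# for batch in loader:
#     batch = [{**d, 'image': d['image'].to('cuda')} for d in batch]
#     batched_output = sam(batch, multimask_output=False)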

Dongjie-Cheng avatar Apr 17 '23 02:04 Dongjie-Cheng

To be more specific, I can run three 256x256x3 images through this batching script, but 4+ fails. Each individual image seems to take up about 3 GB of GPU RAM, which seems odd for relatively small images and makes me think I am initializing multiple sessions instead of batching inputs within one session, as the prediction example suggests.

Thanks!

jefromson avatar Apr 17 '23 03:04 jefromson

How can I improve GPU memory usage? My GPU has 24268 MiB, but less than 8000 MiB is used. I used ONNX to find the masks, not the .pth checkpoint (scripts/amg.py).

Aatroy avatar Apr 17 '23 07:04 Aatroy

- name: my-app-install token
  id: my-app
  uses: getsentry/action-github-app-token@v2
  with:
    app_id: ${{ secrets.APP_ID }}
    private_key: ${{ secrets.APP_PRIVATE_KEY }}

- name: Checkout private repo
  uses: actions/checkout@v2
  with:
    repository: getsentry/my-private-repo
    token: ${{ steps.my-app.outputs.token }}

HIMANSHUSINGHYANIA avatar Apr 17 '23 16:04 HIMANSHUSINGHYANIA

Have you solved this problem yet?

cosmosmosco avatar Apr 24 '23 03:04 cosmosmosco

Hello, I have the same issue. Did you find a solution? Thank you

widedh avatar Apr 25 '23 13:04 widedh

I have not found a solution here yet.

jefromson avatar Apr 25 '23 18:04 jefromson

I have done some reverse engineering:

Reduce points_per_batch (the default is 64); it controls how many prompt points are run through the model at once, so lowering it trades speed for GPU memory.

from segment_anything import SamAutomaticMaskGenerator
mask_generator = SamAutomaticMaskGenerator(sam, points_per_batch=8)

Try reducing the image size.

You can also try different dtype values when moving the model (there are multiple options; see the code at https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/module.py), as well as the non_blocking flag. Examples: torch.float16, torch.float32, torch.float64, torch.half, torch.cdouble, torch.complex128, ...:

sam = sam_model_registry[KEY_model_type](checkpoint=sam_checkpoint)
gpu1 = torch.device("cuda:1")
sam.to(gpu1, dtype=torch.half, non_blocking=True)
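An alternative that avoids converting the weights themselves (a sketch, not something documented in this repo; it assumes sam and batched_input as defined in the original post) is to run only the forward pass under mixed-precision autocast:

import torch

# Weights stay in float32; the forward pass runs in float16 where it is safe to do so.
with torch.inference_mode(), torch.autocast(device_type='cuda', dtype=torch.float16):
    batched_output = sam(batched_input, multimask_output=False)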

It can also help to change max_split_size_mb; try different values, e.g. 512, 256, 128, ...

import os
# Set this before the first CUDA allocation, or it may have no effect.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'  # or 'garbage_collection_threshold:0.8,max_split_size_mb:512'

It may also help to clear the cache at startup.

import torch
import gc

# https://stackoverflow.com/questions/59129812/how-to-avoid-cuda-out-of-memory-in-pytorch
torch.cuda.empty_cache()  # release cached blocks back to the driver
print(torch.cuda.memory_summary(device=None, abbreviated=False))  # inspect current usage
gc.collect()  # drop Python references that may still pin GPU memory

Finally, if nothing works and you have a very small GPU (2 GB), switch to the CPU. It takes about 3 minutes per image, but it works:

device = "cpu" # "cuda"
sam = sam_model_registry[KEY_model_type](checkpoint=sam_checkpoint)
sam.to(device=device)

Extra: some people say this process can be optimized with https://pypi.org/project/accelerate/, but I have not tried it.

Leci37 avatar May 02 '23 12:05 Leci37

Clearing the cache made the most difference for me, but there is still an issue when a processed image has a gorillion+ masks and a high resolution. It's a bit hard sometimes to predict how many masks an image will have. I've seen a wide variety from the three models, where the biggest difference has been about 90 masks between vit_h (207) and vit_l (117) for the same image.

pinksloyd avatar May 05 '23 01:05 pinksloyd

Thanks for the input, everyone.

I found that my initial issue was simply that I was building the full input list before calling the model; I needed to run inference and clear the batch every batch_size images.

Solution:

import torch
import torchvision

print("PyTorch version:", torch.__version__)
print("Torchvision version:", torchvision.__version__)
print("CUDA is available:", torch.cuda.is_available())

from segment_anything import sam_model_registry
import numpy as np

model_checkpoint = '/path/to/sam_vit_b_01ec64.pth'
sam = sam_model_registry["vit_b"](checkpoint=model_checkpoint).to(device='cuda')

images = np.zeros((10, 256, 256, 3), dtype='uint8')

batch_size = 3

batched_input = []
for i in range(images.shape[0]):
    batched_input.append(
        {
            'image': torch.as_tensor(images[i], device=sam.device).permute(2, 0, 1).contiguous(),
            'original_size': images[i].shape[:2],
        }
    )

    # Run the model once the batch is full, then reset it and release cached memory.
    if len(batched_input) == batch_size:
        batched_output = sam(batched_input, multimask_output=False)

        batched_input = []
        torch.cuda.empty_cache()

# Note: any leftover images (images.shape[0] % batch_size) are not processed by this loop.

SAM is simply a big model, and my batch size was surprisingly low with a 1080 Ti (11 GB).
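Here is a slightly fuller sketch of the same loop (illustrative, same assumptions as my snippets above; it also keeps the outputs and runs the final partial batch):

import numpy as np
import torch
from segment_anything import sam_model_registry

model_checkpoint = '/path/to/sam_vit_b_01ec64.pth'
sam = sam_model_registry["vit_b"](checkpoint=model_checkpoint).to(device='cuda')

images = np.zeros((10, 256, 256, 3), dtype='uint8')
batch_size = 3
all_outputs = []

with torch.inference_mode():
    # Step through the images in chunks of batch_size; the last chunk may be smaller.
    for start in range(0, images.shape[0], batch_size):
        batch = [
            {
                'image': torch.as_tensor(img, device=sam.device).permute(2, 0, 1).contiguous(),
                'original_size': img.shape[:2],
            }
            for img in images[start:start + batch_size]
        ]
        all_outputs.extend(sam(batch, multimask_output=False))
        torch.cuda.empty_cache()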

jefromson avatar May 10 '23 01:05 jefromson

I tried to run sam_vit_b for training on 1024-pixel document images, but it still hits a CUDA OOM when the process reaches batch 282; the GPU runs out at 14.78 GB on a Tesla T4 x2 on Kaggle. On CPU it works.

aihackervn avatar Aug 08 '23 14:08 aihackervn

@jefromson @Leci37 @Dongjie-Cheng Could you share the complete batch inference script? Thank you!

xxxming730 avatar Aug 26 '23 02:08 xxxming730

Even though I set the device to "cpu", I still have this issue; it's as if some other process is using the GPU.

lsangreg avatar Mar 12 '24 08:03 lsangreg