CUDA out of Memory with low Memory Utilization (CUDA error: device-side assert triggered)
🐛 Describe the bug
First of all, thanks for creating such a fantastic open-source production server.
I'm reaching out due to an unexpected issue I can't solve. I've been running a TorchServe server in production for over a year (several million requests per week) and it's been working great. However, a few weeks ago it started crashing every 1-5 days.
I enabled export CUDA_LAUNCH_BLOCKING=1, and it gives me CUDA error: device-side assert triggered and CUDA out of memory when I move my data to the GPU. I also log torch.cuda.max_memory_allocated() and torch.cuda.memory_allocated().
I thought some unique edge case caused a memory leak, some mismatched shapes or NaN values when I moved to the GPU, or allocating too much memory. However, the models use 6180 MiB / 23028 MiB, and torch.cuda.max_memory_allocated logs around 366 MB.
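A note on the two numbers above: nvidia-smi reports the total held on the device by all processes, including each worker's CUDA context and the caching allocator's reserved blocks, while torch.cuda.memory_allocated() only counts live tensors in the current process. A minimal sketch for logging both views side by side, assuming PyTorch is installed; the helper name cuda_memory_snapshot is hypothetical and not part of TorchServe:

```python
def cuda_memory_snapshot(device=0):
    """Return CUDA memory statistics in MiB, or an empty dict without CUDA.

    Hypothetical helper for illustration only.
    """
    try:
        import torch
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {}

    mib = 1024 ** 2
    # Device-wide view: free and total bytes across all processes on the GPU.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return {
        # Live tensors owned by this process.
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        "max_allocated_mib": torch.cuda.max_memory_allocated(device) / mib,
        # Blocks held by this process's caching allocator (live + cached).
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
        "max_reserved_mib": torch.cuda.max_memory_reserved(device) / mib,
        "device_free_mib": free_bytes / mib,
        "device_total_mib": total_bytes / mib,
    }
```

Logging this snapshot from each worker right before the transfer to the GPU would show whether the device as a whole is close to full even when a single worker's allocated figure is small.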
When I SSH into an instance that has crashed it looks like this:
The memory is at 6180 MiB, the GPU utilization flickers between 0-16%, and it gives me the CUDA error: device-side assert triggered, and CUDA out of memory.
Unfortunately, I can't find a way to reproduce the error. It happens at random every 1-5 days, and I have to reset the server and allocate a new instance. I've done everything I can think of to check the data before moving it to the GPU and to reduce any memory overload or potential memory leak.
Error logs
Installation instructions
Docker image:
- Ubuntu 20.04 including Python 3.8
- NVIDIA CUDA® 11.8.0
- NVIDIA cuBLAS
- NVIDIA cuDNN
- NVIDIA NCCL 2.15.5 (optimized for NVIDIA NVLink®)
- NVIDIA RAPIDS™ 22.10.01 (For x86, only these libraries are included: cudf, xgboost, rmm, cuml, and cugraph.)
- Apex
- rdma-core 36.0
- NVIDIA HPC-X 2.13
- OpenMPI 4.1.4+
- GDRCopy 2.3
- TensorBoard 2.9.0
- Nsight Compute 2022.3.0.0
- Nsight Systems 2022.4.2.1
- NVIDIA TensorRT™ 8.5.1
- Torch-TensorRT 1.1.0a0
- NVIDIA DALI® 1.20.0
- MAGMA 2.6.2
- JupyterLab 2.3.2 including Jupyter-TensorBoard
- TransformerEngine 0.3.0
Model Packaging
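import io

from PIL import Image

def create_pil_image(self, image_data):
    try:
        # The right-hand side was lost from the post; decoding the request
        # bytes with PIL is an assumption based on the surviving "RGB" argument.
        image = Image.open(io.BytesIO(image_data)).convert("RGB")
        return image
    except IOError as e:
        # If the image data is not valid or not provided, create a blank image.
        width, height = 776, 776  # Set desired width and height for the blank image
        color = (255, 255, 255)  # Set desired color for the blank image (white in this case)
        image = Image.new("RGB", (width, height), color)
        return image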
def preprocess_and_stack_images(self, images):
    preprocessed_images = []
    for i, img in enumerate(images):
        try:
            preprocessed_img = self.resize_tensor(img)
            if preprocessed_img.shape != (3, 768, 768) or preprocessed_img.min() < 0 or preprocessed_img.max() > 1:
                # Log information about the image that doesn't meet the requirements
                logger.info(f"Image {i} does not meet the requirements. Replacing with a blank image.")
                preprocessed_img = torch.zeros((3, 768, 768))
        except Exception as e:
            # Log the error message and load a blank image
            logger.error(f"Error occurred while processing Image {i}: {str(e)}. Loading a blank image.")
            preprocessed_img = torch.zeros((3, 768, 768))
        # Collect the result (this step is missing from the post but is needed for torch.stack below)
        preprocessed_images.append(preprocessed_img)
    images_batch = torch.stack(preprocessed_images, dim=0)
    if len(images_batch.shape) == 3:
        images_batch = images_batch.unsqueeze(0)
    return images_batch
def preprocess(self, data):
    images = []
    fns = []
    texts = []
    size = []
    merges = []
    org_images = []
    watermarks = []
    white_balance_list = []
    auto_color_list = []
    temperature_list = []
    saturation_list = []
    for row in data:
        image = row["image"]
        fn = self.decode_field(row["fn"])
        text = self.decode_field(row["text"])
        merged = self.decode_field(row["merged"])
        merged = merged.lower() == 'true'
        resolution = self.decode_field(row["resolution"])
        white_balance = self.decode_field(row["white_balance"])
        auto_color = self.decode_field(row["auto_color"])
        temperature = float(self.decode_field(row["temperature"]))
        saturation = float(self.decode_field(row["saturation"]))
        auto_color = auto_color == 'true'
        white_balance = white_balance == 'true'
        watermark = 'watermarked' in resolution
        if isinstance(image, str):
            logger.info("Image data should not be a string. Please provide the image data as bytes.")
            width, height = 224, 224  # Set desired width and height for the blank image
            color = (255, 255, 255)  # Set desired color for the blank image (white in this case)
            image = Image.new("RGB", (width, height), color)
        if isinstance(image, (bytearray, bytes)):
            image = self.create_pil_image(image)
        image = self.resize_image(image, resolution)
        # Appending image, fn, text, etc. to the lists above is not shown in the post.
    texts_raw = self.tokenizer(texts)  # dtype: torch.int32
    texts = self.token_embedding(texts_raw).type(torch.float16)
    texts = texts + self.positional_embedding.type(torch.float16)
    images_batch = self.preprocess_and_stack_images(images)
The error comes when I move the images_batch to the GPU.
"palette_caption": {
  "1.0": {
    "defaultVersion": true,
    "marName": "palette_caption.mar",
    "minWorkers": 1,
    "maxWorkers": 3,
    "batchSize": 4,
    "maxBatchDelay": 20,
    "responseTimeout": 180
  }
},
"palette_colorizer": {
  "1.0": {
    "defaultVersion": true,
    "marName": "palette_colorizer.mar",
    "minWorkers": 2,
    "maxWorkers": 4,
    "batchSize": 4,
    "maxBatchDelay": 20,
    "responseTimeout": 120
  }
},
"palette_ref_colorizer": {
  "1.0": {
    "defaultVersion": true,
    "marName": "palette_ref_colorizer.mar",
    "minWorkers": 1,
    "maxWorkers": 2,
    "batchSize": 4,
    "maxBatchDelay": 20,
    "responseTimeout": 120
  }
}
Pip freeze:
Repro instructions
Unfortunately, I can't find a way to reproduce the error; it randomly appears every 1-5 days.
Possible Solution
There are a few things that are a bit odd about this issue:
- The server has been running fine for over a year. I only made a few updates a few months back, and all of a sudden it started crashing frequently.
- I thought it was some edge case that crashed the server, but it only crashes some of the running instances.
- It happens randomly every 1-5 days, which is why I assumed it was a memory leak, but I can't find any evidence of one.
- I get a device-side assert triggered and CUDA out of memory. However, there seems to be plenty of available memory, and I check for NaN values and wrong shapes before placing the data on the GPU.
I've run out of ideas, so any thoughts or feedback would be much appreciated.
Hi @emilwallner, thanks for the extensive issue report.
My thoughts on this are:
- You're looking at the server after the crash, right? Meaning that the worker process has died and been restarted, and thus memory is back to normal.
- I can't find the line from your stack trace in your code, but I assume that it's basically the next line after your snippet. Detach does not create a copy of the data, so you should still have a single batch on the device.
- You're resizing the images with a resolution coming from the requests and then re-resizing the tensor in preprocess_and_stack_images to (3,768,768). Then you're stacking them along the channel dimension, creating e.g. (6,768,768), before you add a batch dimension with unsqueeze. Not sure about your model, but maybe it does something funky when it gets (1,6,768,768) instead of (2,3,768,768).
- What is your batch size? Did you try using batch_size=1 for some time?
- In the video there are multiple processes on the GPU. Do you use multiple workers for the same model?
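A small shape check for the concern raised in the list above. As posted, the handler calls torch.stack(..., dim=0), which for N tensors of shape (3, 768, 768) gives (N, 3, 768, 768); concatenating along dim 0 instead would give (3N, 768, 768), and unsqueeze(0) would then turn that into (1, 3N, 768, 768). The helpers below are hypothetical and only do the shape arithmetic on plain tuples, so they can be run without a GPU:

```python
EXPECTED_CHW = (3, 768, 768)  # channels, height, width expected by the model


def stacked_shape(n_images, chw=EXPECTED_CHW):
    """Shape produced by torch.stack on n_images tensors of shape chw (dim=0)."""
    return (n_images, *chw)


def concatenated_shape(n_images, chw=EXPECTED_CHW):
    """Shape produced by torch.cat on n_images tensors of shape chw (dim=0)."""
    return (n_images * chw[0], *chw[1:])


def is_valid_batch(shape, chw=EXPECTED_CHW):
    """True only for a 4-D (N, C, H, W) shape with N >= 1 and the expected C, H, W."""
    shape = tuple(shape)
    return len(shape) == 4 and shape[0] >= 1 and shape[1:] == tuple(chw)
```

In a handler, is_valid_batch(images_batch.shape) could be called right before the batch is moved to the device; it accepts (2, 3, 768, 768) and rejects both (6, 768, 768) and (1, 6, 768, 768).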
That's all I have for now, but I'm happy to continue spitballing and iterating over this until you find a solution!
Best Matthias
Really, really appreciate your input, @mreso!
- The worker crashes and returns 507 and doesn't recover.
- Yeah, I added detach to make sure requires_grad is set to False
- Yeah, that could be it
- I switched the batch size to 1 following your suggestion. Also, I check that it has the correct type and final batch size.
- Yes, multiple workers per model.
I also realized CUDA_LAUNCH_BLOCKING=1 reduces performance by about 70%, so I'll turn it off for now.
Here's my updated check:
def preprocess_and_stack_images(self, images):
    preprocessed_images = []
    for i, img in enumerate(images):
        try:
            preprocessed_img = self.resize_tensor(img)
            if preprocessed_img.shape != (3, 768, 768) or preprocessed_img.min() < 0 or preprocessed_img.max() > 1 or preprocessed_img.dtype != torch.float32:
                # Log information about the image that doesn't meet the requirements
                logger.info(f"Image {i} does not meet the requirements. Replacing with a blank image.")
                preprocessed_img = torch.zeros((3, 768, 768))
        except Exception as e:
            # Log the error message and load a blank image
            logger.error(f"Error occurred while processing Image {i}: {str(e)}. Loading a blank image.")
            preprocessed_img = torch.zeros((3, 768, 768))
        # Collect the result (this step is missing from the post but is needed for torch.stack below)
        preprocessed_images.append(preprocessed_img)
    images_batch = torch.stack(preprocessed_images, dim=0)
    if len(images_batch.shape) == 3:
        images_batch = images_batch.unsqueeze(0)
    # Second test: Check if the size is (1, 3, 768, 768)
    if images_batch.shape != (1, 3, 768, 768):
        # Log information about the batch that doesn't meet the requirements
        logger.info(f"Batch size {images_batch.shape} does not match the required shape (1, 3, 768, 768). Replacing with a blank batch.")
        images_batch = torch.zeros((1, 3, 768, 768))
    return images_batch
Again, really appreciate the brainstorming — let’s keep at it until we crack this!
Yeah, performance will suffer significantly from CUDA_LAUNCH_BLOCKING, as kernels will not run asynchronously. So only activate it if really necessary for debugging.
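One way to keep the flag available for debugging without paying the cost all the time is to set it from a switch at process start. A minimal sketch, assuming it runs before the process makes its first CUDA call (the variable has no effect once CUDA is initialized); the function name configure_cuda_debug is hypothetical:

```python
import os


def configure_cuda_debug(enabled):
    """Turn synchronous kernel launches on or off for this process.

    Returns the resulting value of CUDA_LAUNCH_BLOCKING (None when unset).
    """
    if enabled:
        # Kernel launches block, so errors surface at the call that caused them.
        os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    else:
        # Default asynchronous behaviour; errors may be reported at a later call.
        os.environ.pop("CUDA_LAUNCH_BLOCKING", None)
    return os.environ.get("CUDA_LAUNCH_BLOCKING")
```

With asynchronous launches, a device-side assert raised by an earlier kernel is often reported at the next synchronizing call, which can make an unrelated line such as a host-to-device copy look like the culprit.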
You could try to run the model in a notebook with a (1,6,768,768) input and observe the memory usage compared to (2,3,768,768). I'm wondering why this actually seems to work in the first place.
I haven’t tried the (1,6,768,768) input yet, but since our model is based on three channels, it should throw an error during execution.
Now I double-check the shape (1,3,768,768) and the dtype, and ensure the values are in the correct range. Despite that, I’m still hitting a CUDA error: device-side assert triggered when moving the batch with images_batch =
Got any more suggestions on what might be causing this?
Cross-post from here with a stacktrace pointing to a real indexing error.
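The thread does not say which index was out of range. One common source of such an error in a handler like the one posted is an embedding lookup with a token id outside the table, which trips a device-side assert on the GPU. A minimal sketch of a CPU-side check, assuming token ids are available as nested lists of integers; the function name and the vocabulary size in the usage note are hypothetical:

```python
def out_of_range_positions(token_ids, vocab_size):
    """Return (row, column, value) for every id outside [0, vocab_size)."""
    bad = []
    for r, row in enumerate(token_ids):
        for c, tok in enumerate(row):
            if not 0 <= tok < vocab_size:
                bad.append((r, c, tok))
    return bad
```

If token_embedding is a torch.nn.Embedding (an assumption), the check could be called as out_of_range_positions(texts_raw.tolist(), self.token_embedding.num_embeddings) before the lookup, logging and rejecting the request when the result is not empty.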
Please check the {{management_address}}/models/<registered_model_name> endpoint and monitor the following:
"jobQueueStatus": { "remainingCapacity": 100, "pendingRequests": 0 }
I found that this issue appears randomly when pendingRequests does not increase.
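A minimal sketch for polling that endpoint from a monitoring script, using only the standard library. It assumes the response is JSON containing the jobQueueStatus object shown above, either as a single object or as a list with one entry per model version; the function names are hypothetical:

```python
import json
import urllib.request


def queue_status(describe_payload):
    """Extract every jobQueueStatus object from a describe-model response."""
    if isinstance(describe_payload, list):
        entries = describe_payload
    else:
        entries = [describe_payload]
    return [e["jobQueueStatus"] for e in entries if "jobQueueStatus" in e]


def fetch_queue_status(management_address, model_name, timeout=5.0):
    """GET {management_address}/models/{model_name} and return its queue status."""
    url = f"{management_address}/models/{model_name}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return queue_status(json.load(resp))
```

For example, fetch_queue_status("http://localhost:8081", "palette_caption") would query a locally running server, assuming the management API listens on port 8081; logging the result periodically makes it possible to compare pendingRequests and remainingCapacity before and after a crash.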