candle Metal: bound temporary buffer cache and prevent runaway memory usage on large softmax/broadcast/matmul workloads

Description:

On the Metal backend, large transformer/VLM workloads (e.g. Dots‑style OCR with a heavy vision tower and Qwen2‑style text tower) can cause the process RSS on macOS to grow to tens or even hundreds of GiB during a single forward pass, even though the model’s working set should fit comfortably in memory.

I use Instruments traces on a Dots‑style model show that the bulk of the “resource size” comes from a chain of large tensor ops:

candle_nn::ops::softmax on 3D tensors [batch * heads, seq_len, total_len]
the underlying elementwise ops:
- Tensor::max_keepdim
- Tensor::broadcast_sub
- Tensor::exp
- Tensor::sum_keepdim
- Tensor::broadcast_div
plus large Tensor::matmul and broadcast_mul on matching shapes

Click to expand raw data

Resource Size	Self Resource Size	Symbol Names
390.07 GiB  66.5%	0 Bytes	                                       _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h56c4e8aa02cfcf3b
250.01 GiB  42.6%	0 Bytes	                                        candle_nn::ops::softmax::hf1a7e8c20702ed6b
84.00 GiB  14.3%	0 Bytes	                                         candle_core::tensor::Tensor::broadcast_div::h1f0e384f7eef2e07
84.00 GiB  14.3%	0 Bytes	                                          candle_core::tensor::Tensor::div::h9a2bcb60b7c25bce
84.00 GiB  14.3%	0 Bytes	                                           candle_core::storage::Storage::binary_impl::hfda09b175c7c7058
84.00 GiB  14.3%	0 Bytes	                                            _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h35f1c58c739baa69
84.00 GiB  14.3%	0 Bytes	                                             candle_core::metal_backend::MetalStorage::binary::h8a4fc6643c45827f
84.00 GiB  14.3%	0 Bytes	                                              _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
84.00 GiB  14.3%	0 Bytes	                                               candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
84.00 GiB  14.3%	0 Bytes	                                                candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
84.00 GiB  14.3%	0 Bytes	                                                 objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
84.00 GiB   0.0%	0 Bytes	                                                  _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
84.00 GiB   0.0%	0 Bytes	                                                   objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
84.00 GiB   0.0%	0 Bytes	                                                    objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
84.00 GiB   0.0%	0 Bytes	                                                     _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
84.00 GiB   0.0%	0 Bytes	                                                      -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                       -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                        -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                         -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
84.00 GiB   0.0%	0 Bytes	                                                          -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
84.00 GiB   0.0%	84.00 GiB	                                                           __kdebug_trace64
84.00 GiB  14.3%	0 Bytes	                                         candle_core::tensor::Tensor::exp::hadf5e168faedee1f
84.00 GiB  14.3%	0 Bytes	                                          candle_core::storage::Storage::unary_impl::h736da20777c1466c
84.00 GiB  14.3%	0 Bytes	                                           _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::unary_impl::he570d8aad3e1c738
84.00 GiB  14.3%	0 Bytes	                                            _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
84.00 GiB  14.3%	0 Bytes	                                             candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
84.00 GiB  14.3%	0 Bytes	                                              candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
84.00 GiB  14.3%	0 Bytes	                                               objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
84.00 GiB   0.0%	0 Bytes	                                                _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
84.00 GiB   0.0%	0 Bytes	                                                 objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
84.00 GiB   0.0%	0 Bytes	                                                  objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
84.00 GiB   0.0%	0 Bytes	                                                   _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
84.00 GiB   0.0%	0 Bytes	                                                    -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                     -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                      -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                       -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
84.00 GiB   0.0%	0 Bytes	                                                        -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
84.00 GiB   0.0%	84.00 GiB	                                                         __kdebug_trace64
82.00 GiB  14.0%	0 Bytes	                                         candle_core::tensor::Tensor::broadcast_sub::heb9fb3e79a80bbd1
82.00 GiB   0.0%	0 Bytes	                                          candle_core::tensor::Tensor::sub::h571175f735b5dfbb
82.00 GiB   0.0%	0 Bytes	                                           candle_core::storage::Storage::binary_impl::h5d345597b9bcf69f
82.00 GiB   0.0%	0 Bytes	                                            _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h35f1c58c739baa69
82.00 GiB   0.0%	0 Bytes	                                             candle_core::metal_backend::MetalStorage::binary::h8a4fc6643c45827f
82.00 GiB   0.0%	0 Bytes	                                              _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
82.00 GiB   0.0%	0 Bytes	                                               candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
82.00 GiB   0.0%	0 Bytes	                                                candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
82.00 GiB   0.0%	0 Bytes	                                                 objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
82.00 GiB   0.0%	0 Bytes	                                                  _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
82.00 GiB   0.0%	0 Bytes	                                                   objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
82.00 GiB   0.0%	0 Bytes	                                                    objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
82.00 GiB   0.0%	0 Bytes	                                                     _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
82.00 GiB   0.0%	0 Bytes	                                                      -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
82.00 GiB   0.0%	0 Bytes	                                                       -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
82.00 GiB   0.0%	0 Bytes	                                                        -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
82.00 GiB   0.0%	0 Bytes	                                                         -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
82.00 GiB   0.0%	0 Bytes	                                                          -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
82.00 GiB   0.0%	82.00 GiB	                                                           __kdebug_trace64
7.50 MiB   0.0%	0 Bytes	                                         candle_core::tensor::Tensor::sum_keepdim::h648a11aa78a7272d
7.50 MiB   0.0%	0 Bytes	                                          candle_core::tensor::Tensor::sum_impl::h56c45f280ce49025
7.50 MiB   0.0%	0 Bytes	                                           candle_core::storage::Storage::reduce_op::ha2d9bc02bdba85cb
7.50 MiB   0.0%	0 Bytes	                                            _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::reduce_op::h6c1fa34e549af913
7.50 MiB   0.0%	0 Bytes	                                             _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
7.50 MiB   0.0%	0 Bytes	                                              candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
7.50 MiB   0.0%	0 Bytes	                                               candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
7.50 MiB   0.0%	0 Bytes	                                                objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
7.50 MiB   0.0%	0 Bytes	                                                 _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
7.50 MiB   0.0%	0 Bytes	                                                  objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
7.50 MiB   0.0%	0 Bytes	                                                   objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
7.50 MiB   0.0%	0 Bytes	                                                    _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
7.50 MiB   0.0%	0 Bytes	                                                     -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
7.50 MiB   0.0%	0 Bytes	                                                      -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
7.50 MiB   0.0%	0 Bytes	                                                       -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
7.50 MiB   0.0%	0 Bytes	                                                        -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
7.50 MiB   0.0%	0 Bytes	                                                         -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
7.50 MiB   0.0%	7.50 MiB	                                                          __kdebug_trace64
1.50 MiB   0.0%	0 Bytes	                                         candle_core::tensor::Tensor::max_keepdim::ha1543e08396d8976
1.50 MiB   0.0%	0 Bytes	                                          candle_core::tensor::Tensor::reduce_impl::h9b9a30dfacd26553
1.50 MiB   0.0%	0 Bytes	                                           candle_core::storage::Storage::reduce_op::ha2d9bc02bdba85cb
1.50 MiB   0.0%	0 Bytes	                                            _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::reduce_op::h6c1fa34e549af913
1.50 MiB   0.0%	0 Bytes	                                             _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
1.50 MiB   0.0%	0 Bytes	                                              candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
1.50 MiB   0.0%	0 Bytes	                                               candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
1.50 MiB   0.0%	0 Bytes	                                                objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
1.50 MiB   0.0%	0 Bytes	                                                 _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
1.50 MiB   0.0%	0 Bytes	                                                  objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
1.50 MiB   0.0%	0 Bytes	                                                   objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
1.50 MiB   0.0%	0 Bytes	                                                    _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
1.50 MiB   0.0%	0 Bytes	                                                     -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
1.50 MiB   0.0%	0 Bytes	                                                      -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
1.50 MiB   0.0%	0 Bytes	                                                       -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
1.50 MiB   0.0%	0 Bytes	                                                        -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
1.50 MiB   0.0%	0 Bytes	                                                         -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
1.50 MiB   0.0%	1.50 MiB	                                                          __kdebug_trace64
76.00 GiB  13.0%	0 Bytes	                                        candle_core::tensor::Tensor::broadcast_mul::h87612ac4e7a9271d
76.00 GiB   0.0%	0 Bytes	                                         candle_core::tensor::Tensor::mul::h0369c368b543f4c8
76.00 GiB   0.0%	0 Bytes	                                          candle_core::storage::Storage::binary_impl::h3594d7ee262558eb
76.00 GiB   0.0%	0 Bytes	                                           _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h35f1c58c739baa69
76.00 GiB   0.0%	0 Bytes	                                            candle_core::metal_backend::MetalStorage::binary::h8a4fc6643c45827f
76.00 GiB   0.0%	0 Bytes	                                             _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
76.00 GiB   0.0%	0 Bytes	                                              candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
76.00 GiB   0.0%	0 Bytes	                                               candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
76.00 GiB   0.0%	0 Bytes	                                                objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
76.00 GiB   0.0%	0 Bytes	                                                 _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
76.00 GiB   0.0%	0 Bytes	                                                  objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
76.00 GiB   0.0%	0 Bytes	                                                   objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
76.00 GiB   0.0%	0 Bytes	                                                    _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
76.00 GiB   0.0%	0 Bytes	                                                     -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
76.00 GiB   0.0%	0 Bytes	                                                      -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
76.00 GiB   0.0%	0 Bytes	                                                       -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
76.00 GiB   0.0%	0 Bytes	                                                        -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
76.00 GiB   0.0%	0 Bytes	                                                         -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
76.00 GiB   0.0%	76.00 GiB	                                                          __kdebug_trace64
64.00 GiB  10.9%	0 Bytes	                                        candle_core::tensor::Tensor::matmul::h78ebc324e001cd34
64.00 GiB   0.0%	0 Bytes	                                         candle_core::storage::Storage::matmul::h63e24b756488a23b
64.00 GiB   0.0%	0 Bytes	                                          _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::matmul::hcdf3ad195cd136a7
64.00 GiB   0.0%	0 Bytes	                                           _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
64.00 GiB   0.0%	0 Bytes	                                            candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
64.00 GiB   0.0%	0 Bytes	                                             candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
64.00 GiB   0.0%	0 Bytes	                                              objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
64.00 GiB   0.0%	0 Bytes	                                               _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
64.00 GiB   0.0%	0 Bytes	                                                objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
64.00 GiB   0.0%	0 Bytes	                                                 objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
64.00 GiB   0.0%	0 Bytes	                                                  _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
64.00 GiB   0.0%	0 Bytes	                                                   -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
64.00 GiB   0.0%	0 Bytes	                                                    -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 GiB   0.0%	0 Bytes	                                                     -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 GiB   0.0%	0 Bytes	                                                      -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
64.00 GiB   0.0%	0 Bytes	                                                       -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
64.00 GiB   0.0%	64.00 GiB	                                                        __kdebug_trace64
64.00 MiB   0.0%	0 Bytes	                                        candle_core::tensor::Tensor::reshape::h6f8d982e8766aede
64.00 MiB   0.0%	0 Bytes	                                         candle_core::device::Device::alloc_uninit::h6c7d3c31740be006
64.00 MiB   0.0%	0 Bytes	                                          candle_core::metal_backend::_$LT$impl$u20$candle_core..backend..BackendDevice$u20$for$u20$candle_core..metal_backend..device..MetalDevice$GT$::alloc_uninit::h3925f8c3eca38242
64.00 MiB   0.0%	0 Bytes	                                           _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
64.00 MiB   0.0%	0 Bytes	                                            candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
64.00 MiB   0.0%	0 Bytes	                                             candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
64.00 MiB   0.0%	0 Bytes	                                              objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
64.00 MiB   0.0%	0 Bytes	                                               _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
64.00 MiB   0.0%	0 Bytes	                                                objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
64.00 MiB   0.0%	0 Bytes	                                                 objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
64.00 MiB   0.0%	0 Bytes	                                                  _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
64.00 MiB   0.0%	0 Bytes	                                                   -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
64.00 MiB   0.0%	0 Bytes	                                                    -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 MiB   0.0%	0 Bytes	                                                     -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 MiB   0.0%	0 Bytes	                                                      -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
64.00 MiB   0.0%	0 Bytes	                                                       -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
64.00 MiB   0.0%	64.00 MiB	                                                        __kdebug_trace64

Each of these ops, on the Metal backend, allocates its output via:

MetalStorage::unary_impl / MetalStorage::binary / MetalStorage::matmul
MetalDevice::new_buffer → MetalDevice::allocate_buffer
Device::new_buffer → MTLDevice::newBufferWithLength_options

Because command buffers are only flushed based on a fixed “kernels per command buffer” counter and we don’t track allocation volume, long sequences of these ops can allocate many large temporary buffers before any flush happens. Those buffers only become reusable after a distant flush, so peak RSS grows roughly with the sum of all intermediate outputs in a stage rather than the maximum.

Minimum exmaples to reproduce this issue

Cargo.toml

[dependencies]
candle-core = { git = "https://github.com/huggingface/candle", default-features = false,features = ["accelerate","metal"] }
candle-nn = { git = "https://github.com/huggingface/candle", default-features = false,features = ["accelerate","metal"] }

main.rs

use candle_core::{DType, Device, Result, Tensor, D};

fn main() -> Result<()> {
    let dev = Device::new_metal(0)?;
    let dtype = DType::BF16;

    let batch = 1usize;
    let num_heads = 12usize;
    let head_dim = 1536 / num_heads;
    let seq_len = 8192usize;
    let total_len = seq_len;

    let bh = batch * num_heads;

    let q_flat = Tensor::randn(0f32, 1f32, (bh, seq_len, head_dim), &dev)?.to_dtype(dtype)?;
    let k_tiled = Tensor::randn(0f32, 1f32, (bh, head_dim, total_len), &dev)?.to_dtype(dtype)?;
    let v_flat = Tensor::randn(0f32, 1f32, (bh, total_len, head_dim), &dev)?.to_dtype(dtype)?;

    let use_mask = true;
    let mask = if use_mask {
        Some(Tensor::randn(0f32, 1f32, (batch, 1, seq_len, total_len), &dev)?.to_dtype(dtype)?)
    } else {
        None
    };

    for _layer in 0..32 {
        let mut scores = q_flat.matmul(&k_tiled)?;
        let scale = 1.0f64 / (head_dim as f64).sqrt();
        let scale_tensor = Tensor::full(scale as f32, (), scores.device())?.to_dtype(dtype)?;
        scores = scores.broadcast_mul(&scale_tensor)?;
        if let Some(mask) = &mask {
            let expanded = mask
                .expand((batch, num_heads, seq_len, total_len))?
                .reshape((bh, seq_len, total_len))?;
            scores = scores.add(&expanded)?;
        }

        let probs = candle_nn::ops::softmax(&scores, D::Minus1)?;
        let _ctx = probs.matmul(&v_flat)?;
    }

    dev.synchronize()?;
    Ok(())
}

Whats changed

This PR introduces a simple allocation policy for the Metal backend so that, once a configurable amount of new buffer memory has been allocated, we automatically synchronize and trim the reuse cache, giving the existing Metal buffer pooling a chance to recycle large temporaries and preventing runaway memory growth.

Nov 17 '25 22:11 TimmyOVO

Look @pcuenca, it's the power of open source ✨ (Ref pointing out the memory spikes just the other day☺️)

Nov 18 '25 11:11 ivarflakstad

This looks good to me. Just want to try it out in various contexts first :)

Nov 18 '25 11:11 ivarflakstad

Indeed, I saw several softmax buffers when checking with Instruments!

Nov 18 '25 12:11 pcuenca

It might be that sysctl_usize behaves correctly in all cases, if you want to get rid of the u64 -> usize conversions. Worth trying. Btw, I can't guarantee the code I provided via the review is completely correct. It's all freehand :)

Nov 18 '25 22:11 ivarflakstad

It might be that sysctl_usize behaves correctly in all cases, if you want to get rid of the u64 -> usize conversions. Worth trying. Btw, I can't guarantee the code I provided via the review is completely correct. It's all freehand :)

Thanks for catching that. I'll look into removing those u64 to usize conversions first thing tomorrow. It's 8 AM here and I need to get some sleep, so I'd rather tackle this with a fresh mind.

Nov 18 '25 23:11 TimmyOVO

I did a quick test, this is what I saw:

The memory profile is much better behaved on macOS. This is what it looks like running Stable Diffusion:

The last spike is the VAE decoding phase; it's normal that memory grows. Memory is ~stable during the UNet forward steps. However, it increases a bit each step. This could point to some memory leak somewhere (including my test code).

On iOS, I'm seeing deadlocks again. This may be a regression after the fixes in #3164 (not necessarily from this PR, I didn't check)
I think we need to add dep:libc to candle-core/Cargo.toml for the metal feature.

I think we can merge this PR if we can determine that the iOS regression was introduced elsewhere; I'll run some more tests about that.

Nov 21 '25 18:11 pcuenca

I think we need to add dep:libc to candle-core/Cargo.toml for the metal feature.

Rust std depends libc, so it's only when you're working with no_std targets that you would have to manually add the libc dependency (interesting situation where you both want no_std and libc, but it has probably happened plenty of times in the wild).

Nov 23 '25 10:11 ivarflakstad

@TimmyOVO we haven't forgotten about you! :)

@pcuenca has been swamped recently so he hasn't had time to look into the issue he detected. Once he has time and we figure out the issue (if there is one) this will get merged asap

Dec 03 '25 11:12 ivarflakstad

it's only when you're working with no_std targets that you would have to manually add the libc dependency

Also I this is a big fat lie. The actual answer is that all these changes are in candle-core which already has the libc dependency.

Dec 03 '25 11:12 ivarflakstad

candle candle copied to clipboard

Metal: bound temporary buffer cache and prevent runaway memory usage on large softmax/broadcast/matmul workloads

Description:

Minimum exmaples to reproduce this issue

Whats changed

candle
candle copied to clipboard