candle
candle copied to clipboard
Metal: bound temporary buffer cache and prevent runaway memory usage on large softmax/broadcast/matmul workloads
Description:
On the Metal backend, large transformer/VLM workloads (e.g. Dots‑style OCR with a heavy vision tower and Qwen2‑style text tower) can cause the process RSS on macOS to grow to tens or even hundreds of GiB during a single forward pass, even though the model’s working set should fit comfortably in memory.
I use Instruments traces on a Dots‑style model show that the bulk of the “resource size” comes from a chain of large tensor ops:
candle_nn::ops::softmaxon 3D tensors[batch * heads, seq_len, total_len]- the underlying elementwise ops:
Tensor::max_keepdimTensor::broadcast_subTensor::expTensor::sum_keepdimTensor::broadcast_div
- plus large
Tensor::matmulandbroadcast_mulon matching shapes
Click to expand raw data
Resource Size Self Resource Size Symbol Names
390.07 GiB 66.5% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h56c4e8aa02cfcf3b
250.01 GiB 42.6% 0 Bytes candle_nn::ops::softmax::hf1a7e8c20702ed6b
84.00 GiB 14.3% 0 Bytes candle_core::tensor::Tensor::broadcast_div::h1f0e384f7eef2e07
84.00 GiB 14.3% 0 Bytes candle_core::tensor::Tensor::div::h9a2bcb60b7c25bce
84.00 GiB 14.3% 0 Bytes candle_core::storage::Storage::binary_impl::hfda09b175c7c7058
84.00 GiB 14.3% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h35f1c58c739baa69
84.00 GiB 14.3% 0 Bytes candle_core::metal_backend::MetalStorage::binary::h8a4fc6643c45827f
84.00 GiB 14.3% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
84.00 GiB 14.3% 0 Bytes candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
84.00 GiB 14.3% 0 Bytes candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
84.00 GiB 14.3% 0 Bytes objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
84.00 GiB 0.0% 0 Bytes _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
84.00 GiB 0.0% 0 Bytes objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
84.00 GiB 0.0% 0 Bytes objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
84.00 GiB 0.0% 0 Bytes _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
84.00 GiB 0.0% 0 Bytes -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
84.00 GiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
84.00 GiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
84.00 GiB 0.0% 84.00 GiB __kdebug_trace64
84.00 GiB 14.3% 0 Bytes candle_core::tensor::Tensor::exp::hadf5e168faedee1f
84.00 GiB 14.3% 0 Bytes candle_core::storage::Storage::unary_impl::h736da20777c1466c
84.00 GiB 14.3% 0 Bytes _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::unary_impl::he570d8aad3e1c738
84.00 GiB 14.3% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
84.00 GiB 14.3% 0 Bytes candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
84.00 GiB 14.3% 0 Bytes candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
84.00 GiB 14.3% 0 Bytes objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
84.00 GiB 0.0% 0 Bytes _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
84.00 GiB 0.0% 0 Bytes objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
84.00 GiB 0.0% 0 Bytes objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
84.00 GiB 0.0% 0 Bytes _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
84.00 GiB 0.0% 0 Bytes -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
84.00 GiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
84.00 GiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
84.00 GiB 0.0% 84.00 GiB __kdebug_trace64
82.00 GiB 14.0% 0 Bytes candle_core::tensor::Tensor::broadcast_sub::heb9fb3e79a80bbd1
82.00 GiB 0.0% 0 Bytes candle_core::tensor::Tensor::sub::h571175f735b5dfbb
82.00 GiB 0.0% 0 Bytes candle_core::storage::Storage::binary_impl::h5d345597b9bcf69f
82.00 GiB 0.0% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h35f1c58c739baa69
82.00 GiB 0.0% 0 Bytes candle_core::metal_backend::MetalStorage::binary::h8a4fc6643c45827f
82.00 GiB 0.0% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
82.00 GiB 0.0% 0 Bytes candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
82.00 GiB 0.0% 0 Bytes candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
82.00 GiB 0.0% 0 Bytes objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
82.00 GiB 0.0% 0 Bytes _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
82.00 GiB 0.0% 0 Bytes objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
82.00 GiB 0.0% 0 Bytes objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
82.00 GiB 0.0% 0 Bytes _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
82.00 GiB 0.0% 0 Bytes -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
82.00 GiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
82.00 GiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
82.00 GiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
82.00 GiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
82.00 GiB 0.0% 82.00 GiB __kdebug_trace64
7.50 MiB 0.0% 0 Bytes candle_core::tensor::Tensor::sum_keepdim::h648a11aa78a7272d
7.50 MiB 0.0% 0 Bytes candle_core::tensor::Tensor::sum_impl::h56c45f280ce49025
7.50 MiB 0.0% 0 Bytes candle_core::storage::Storage::reduce_op::ha2d9bc02bdba85cb
7.50 MiB 0.0% 0 Bytes _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::reduce_op::h6c1fa34e549af913
7.50 MiB 0.0% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
7.50 MiB 0.0% 0 Bytes candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
7.50 MiB 0.0% 0 Bytes candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
7.50 MiB 0.0% 0 Bytes objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
7.50 MiB 0.0% 0 Bytes _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
7.50 MiB 0.0% 0 Bytes objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
7.50 MiB 0.0% 0 Bytes objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
7.50 MiB 0.0% 0 Bytes _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
7.50 MiB 0.0% 0 Bytes -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
7.50 MiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
7.50 MiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
7.50 MiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
7.50 MiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
7.50 MiB 0.0% 7.50 MiB __kdebug_trace64
1.50 MiB 0.0% 0 Bytes candle_core::tensor::Tensor::max_keepdim::ha1543e08396d8976
1.50 MiB 0.0% 0 Bytes candle_core::tensor::Tensor::reduce_impl::h9b9a30dfacd26553
1.50 MiB 0.0% 0 Bytes candle_core::storage::Storage::reduce_op::ha2d9bc02bdba85cb
1.50 MiB 0.0% 0 Bytes _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::reduce_op::h6c1fa34e549af913
1.50 MiB 0.0% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
1.50 MiB 0.0% 0 Bytes candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
1.50 MiB 0.0% 0 Bytes candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
1.50 MiB 0.0% 0 Bytes objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
1.50 MiB 0.0% 0 Bytes _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
1.50 MiB 0.0% 0 Bytes objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
1.50 MiB 0.0% 0 Bytes objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
1.50 MiB 0.0% 0 Bytes _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
1.50 MiB 0.0% 0 Bytes -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
1.50 MiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
1.50 MiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
1.50 MiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
1.50 MiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
1.50 MiB 0.0% 1.50 MiB __kdebug_trace64
76.00 GiB 13.0% 0 Bytes candle_core::tensor::Tensor::broadcast_mul::h87612ac4e7a9271d
76.00 GiB 0.0% 0 Bytes candle_core::tensor::Tensor::mul::h0369c368b543f4c8
76.00 GiB 0.0% 0 Bytes candle_core::storage::Storage::binary_impl::h3594d7ee262558eb
76.00 GiB 0.0% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h35f1c58c739baa69
76.00 GiB 0.0% 0 Bytes candle_core::metal_backend::MetalStorage::binary::h8a4fc6643c45827f
76.00 GiB 0.0% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
76.00 GiB 0.0% 0 Bytes candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
76.00 GiB 0.0% 0 Bytes candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
76.00 GiB 0.0% 0 Bytes objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
76.00 GiB 0.0% 0 Bytes _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
76.00 GiB 0.0% 0 Bytes objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
76.00 GiB 0.0% 0 Bytes objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
76.00 GiB 0.0% 0 Bytes _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
76.00 GiB 0.0% 0 Bytes -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
76.00 GiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
76.00 GiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
76.00 GiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
76.00 GiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
76.00 GiB 0.0% 76.00 GiB __kdebug_trace64
64.00 GiB 10.9% 0 Bytes candle_core::tensor::Tensor::matmul::h78ebc324e001cd34
64.00 GiB 0.0% 0 Bytes candle_core::storage::Storage::matmul::h63e24b756488a23b
64.00 GiB 0.0% 0 Bytes _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::matmul::hcdf3ad195cd136a7
64.00 GiB 0.0% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
64.00 GiB 0.0% 0 Bytes candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
64.00 GiB 0.0% 0 Bytes candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
64.00 GiB 0.0% 0 Bytes objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
64.00 GiB 0.0% 0 Bytes _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
64.00 GiB 0.0% 0 Bytes objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
64.00 GiB 0.0% 0 Bytes objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
64.00 GiB 0.0% 0 Bytes _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
64.00 GiB 0.0% 0 Bytes -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
64.00 GiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 GiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 GiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
64.00 GiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
64.00 GiB 0.0% 64.00 GiB __kdebug_trace64
64.00 MiB 0.0% 0 Bytes candle_core::tensor::Tensor::reshape::h6f8d982e8766aede
64.00 MiB 0.0% 0 Bytes candle_core::device::Device::alloc_uninit::h6c7d3c31740be006
64.00 MiB 0.0% 0 Bytes candle_core::metal_backend::_$LT$impl$u20$candle_core..backend..BackendDevice$u20$for$u20$candle_core..metal_backend..device..MetalDevice$GT$::alloc_uninit::h3925f8c3eca38242
64.00 MiB 0.0% 0 Bytes _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
64.00 MiB 0.0% 0 Bytes candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
64.00 MiB 0.0% 0 Bytes candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
64.00 MiB 0.0% 0 Bytes objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
64.00 MiB 0.0% 0 Bytes _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
64.00 MiB 0.0% 0 Bytes objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
64.00 MiB 0.0% 0 Bytes objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
64.00 MiB 0.0% 0 Bytes _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
64.00 MiB 0.0% 0 Bytes -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
64.00 MiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 MiB 0.0% 0 Bytes -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 MiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
64.00 MiB 0.0% 0 Bytes -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
64.00 MiB 0.0% 64.00 MiB __kdebug_trace64
Each of these ops, on the Metal backend, allocates its output via:
MetalStorage::unary_impl/MetalStorage::binary/MetalStorage::matmulMetalDevice::new_buffer→MetalDevice::allocate_bufferDevice::new_buffer→MTLDevice::newBufferWithLength_options
Because command buffers are only flushed based on a fixed “kernels per command buffer” counter and we don’t track allocation volume, long sequences of these ops can allocate many large temporary buffers before any flush happens. Those buffers only become reusable after a distant flush, so peak RSS grows roughly with the sum of all intermediate outputs in a stage rather than the maximum.
Minimum exmaples to reproduce this issue
Cargo.toml
[dependencies]
candle-core = { git = "https://github.com/huggingface/candle", default-features = false,features = ["accelerate","metal"] }
candle-nn = { git = "https://github.com/huggingface/candle", default-features = false,features = ["accelerate","metal"] }
main.rs
use candle_core::{DType, Device, Result, Tensor, D};
fn main() -> Result<()> {
let dev = Device::new_metal(0)?;
let dtype = DType::BF16;
let batch = 1usize;
let num_heads = 12usize;
let head_dim = 1536 / num_heads;
let seq_len = 8192usize;
let total_len = seq_len;
let bh = batch * num_heads;
let q_flat = Tensor::randn(0f32, 1f32, (bh, seq_len, head_dim), &dev)?.to_dtype(dtype)?;
let k_tiled = Tensor::randn(0f32, 1f32, (bh, head_dim, total_len), &dev)?.to_dtype(dtype)?;
let v_flat = Tensor::randn(0f32, 1f32, (bh, total_len, head_dim), &dev)?.to_dtype(dtype)?;
let use_mask = true;
let mask = if use_mask {
Some(Tensor::randn(0f32, 1f32, (batch, 1, seq_len, total_len), &dev)?.to_dtype(dtype)?)
} else {
None
};
for _layer in 0..32 {
let mut scores = q_flat.matmul(&k_tiled)?;
let scale = 1.0f64 / (head_dim as f64).sqrt();
let scale_tensor = Tensor::full(scale as f32, (), scores.device())?.to_dtype(dtype)?;
scores = scores.broadcast_mul(&scale_tensor)?;
if let Some(mask) = &mask {
let expanded = mask
.expand((batch, num_heads, seq_len, total_len))?
.reshape((bh, seq_len, total_len))?;
scores = scores.add(&expanded)?;
}
let probs = candle_nn::ops::softmax(&scores, D::Minus1)?;
let _ctx = probs.matmul(&v_flat)?;
}
dev.synchronize()?;
Ok(())
}
Whats changed
This PR introduces a simple allocation policy for the Metal backend so that, once a configurable amount of new buffer memory has been allocated, we automatically synchronize and trim the reuse cache, giving the existing Metal buffer pooling a chance to recycle large temporaries and preventing runaway memory growth.
Look @pcuenca, it's the power of open source ✨ (Ref pointing out the memory spikes just the other day☺️)
This looks good to me. Just want to try it out in various contexts first :)
Indeed, I saw several softmax buffers when checking with Instruments!
It might be that sysctl_usize behaves correctly in all cases, if you want to get rid of the u64 -> usize conversions. Worth trying.
Btw, I can't guarantee the code I provided via the review is completely correct. It's all freehand :)
It might be that
sysctl_usizebehaves correctly in all cases, if you want to get rid of theu64 -> usizeconversions. Worth trying. Btw, I can't guarantee the code I provided via the review is completely correct. It's all freehand :)
Thanks for catching that. I'll look into removing those u64 to usize conversions first thing tomorrow. It's 8 AM here and I need to get some sleep, so I'd rather tackle this with a fresh mind.
I did a quick test, this is what I saw:
- The memory profile is much better behaved on macOS. This is what it looks like running Stable Diffusion:
The last spike is the VAE decoding phase; it's normal that memory grows. Memory is ~stable during the UNet forward steps. However, it increases a bit each step. This could point to some memory leak somewhere (including my test code).
- On iOS, I'm seeing deadlocks again. This may be a regression after the fixes in #3164 (not necessarily from this PR, I didn't check)
- I think we need to add
dep:libctocandle-core/Cargo.tomlfor themetalfeature.
I think we can merge this PR if we can determine that the iOS regression was introduced elsewhere; I'll run some more tests about that.
I think we need to add dep:libc to candle-core/Cargo.toml for the metal feature.
Rust std depends libc, so it's only when you're working with no_std targets that you would have to manually add the libc dependency (interesting situation where you both want no_std and libc, but it has probably happened plenty of times in the wild).
@TimmyOVO we haven't forgotten about you! :)
@pcuenca has been swamped recently so he hasn't had time to look into the issue he detected. Once he has time and we figure out the issue (if there is one) this will get merged asap
it's only when you're working with
no_stdtargets that you would have to manually add thelibcdependency
Also I this is a big fat lie. The actual answer is that all these changes are in candle-core which already has the libc dependency.