Registers are spilled to local memory when calculating embeddings
I'm trying to find a solution to a memory spill issue. I am running Perch on a few hundred GB of audio. When I run `perch/embed_audio.ipynb` I end up with a lot of spills into local memory, and it's not an issue I've had to trace before:
Environment:
- Python 3.11.9
- GCC 11.2.0
- NVIDIA-SMI 535.161.08
- Driver Version: 535.161.08
- CUDA Version: 12.2
- Tensorflow 2.16.1
- OS: Ubuntu 22.04.4 LTS
- VM: Azure Standard NC24ads A100 v4
- RAM 220 GB
- CPU 24x vCPU AMD EPYC™ 7V13 (Milan)
- GPU A100 80GB PCIe GPU card
I0000 00:00:1718394013.781000 6517 asm_compiler.cc:369] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_2', 24 bytes spill stores, 24 bytes spill loads
I0000 00:00:1718394013.794912 6502 asm_compiler.cc:369] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_3753', 52 bytes spill stores, 52 bytes spill loads
I0000 00:00:1718394013.866065 6524 asm_compiler.cc:369] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_3753', 220 bytes spill stores, 220 bytes spill loads
I0000 00:00:1718394014.030538 6506 asm_compiler.cc:369] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_2', 472 bytes spill stores, 304 bytes spill loads
This leads to the slow-execution warnings below. The script keeps going despite the register-spill warnings without falling over, but it runs very slowly. Could anyone suggest some pointers to solve this? Many thanks.
2024-06-14 19:40:16.455757: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[719,640,501,1]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,1,160640,1]{3,2,1,0}, f32[640,1,640,1]{3,2,1,0}), window={size=640x1 stride=320x1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:18.707113: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 3.251487411s
Trying algorithm eng0{} for conv (f32[719,640,501,1]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,1,160640,1]{3,2,1,0}, f32[640,1,640,1]{3,2,1,0}), window={size=640x1 stride=320x1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:20.156609: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng4{} for conv (f32[719,160,500,1]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,160,755,1]{3,2,1,0}, f32[160,1,256,1]{3,2,1,0}), window={size=256x1}, dim_labels=bf01_oi01->bf01, feature_group_count=160, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:22.419095: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 3.262581059s
Trying algorithm eng4{} for conv (f32[719,160,500,1]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,160,755,1]{3,2,1,0}, f32[160,1,256,1]{3,2,1,0}), window={size=256x1}, dim_labels=bf01_oi01->bf01, feature_group_count=160, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:28.401848: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,32,249,79]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,32,249,79]{3,2,1,0}, f32[32,1,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=32, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:28.672951: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.271200136s
Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,32,249,79]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,32,249,79]{3,2,1,0}, f32[32,1,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=32, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:37.274820: W external/local_tsl/tsl/framework/bfc_allocator.cc:368] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2024-06-14 19:40:38.274947: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng3{k11=0} for conv (f32[719,96,249,79]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,16,249,79]{3,2,1,0}, f32[96,16,1,1]{3,2,1,0}), window={size=1x1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:38.595592: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.320788364s
Trying algorithm eng3{k11=0} for conv (f32[719,96,249,79]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,16,249,79]{3,2,1,0}, f32[96,16,1,1]{3,2,1,0}), window={size=1x1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:41.298653: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,96,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,96,249,79]{3,2,1,0}, f32[96,1,3,3]{3,2,1,0}), window={size=3x3 stride=2x2 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=96, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:41.485093: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.186534591s
Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,96,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,96,249,79]{3,2,1,0}, f32[96,1,3,3]{3,2,1,0}), window={size=3x3 stride=2x2 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=96, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:42.485235: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng4{} for conv (f32[719,96,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,96,249,79]{3,2,1,0}, f32[96,1,3,3]{3,2,1,0}), window={size=3x3 stride=2x2 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=96, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:42.517285: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.032144481s
Trying algorithm eng4{} for conv (f32[719,96,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,96,249,79]{3,2,1,0}, f32[96,1,3,3]{3,2,1,0}), window={size=3x3 stride=2x2 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=96, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:47.752097: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,144,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,144,125,40]{3,2,1,0}, f32[144,1,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=144, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:48.225649: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.473644391s
Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,144,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,144,125,40]{3,2,1,0}, f32[144,1,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=144, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
0%| | 6/21119 [01:10<84:44:35, 14.45s/it] W0000 00:00:1718394066.767582 4752 assert_op.cc:38] Ignoring Assert operator jax2tf_infer_fn_/assert_equal_1/Assert/AssertGuard/Assert
Hi, Josh!
This sounds like one of the many issues we've had with TF 2.16. If you're running locally, you could try downgrading to TF 2.15 and seeing if it mitigates the problem. Edit the `pyproject.toml` file to change the dependency, then run `poetry lock` and `poetry update` (IIRC) to switch to TF 2.15.
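For reference, a minimal sketch of that change (the `~2.15` pin is my assumption for keeping the resolver on the 2.15 series; adjust to however the dependency is actually declared in your checkout of `pyproject.toml`):

```bash
# In pyproject.toml, change the TensorFlow dependency to the 2.15 series, e.g.:
#   tensorflow = "~2.15"
# Then re-resolve the lock file and install the downgraded packages:
poetry lock
poetry update
```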
Thanks for the tip - this worked well. I realised that TF 2.16 expects CUDA 12.3 and I hadn't tested that combination.
Ah, that makes sense - thanks for following up.
No problem - just tested CUDA 12.3 with TensorFlow 2.16.2 and the spill issue is still there.
Tested TensorFlow 2.16 and 2.18 on CUDA 12.2 and the issue is still there.
Googling around a bit suggests that this is due to a mismatch between the TF and CUDA versions. The current TF version is built against CUDA 12.3.
https://www.tensorflow.org/install/pip#software_requirements
As of now, the table there shows that TF 2.18 was tested against CUDA 12.5 (and, as far as I understand, a fresh installation defaults to that library), yet I still constantly receive this error during my epochs.
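For anyone else debugging this, a small sketch for checking the mismatch locally (assuming a GPU build of TensorFlow; the build-info keys shown are the usual ones but may vary between wheels):

```python
import tensorflow as tf

# Version of the installed TensorFlow wheel.
print("TensorFlow:", tf.__version__)

# Build metadata baked into the wheel; on GPU builds this includes the
# CUDA and cuDNN versions the binary was compiled against.
info = tf.sysconfig.get_build_info()
print("Built against CUDA:", info.get("cuda_version"))
print("Built against cuDNN:", info.get("cudnn_version"))

# Compare these against the toolkit/driver versions reported by
# `nvcc --version` and `nvidia-smi` on the host.
```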
Per the `pyproject.toml` file, we're currently building against TensorFlow 2.15:
tensorflow = "^2.15"
We can update to 2.18 relatively easily, if that's an issue.
Tested TensorFlow 2.16 on CUDA 12.8 and the issue is still there. Environment:
TensorFlow: 2.16.1
Python: 3.12.2
GCC: 11.4.0
NVIDIA-SMI: 565.77
Driver Version: 565.77
CUDA Version: 12.8 (Cuda compilation tools V12.8.61)
OS: Ubuntu 22.04.4 LTS (GNU/Linux 6.8.0-52-generic x86_64)
GPU: NVIDIA RTX A6000 48GB PCIe GPU card