Registers are spilled to local memory when calculating embeddings
I'm trying to find a solution to a memory spill issue. I am running Perch on a few hundred GB of audio. When I run `perch/embed_audio.ipynb` I end up with a lot of spills into local memory, and it's not an issue I've had to trace before:
Environment:
- Python 3.11.9
- GCC 11.2.0
- NVIDIA-SMI 535.161.08
- Driver Version: 535.161.08
- CUDA Version: 12.2
- Tensorflow 2.16.1
- OS: Ubuntu 22.04.4 LTS
- VM: Azure Standard NC24ads A100 v4
- RAM 220 GB
- CPU 24x vCPU AMD EPYC™ 7V13 (Milan)
- GPU A100 80GB PCIe GPU card
I0000 00:00:1718394013.781000 6517 asm_compiler.cc:369] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_2', 24 bytes spill stores, 24 bytes spill loads
I0000 00:00:1718394013.794912 6502 asm_compiler.cc:369] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_3753', 52 bytes spill stores, 52 bytes spill loads
I0000 00:00:1718394013.866065 6524 asm_compiler.cc:369] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_3753', 220 bytes spill stores, 220 bytes spill loads
I0000 00:00:1718394014.030538 6506 asm_compiler.cc:369] ptxas warning : Registers are spilled to local memory in function 'triton_gemm_dot_2', 472 bytes spill stores, 304 bytes spill loads
This leads to the slow-execution warnings below. The script keeps going despite the register-spill warnings without falling over, but it runs very slowly. Could anyone suggest some pointers to solve this? Many thanks.
2024-06-14 19:40:16.455757: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[719,640,501,1]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,1,160640,1]{3,2,1,0}, f32[640,1,640,1]{3,2,1,0}), window={size=640x1 stride=320x1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:18.707113: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 3.251487411s
Trying algorithm eng0{} for conv (f32[719,640,501,1]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,1,160640,1]{3,2,1,0}, f32[640,1,640,1]{3,2,1,0}), window={size=640x1 stride=320x1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:20.156609: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng4{} for conv (f32[719,160,500,1]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,160,755,1]{3,2,1,0}, f32[160,1,256,1]{3,2,1,0}), window={size=256x1}, dim_labels=bf01_oi01->bf01, feature_group_count=160, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:22.419095: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 3.262581059s
Trying algorithm eng4{} for conv (f32[719,160,500,1]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,160,755,1]{3,2,1,0}, f32[160,1,256,1]{3,2,1,0}), window={size=256x1}, dim_labels=bf01_oi01->bf01, feature_group_count=160, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:28.401848: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,32,249,79]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,32,249,79]{3,2,1,0}, f32[32,1,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=32, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:28.672951: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.271200136s
Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,32,249,79]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,32,249,79]{3,2,1,0}, f32[32,1,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=32, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:37.274820: W external/local_tsl/tsl/framework/bfc_allocator.cc:368] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2024-06-14 19:40:38.274947: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng3{k11=0} for conv (f32[719,96,249,79]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,16,249,79]{3,2,1,0}, f32[96,16,1,1]{3,2,1,0}), window={size=1x1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:38.595592: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.320788364s
Trying algorithm eng3{k11=0} for conv (f32[719,96,249,79]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,16,249,79]{3,2,1,0}, f32[96,16,1,1]{3,2,1,0}), window={size=1x1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:41.298653: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,96,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,96,249,79]{3,2,1,0}, f32[96,1,3,3]{3,2,1,0}), window={size=3x3 stride=2x2 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=96, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:41.485093: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.186534591s
Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,96,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,96,249,79]{3,2,1,0}, f32[96,1,3,3]{3,2,1,0}), window={size=3x3 stride=2x2 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=96, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:42.485235: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng4{} for conv (f32[719,96,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,96,249,79]{3,2,1,0}, f32[96,1,3,3]{3,2,1,0}), window={size=3x3 stride=2x2 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=96, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:42.517285: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.032144481s
Trying algorithm eng4{} for conv (f32[719,96,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,96,249,79]{3,2,1,0}, f32[96,1,3,3]{3,2,1,0}), window={size=3x3 stride=2x2 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=96, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:47.752097: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,144,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,144,125,40]{3,2,1,0}, f32[144,1,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=144, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-06-14 19:40:48.225649: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.473644391s
Trying algorithm eng46{k2=5,k5=3,k14=4} for conv (f32[719,144,125,40]{3,2,1,0}, u8[0]{0}) custom-call(f32[719,144,125,40]{3,2,1,0}, f32[144,1,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, feature_group_count=144, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
0%| | 6/21119 [01:10<84:44:35, 14.45s/it] W0000 00:00:1718394066.767582 4752 assert_op.cc:38] Ignoring Assert operator jax2tf_infer_fn_/assert_equal_1/Assert/AssertGuard/Assert
Hi, Josh!
This sounds like one of the many issues we've had with TF 2.16. If you're running locally, you could try downgrading to TF 2.15 and seeing if it mitigates the problem. Edit the `pyproject.toml` file to change the dependency, then run `poetry lock` and `poetry update` (IIRC) to switch to TF 2.15.
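For reference, a minimal sketch of that change (the `~2.15` pin is my assumption for keeping the resolver on the 2.15 series; adjust to however the dependency is actually declared in your checkout of `pyproject.toml`):

```bash
# In pyproject.toml, change the TensorFlow dependency to the 2.15 series, e.g.:
#   tensorflow = "~2.15"
# Then re-resolve the lock file and install the downgraded packages:
poetry lock
poetry update
```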
Thanks for the tip - this worked well. I realised that TF 2.16 expects CUDA 12.3 and I hadn't tested that combination.
Ah, that makes sense - thanks for following up.
No problem - just tested CUDA 12.3 with TensorFlow 2.16.2 and the spill issue is still there.
Tested TensorFlow 2.16 and 2.18 on CUDA 12.2 and the issue is still there.
Googling around a bit suggests that this is due to a mismatch between the TF and CUDA versions. The current TF version is built against CUDA 12.3.
https://www.tensorflow.org/install/pip#software_requirements
As of now, the table there shows that TF 2.18 was tested against CUDA 12.5 (and, as far as I understand, a fresh installation defaults to that library), yet I still constantly receive this error during my epochs.
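For anyone else debugging this, a small sketch for checking the mismatch locally (assuming a GPU build of TensorFlow; the build-info keys shown are the usual ones but may vary between wheels):

```python
import tensorflow as tf

# Version of the installed TensorFlow wheel.
print("TensorFlow:", tf.__version__)

# Build metadata baked into the wheel; on GPU builds this includes the
# CUDA and cuDNN versions the binary was compiled against.
info = tf.sysconfig.get_build_info()
print("Built against CUDA:", info.get("cuda_version"))
print("Built against cuDNN:", info.get("cudnn_version"))

# Compare these against the toolkit/driver versions reported by
# `nvcc --version` and `nvidia-smi` on the host.
```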
Per the `pyproject.toml` file, we're currently building against TensorFlow 2.15:
tensorflow = "^2.15"
We can update to 2.18 relatively easily, if that's an issue.
Tested TensorFlow 2.16 on CUDA 12.8 and the issue is still there. Environment:
TensorFlow: 2.16.1
Python: 3.12.2
GCC: 11.4.0
NVIDIA-SMI: 565.77
Driver Version: 565.77
CUDA Version: 12.8 (Cuda compilation tools V12.8.61)
OS: Ubuntu 22.04.4 LTS (GNU/Linux 6.8.0-52-generic x86_64)
GPU: NVIDIA RTX A6000 48GB PCIe GPU card