
Use jit_compile=False for CenterNet to avoid XLA compilation during training

Open · TillBeemelmanns opened this issue 3 months ago · 1 comment

I am running into problems when using keras-cv/examples/training/object_detection_3d/waymo/train_pillars.py with Keras 3. Some of the layers constantly trigger XLA recompilation (probably the voxelization layer), causing very long step times and eventually an OOM crash. With `jit_compile=False` the problem does not appear (see the sketch after the version list below).

  • keras-cv==0.8.2
  • keras==3.1.1
  • tensorflow==2.16.1
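
For reference, the workaround is just passing `jit_compile=False` to `model.compile()`. A minimal sketch: the toy model below is a hypothetical stand-in for the MultiHeadCenterPillar that train_pillars.py builds, and only the `jit_compile` argument matters here:

```python
import keras

# Hypothetical stand-in model; in train_pillars.py the real
# MultiHeadCenterPillar is constructed by the script itself.
inputs = keras.Input(shape=(8,))
outputs = keras.layers.Dense(1)(inputs)
model = keras.Model(inputs, outputs)

model.compile(
    optimizer="adam",
    loss="mse",
    # Keras 3 defaults to jit_compile="auto"; forcing it off avoids the
    # repeated XLA recompilation observed with the voxelization layer.
    jit_compile=False,
)
```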

jit_compile=True

Model: "multi_head_center_pillar"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                  ┃ Output Shape              ┃         Param # ┃ Connected to               ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ point_mask (InputLayer)       │ (None, None, 1)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ point_xyz (InputLayer)        │ (None, None, 3)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ point_feature (InputLayer)    │ (None, None, 1)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ get_item (GetItem)            │ (None, None)              │               0 │ point_mask[0][0]           │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ dynamic_voxelization          │ (None, 512, 512, 128)     │           1,152 │ point_xyz[0][0],           │
│ (DynamicVoxelization)         │                           │                 │ point_feature[0][0],       │
│                               │                           │                 │ get_item[0][0]             │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ center_pillar_backbone        │ (None, 512, 512, 256)     │      19,286,656 │ dynamic_voxelization[0][0] │
│ (CenterPillarBackbone)        │                           │                 │                            │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ detection_head                │ [(None, 512, 512, 32),    │          12,336 │ center_pillar_backbone[0]… │
│ (MultiClassDetectionHead)     │ (None, 512, 512, 16)]     │                 │                            │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_class_1 (Identity)        │ (None, 512, 512, 32)      │               0 │ detection_head[0][0]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_class_2 (Identity)        │ (None, 512, 512, 16)      │               0 │ detection_head[0][1]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ heatmap_class_1 (Identity)    │ (None, 512, 512, 32)      │               0 │ detection_head[0][0]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ heatmap_class_2 (Identity)    │ (None, 512, 512, 16)      │               0 │ detection_head[0][1]       │
└───────────────────────────────┴───────────────────────────┴─────────────────┴────────────────────────────┘
 Total params: 19,300,144 (73.62 MB)
 Trainable params: 19,282,224 (73.56 MB)
 Non-trainable params: 17,920 (70.00 KB)
Epoch 1/50

Epoch 1: LearningRateScheduler setting learning rate to 0.0009048374610134307.
I0000 00:00:1711905600.921507 2492871 service.cc:145] XLA service 0x55e59d2bf900 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1711905600.921579 2492871 service.cc:153]   StreamExecutor device (0): NVIDIA A100-SXM4-40GB, Compute Capability 8.0
2024-03-31 17:20:14.187293: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2024-03-31 17:20:22.313706: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Constant folding an instruction is taking > 1s:

  %divide.2635 = f32[4,199600,3]{2,1,0} divide(f32[4,199600,3]{2,1,0} %constant.2632, f32[4,199600,3]{2,1,0} %broadcast.2634), metadata={op_type="RealDiv" op_name="multi_head_center_pillar_1/dynamic_voxelization_1/point_to_voxel_1/truediv" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1177}

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2024-03-31 17:20:24.539890: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 3.22656167s
Constant folding an instruction is taking > 1s:

  %divide.2635 = f32[4,199600,3]{2,1,0} divide(f32[4,199600,3]{2,1,0} %constant.2632, f32[4,199600,3]{2,1,0} %broadcast.2634), metadata={op_type="RealDiv" op_name="multi_head_center_pillar_1/dynamic_voxelization_1/point_to_voxel_1/truediv" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1177}

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2024-03-31 17:20:26.540595: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Constant folding an instruction is taking > 2s:

  %add.2638 = f32[4,199600,3]{2,1,0} add(f32[4,199600,3]{2,1,0} %constant.116, f32[4,199600,3]{2,1,0} %broadcast.2637), metadata={op_type="AddV2" op_name="multi_head_center_pillar_1/dynamic_voxelization_1/point_to_voxel_1/add" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1177}

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2024-03-31 17:20:27.899858: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 3.359643651s
Constant folding an instruction is taking > 2s:

  %add.2638 = f32[4,199600,3]{2,1,0} add(f32[4,199600,3]{2,1,0} %constant.116, f32[4,199600,3]{2,1,0} %broadcast.2637), metadata={op_type="AddV2" op_name="multi_head_center_pillar_1/dynamic_voxelization_1/point_to_voxel_1/add" source_file="/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/ops.py" source_line=1177}

This isn't necessarily a bug; constant-folding is inherently a trade-off between compilation time and speed at runtime. XLA has some guards that attempt to keep constant folding from taking too long, but fundamentally you'll always be able to come up with an input program that takes a long time.

If you'd like to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
2024-03-31 17:21:02.275112: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[4,128,512,512]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,128,512,512]{3,2,1,0}, f32[128,128,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:02.328674: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.053667725s
Trying algorithm eng0{} for conv (f32[4,128,512,512]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,128,512,512]{3,2,1,0}, f32[128,128,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:05.057390: W external/local_tsl/tsl/framework/bfc_allocator.cc:368] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2024-03-31 17:21:06.057462: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng3{k11=0} for conv (f32[4,256,128,128]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,128,128]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:06.214816: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.157470019s
Trying algorithm eng3{k11=0} for conv (f32[4,256,128,128]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,128,128]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:10.765531: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 130.03GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-03-31 17:21:12.544249: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[4,256,257,257]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,512,128,128]{3,2,1,0}, f32[512,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:13.134092: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.58994098s
Trying algorithm eng0{} for conv (f32[4,256,257,257]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,512,128,128]{3,2,1,0}, f32[512,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:17.903658: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[4,256,513,513]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,256,256]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:19.991362: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 3.087802008s
Trying algorithm eng0{} for conv (f32[4,256,513,513]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,256,256]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardInput", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:25.011483: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[4,256,512,512]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,512,512]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:28.214537: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 4.203145385s
Trying algorithm eng0{} for conv (f32[4,256,512,512]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,512,512]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:35.614430: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[4,256,256,256]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,513,513]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:35.658271: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.044057506s
Trying algorithm eng0{} for conv (f32[4,256,256,256]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,513,513]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:47.787391: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[128,128,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,128,512,512]{3,2,1,0}, f32[4,128,512,512]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:47.887323: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.099978354s
Trying algorithm eng0{} for conv (f32[128,128,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,128,512,512]{3,2,1,0}, f32[4,128,512,512]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:55.627222: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[512,512,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,512,128,128]{3,2,1,0}, f32[4,512,128,128]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:55.711771: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.084633684s
Trying algorithm eng0{} for conv (f32[512,512,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,512,128,128]{3,2,1,0}, f32[4,512,128,128]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:55.874372: W external/local_tsl/tsl/framework/bfc_allocator.cc:296] Allocator (GPU_0_bfc) ran out of memory trying to allocate 129.14GiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
2024-03-31 17:21:59.427328: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,256,256]{3,2,1,0}, f32[4,256,256,256]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:21:59.512810: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.085548719s
Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,256,256]{3,2,1,0}, f32[4,256,256,256]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:22:03.639446: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,513,513]{3,2,1,0}, f32[4,256,256,256]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:22:03.840970: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 1.201576719s
Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,513,513]{3,2,1,0}, f32[4,256,256,256]{3,2,1,0}), window={size=3x3 stride=2x2}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:22:08.992589: E external/local_xla/xla/service/slow_operation_alarm.cc:65] Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,512,512]{3,2,1,0}, f32[4,256,512,512]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
2024-03-31 17:22:12.369822: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 4.377403431s
Trying algorithm eng0{} for conv (f32[256,256,3,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,512,512]{3,2,1,0}, f32[4,256,512,512]{3,2,1,0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBackwardFilter", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}} is taking a while...
    4/39520 ━━━━━━━━━━━━━━━━━━━━ 484:59:51 44s/step - loss: 1106.4576
2024-03-31 17:25:18.613809: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1566] failed to allocate 576.0KiB (589824 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

jit_compile=False

Model: "multi_head_center_pillar"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                  ┃ Output Shape              ┃         Param # ┃ Connected to               ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ point_mask (InputLayer)       │ (None, None, 1)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ point_xyz (InputLayer)        │ (None, None, 3)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ point_feature (InputLayer)    │ (None, None, 1)           │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ get_item (GetItem)            │ (None, None)              │               0 │ point_mask[0][0]           │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ dynamic_voxelization          │ (None, 512, 512, 128)     │           1,152 │ point_xyz[0][0],           │
│ (DynamicVoxelization)         │                           │                 │ point_feature[0][0],       │
│                               │                           │                 │ get_item[0][0]             │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ center_pillar_backbone        │ (None, 512, 512, 256)     │      19,286,656 │ dynamic_voxelization[0][0] │
│ (CenterPillarBackbone)        │                           │                 │                            │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ detection_head                │ [(None, 512, 512, 32),    │          12,336 │ center_pillar_backbone[0]… │
│ (MultiClassDetectionHead)     │ (None, 512, 512, 16)]     │                 │                            │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_class_1 (Identity)        │ (None, 512, 512, 32)      │               0 │ detection_head[0][0]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_class_2 (Identity)        │ (None, 512, 512, 16)      │               0 │ detection_head[0][1]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ heatmap_class_1 (Identity)    │ (None, 512, 512, 32)      │               0 │ detection_head[0][0]       │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ heatmap_class_2 (Identity)    │ (None, 512, 512, 16)      │               0 │ detection_head[0][1]       │
└───────────────────────────────┴───────────────────────────┴─────────────────┴────────────────────────────┘
 Total params: 19,300,144 (73.62 MB)
 Trainable params: 19,282,224 (73.56 MB)
 Non-trainable params: 17,920 (70.00 KB)
Epoch 1/50

Epoch 1: LearningRateScheduler setting learning rate to 0.0009048374610134307.
  149/39520 ━━━━━━━━━━━━━━━━━━━━ 4:04:26 373ms/step - loss: 270.9218  
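
A plausible explanation for the constant recompilation (my assumption, not verified): the point-cloud inputs all have dynamic shapes ((None, None, 3) etc. in the summary above), so a batch with a new point count presents XLA with a new static shape, and work like the constant folding over f32[4,199600,3] in the logs runs again. If that is the cause, padding every frame to a fixed maximum point count before batching should give XLA a single program; a hypothetical sketch, where MAX_POINTS and the helper are illustrative and not part of the script:

```python
import tensorflow as tf

MAX_POINTS = 200_000  # assumed upper bound; the logs above show ~199,600 points

def pad_points(point_xyz, point_feature, point_mask):
    # Pad each per-frame tensor to MAX_POINTS so every batch has the same
    # static shape; zeros in point_mask mark the padded points as invalid.
    n = tf.shape(point_xyz)[0]
    pad = [[0, MAX_POINTS - n], [0, 0]]
    return (
        tf.pad(point_xyz, pad),
        tf.pad(point_feature, pad),
        tf.pad(point_mask, pad),
    )
```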

@divyashreepathihalli @sampathweb

TillBeemelmanns · Mar 31 '24, 17:03