
[Bug][Metaschedule] Tuning trial hanging after

Open slyubomirsky opened this issue 2 years ago • 0 comments

I encountered this when trying to run this script over RPC on machines with V100s. Although this was done using Relax, @zxybazh thinks it can probably be triggered on mainline as well.

I ran ResNet-50 on a V100 with an input shape of (1, 3, 224, 224), using 5 tuning trials. Tuning started hanging on the first tuning task, fused_conv2d_add_relu. Failures appear to have been encountered during that task.

Output from the host:

  input_name: input0
  input_shape: [1, 3, 224, 224]
  input_dtype: float32
/home/ubuntu/tvm-runtime/python/tvm/driver/build_module.py:267: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
  warnings.warn(
INFO:tvm.meta_schedule.runner.rpc_runner:RPCRunner: max_workers = 2
INFO:tvm.meta_schedule.tune:Working directory: /home/ubuntu/dump/
2022-08-05 12:13:55.897 INFO Logging directory: /home/ubuntu/dump/logs
2022-08-05 12:13:55.897 INFO Working directory: /home/ubuntu/dump/
2022-08-05 12:13:55.898 INFO Creating JSONDatabase. Workload at: /home/ubuntu/dump/database_workload.json. Tuning records at: /home/ubuntu/dump/database_tuning_record.json
2022-08-05 12:13:56.063 INFO LocalBuilder: max_workers = 24
2022-08-05 12:13:56.388 INFO Initializing Task #0: "layout_transform"
2022-08-05 12:13:56.459 INFO Initializing Task #1: "fused_conv2d_add_relu"
2022-08-05 12:13:56.726 INFO Initializing Task #2: "max_pool2d"
2022-08-05 12:13:56.866 INFO Initializing Task #3: "fused_conv2d1_add1_relu1"
2022-08-05 12:13:57.114 INFO Initializing Task #4: "fused_contrib_conv2d_winograd_without_weight_transform_add1_relu1"
2022-08-05 12:13:58.024 INFO Initializing Task #5: "fused_conv2d2_add2"
2022-08-05 12:13:58.231 INFO Initializing Task #6: "fused_conv2d2_add2_add3_relu2"
2022-08-05 12:13:58.532 INFO Initializing Task #7: "fused_conv2d3_add1_relu1"
2022-08-05 12:13:58.784 INFO Initializing Task #8: "fused_conv2d4_add4_relu3"
2022-08-05 12:13:59.033 INFO Initializing Task #9: "fused_conv2d5_add5_relu4"
2022-08-05 12:13:59.301 INFO Initializing Task #10: "fused_conv2d7_add6"
2022-08-05 12:13:59.518 INFO Initializing Task #11: "fused_conv2d6_add6_add7_relu5"
2022-08-05 12:13:59.823 INFO Initializing Task #12: "fused_conv2d8_add5_relu4"
2022-08-05 12:14:00.077 INFO Initializing Task #13: "fused_contrib_conv2d_winograd_without_weight_transform1_add5_relu4"
2022-08-05 12:14:00.771 INFO Initializing Task #14: "fused_conv2d9_add8_relu6"
2022-08-05 12:14:01.022 INFO Initializing Task #15: "fused_conv2d10_add9_relu7"
2022-08-05 12:14:01.290 INFO Initializing Task #16: "fused_conv2d12_add10"
2022-08-05 12:14:01.504 INFO Initializing Task #17: "fused_conv2d11_add10_add11_relu8"
2022-08-05 12:14:01.806 INFO Initializing Task #18: "fused_conv2d13_add9_relu7"
2022-08-05 12:14:02.057 INFO Initializing Task #19: "fused_contrib_conv2d_winograd_without_weight_transform2_add9_relu7"
2022-08-05 12:14:02.753 INFO Initializing Task #20: "fused_conv2d14_add12_relu9"
2022-08-05 12:14:03.003 INFO Initializing Task #21: "fused_conv2d15_add13_relu10"
2022-08-05 12:14:03.272 INFO Initializing Task #22: "fused_conv2d17_add14"
2022-08-05 12:14:03.486 INFO Initializing Task #23: "fused_conv2d16_add14_add15_relu11"
2022-08-05 12:14:03.788 INFO Initializing Task #24: "fused_conv2d18_add13_relu10"
2022-08-05 12:14:04.039 INFO Initializing Task #25: "fused_contrib_conv2d_winograd_without_weight_transform3_add13_relu10"
2022-08-05 12:14:04.739 INFO Initializing Task #26: "adaptive_avg_pool2d"
2022-08-05 12:14:04.865 INFO Initializing Task #27: "fused_layout_transform1_reshape_squeeze"
2022-08-05 12:14:05.006 INFO Initializing Task #28: "fused_dense_add16"
2022-08-05 12:14:05.113 INFO 
 ID |                                                                 Name |      FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Terminated 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0 |                                                     layout_transform |         1 |      1 |            N/A |          N/A |                   N/A |      0 |            
  1 |                                                fused_conv2d_add_relu | 237633536 |      1 |            N/A |          N/A |                   N/A |      0 |            
  2 |                                                           max_pool2d |   1806336 |      1 |            N/A |          N/A |                   N/A |      0 |            
  3 |                                             fused_conv2d1_add1_relu1 |  26091520 |      1 |            N/A |          N/A |                   N/A |      0 |            
  4 |    fused_contrib_conv2d_winograd_without_weight_transform_add1_relu1 | 128651264 |      3 |            N/A |          N/A |                   N/A |      0 |            
  5 |                                                   fused_conv2d2_add2 | 103563264 |      1 |            N/A |          N/A |                   N/A |      0 |            
  6 |                                        fused_conv2d2_add2_add3_relu2 | 105168896 |      3 |            N/A |          N/A |                   N/A |      0 |            
  7 |                                             fused_conv2d3_add1_relu1 | 103161856 |      2 |            N/A |          N/A |                   N/A |      0 |            
  8 |                                             fused_conv2d4_add4_relu3 | 206323712 |      1 |            N/A |          N/A |                   N/A |      0 |            
  9 |                                             fused_conv2d5_add5_relu4 | 231411712 |      1 |            N/A |          N/A |                   N/A |      0 |            
 10 |                                                   fused_conv2d7_add6 | 205922304 |      1 |            N/A |          N/A |                   N/A |      0 |            
 11 |                                        fused_conv2d6_add6_add7_relu5 | 103964672 |      4 |            N/A |          N/A |                   N/A |      0 |            
 12 |                                             fused_conv2d8_add5_relu4 | 102961152 |      3 |            N/A |          N/A |                   N/A |      0 |            
 13 |   fused_contrib_conv2d_winograd_without_weight_transform1_add5_relu4 | 127045632 |      3 |            N/A |          N/A |                   N/A |      0 |            
 14 |                                             fused_conv2d9_add8_relu6 | 205922304 |      1 |            N/A |          N/A |                   N/A |      0 |            
 15 |                                            fused_conv2d10_add9_relu7 | 231311360 |      1 |            N/A |          N/A |                   N/A |      0 |            
 16 |                                                 fused_conv2d12_add10 | 205721600 |      1 |            N/A |          N/A |                   N/A |      0 |            
 17 |                                     fused_conv2d11_add10_add11_relu8 | 103362560 |      6 |            N/A |          N/A |                   N/A |      0 |            
 18 |                                            fused_conv2d13_add9_relu7 | 102860800 |      5 |            N/A |          N/A |                   N/A |      0 |            
 19 |   fused_contrib_conv2d_winograd_without_weight_transform2_add9_relu7 | 114903040 |      5 |            N/A |          N/A |                   N/A |      0 |            
 20 |                                           fused_conv2d14_add12_relu9 | 205721600 |      1 |            N/A |          N/A |                   N/A |      0 |            
 21 |                                          fused_conv2d15_add13_relu10 | 231261184 |      1 |            N/A |          N/A |                   N/A |      0 |            
 22 |                                                 fused_conv2d17_add14 | 205621248 |      1 |            N/A |          N/A |                   N/A |      0 |            
 23 |                                    fused_conv2d16_add14_add15_relu11 | 103061504 |      3 |            N/A |          N/A |                   N/A |      0 |            
 24 |                                          fused_conv2d18_add13_relu10 | 102810624 |      2 |            N/A |          N/A |                   N/A |      0 |            
 25 | fused_contrib_conv2d_winograd_without_weight_transform3_add13_relu10 | 142132224 |      2 |            N/A |          N/A |                   N/A |      0 |            
 26 |                                                  adaptive_avg_pool2d |    102400 |      1 |            N/A |          N/A |                   N/A |      0 |            
 27 |                              fused_layout_transform1_reshape_squeeze |         1 |      1 |            N/A |          N/A |                   N/A |      0 |            
 28 |                                                    fused_dense_add16 |   4097000 |      1 |            N/A |          N/A |                   N/A |      0 |            
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 0
Total latency (us): 0

2022-08-05 12:14:05.114 INFO Scheduler picks Task #0: "layout_transform"
2022-08-05 12:14:06.380 INFO Sending 6 sample(s) to builder
2022-08-05 12:14:06.713 INFO Sending 6 sample(s) to runner
2022-08-05 12:14:06.713 INFO Scheduler picks Task #1: "fused_conv2d_add_relu"

The tail of the log for task 1 (excerpted, as it goes on for a long time):

[etc]
2022-08-05 12:36:14.188 INFO Sample-Init-Population summary:
Postproc #0 [meta_schedule.DisallowDynamicLoop(0x423d1d8)]: 0 failure(s)
Postproc #1 [meta_schedule.RewriteCooperativeFetch(0xf2ad228)]: 0 failure(s)
Postproc #2 [meta_schedule.RewriteUnboundBlock(0xf2ad258)]: 0 failure(s)
Postproc #3 [meta_schedule.RewriteParallelVectorizeUnroll(0x3d6c8e8)]: 0 failure(s)
Postproc #4 [meta_schedule.RewriteReductionBlock(0x4d449e8)]: 0 failure(s)
Postproc #5 [meta_schedule.VerifyGPUCode(0x45933f8)]: 1685504 failure(s)
2022-08-05 12:36:15.803 INFO Sample-Init-Population summary:
Postproc #0 [meta_schedule.DisallowDynamicLoop(0x423d1d8)]: 0 failure(s)
Postproc #1 [meta_schedule.RewriteCooperativeFetch(0xf2ad228)]: 0 failure(s)
Postproc #2 [meta_schedule.RewriteUnboundBlock(0xf2ad258)]: 0 failure(s)
Postproc #3 [meta_schedule.RewriteParallelVectorizeUnroll(0x3d6c8e8)]: 0 failure(s)
Postproc #4 [meta_schedule.RewriteReductionBlock(0x4d449e8)]: 0 failure(s)
Postproc #5 [meta_schedule.VerifyGPUCode(0x45933f8)]: 1687552 failure(s)
2022-08-05 12:36:17.411 INFO Sample-Init-Population summary:
Postproc #0 [meta_schedule.DisallowDynamicLoop(0x423d1d8)]: 0 failure(s)
Postproc #1 [meta_schedule.RewriteCooperativeFetch(0xf2ad228)]: 0 failure(s)
Postproc #2 [meta_schedule.RewriteUnboundBlock(0xf2ad258)]: 0 failure(s)
Postproc #3 [meta_schedule.RewriteParallelVectorizeUnroll(0x3d6c8e8)]: 0 failure(s)
Postproc #4 [meta_schedule.RewriteReductionBlock(0x4d449e8)]: 0 failure(s)
Postproc #5 [meta_schedule.VerifyGPUCode(0x45933f8)]: 1689600 failure(s)
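
To make the trend in the excerpt above easier to see, here is a minimal sketch that pulls the VerifyGPUCode failure counts out of the log text and prints the change between consecutive summaries. It is plain Python and not part of TVM; the names `LOG_TAIL`, `failure_counts`, and `deltas` are made up for illustration, and the regular expression assumes the log lines look exactly like the ones quoted here.

```python
import re

# The three summaries quoted above, reduced to their VerifyGPUCode lines.
LOG_TAIL = """\
2022-08-05 12:36:14.188 INFO Sample-Init-Population summary:
Postproc #5 [meta_schedule.VerifyGPUCode(0x45933f8)]: 1685504 failure(s)
2022-08-05 12:36:15.803 INFO Sample-Init-Population summary:
Postproc #5 [meta_schedule.VerifyGPUCode(0x45933f8)]: 1687552 failure(s)
2022-08-05 12:36:17.411 INFO Sample-Init-Population summary:
Postproc #5 [meta_schedule.VerifyGPUCode(0x45933f8)]: 1689600 failure(s)
"""

# Matches lines such as:
#   Postproc #5 [meta_schedule.VerifyGPUCode(0x45933f8)]: 1685504 failure(s)
# (assumed format, taken from the excerpt in this report)
FAILURE_RE = re.compile(
    r"Postproc #\d+ \[meta_schedule\.(\w+)\(0x[0-9a-f]+\)\]: (\d+) failure\(s\)"
)


def failure_counts(log_text, postproc="VerifyGPUCode"):
    """Return the failure counts reported for one postprocessor, in log order."""
    return [int(n) for name, n in FAILURE_RE.findall(log_text) if name == postproc]


def deltas(counts):
    """Return the difference between each pair of consecutive counts."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


counts = failure_counts(LOG_TAIL)
print(counts)          # [1685504, 1687552, 1689600]
print(deltas(counts))  # [2048, 2048]
```

In this excerpt the count rises by 2048 between summaries, which are roughly 1.6 seconds apart. That suggests, but does not prove, that the task keeps sampling candidates that VerifyGPUCode rejects rather than being blocked outright.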

slyubomirsky avatar Aug 05 '22 22:08 slyubomirsky