Experiment with XNACK+ on MI210

Open junliume opened this issue 10 months ago • 13 comments

junliume avatar Apr 01 '24 18:04 junliume

@junliume We need some info from @Kirpich30000 about configs that can be enabled for Winograd when XNACK is ON. There are some, at least. We discussed this topic with Ilya yesterday.

atamazov avatar Apr 12 '24 18:04 atamazov

Related issue: #2865

atamazov avatar Apr 12 '24 20:04 atamazov

@cderb it seems that somehow #2870 is not effective in this PR's CI?

junliume avatar Apr 13 '24 05:04 junliume

> @cderb it seems that somehow #2870 is not effective in this PR's CI?

I'll clean up here. Starting debug.

cderb avatar Apr 15 '24 18:04 cderb

> > @cderb it seems that somehow #2870 is not effective in this PR's CI?
>
> I'll clean up here. Starting debug.

@cderb a few tests are still failing the stage

junliume avatar Apr 16 '24 05:04 junliume

@junliume It appears there are additional tests above what was addressed in #2870 which will hang when xnack is set manually in the environment by HSA_XNACK=1.
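A minimal sketch of how such a hang could be told apart from an ordinary failure: run a command with `HSA_XNACK=1` in its environment and treat a timeout as a hang. The helper name `run_with_xnack`, the timeout values, and the stand-in command are assumptions for illustration only; none of this is MIOpen's CI code.

```python
import os
import subprocess
import sys


def run_with_xnack(cmd, timeout_s=600):
    """Run cmd with HSA_XNACK=1 set; return (status, stdout).

    status is "passed", "failed", or "hang" (no exit within timeout_s).
    Hypothetical helper, not part of MIOpen or ROCm.
    """
    # Copy the current environment and force XNACK on for the child only.
    env = dict(os.environ, HSA_XNACK="1")
    try:
        proc = subprocess.run(
            cmd, env=env, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return "hang", ""
    return ("passed" if proc.returncode == 0 else "failed"), proc.stdout


if __name__ == "__main__":
    # Stand-in command that only echoes the variable; replace it with a real
    # test binary from your own build tree.
    echo = [sys.executable, "-c", "import os; print(os.environ['HSA_XNACK'])"]
    status, out = run_with_xnack(echo, timeout_s=30)
    print(status, out.strip())
```

Setting the variable only for the child process keeps the rest of the session unaffected, which makes it easier to compare a run with and without XNACK.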

cderb avatar Apr 16 '24 21:04 cderb

> @junliume It appears there are additional tests above what was addressed in #2870 which will hang when xnack is set manually in the environment by HSA_XNACK=1.

@cderb should we skip them?

junliume avatar Apr 16 '24 21:04 junliume

> > @junliume It appears there are additional tests above what was addressed in #2870 which will hang when xnack is set manually in the environment by HSA_XNACK=1.
>
> @cderb should we skip them?

Looks like there are 14 tests affected. I'm not sure whether to skip them or not.

[2024-04-15T23:11:47.404Z] 	  1 - test_activation (Failed)
[2024-04-15T23:11:47.404Z] 	  5 - test_bn_peract_test (Failed)
[2024-04-15T23:11:47.404Z] 	  6 - test_bn_spatial_test (Failed)
[2024-04-15T23:11:47.404Z] 	 16 - test_ctc (Failed)
[2024-04-15T23:11:47.404Z] 	 33 - test_lrn_test (Failed)
[2024-04-15T23:11:47.404Z] 	 40 - test_pooling2d (Failed)
[2024-04-15T23:11:47.404Z] 	 41 - test_pooling3d (Failed)
[2024-04-15T23:11:47.404Z] 	 47 - test_soft_max (Failed)
[2024-04-15T23:11:47.404Z] 	 50 - test_sqlite_perfdb (Failed)
[2024-04-15T23:11:47.404Z] 	561 - CBAFind2InferSolverTest/ConvBiasActivFind2InferTestFloat.ConvBinWinogradRxSFind2Fused/(3, (N: 64 C:256 H:56 W:56 k: 64 y:1 x:1 pad_y:0 pad_x:0 stride_y:1 stride_x:1 dilation_y:1 dilation_x:1 conv_mode:0 ), 0) (Subprocess aborted)
[2024-04-15T23:11:47.405Z] 	594 - CBAFind2InferSolverTest/ConvBiasActivFind2InferTestFloat.ConvBinWinogradRxSf2x3g1Find2Fused/(3, (N: 64 C:64 H:56 W:56 k: 64 y:3 x:3 pad_y:1 pad_x:1 stride_y:1 stride_x:1 dilation_y:1 dilation_x:1 conv_mode:0 ), 0) (Failed)
x23 variations
[2024-04-15T23:11:47.405Z] 	711 - CBAInferSolverTest/ConvBiasActivInferTestFloat.ConvBinWinogradRxSf2x3g1Fused/(3, (N: 64 C:64 H:56 W:56 k: 64 y:3 x:3 pad_y:1 pad_x:1 stride_y:1 stride_x:1 dilation_y:1 dilation_x:1 conv_mode:0 ), 0) (Failed)
x23 variations
[2024-04-15T23:11:47.405Z] 	735 - CBAInferSolverTest/ConvBiasActivInferTestFloatFusionCompileStep.ConvBiasActivAsm1x1UFloat_testCompile/(3, (N: 1 C:64 H:56 W:56 k: 64 y:1 x:1 pad_y:0 pad_x:0 stride_y:1 stride_x:1 dilation_y:1 dilation_x:1 conv_mode:0 ), 0) (Failed)
[2024-04-15T23:11:47.405Z] 	736 - CBAInferSolverTest/ConvBiasActivInferTestFloatFusionCompileStep.ConvBiasActivAsm1x1UFloat_testCompile/(3, (N: 1 C:64 H:56 W:56 k: 64 y:3 x:3 pad_y:1 pad_x:1 stride_y:1 stride_x:1 dilation_y:1 dilation_x:1 conv_mode:0 ), 0) (Failed)
[2024-04-15T23:11:47.405Z] 	3751 - GroupNormTestSet/GroupNormTestFloat.GroupNormTestFw/ N:512 C:32 D:12 H:12 W:12 num_groups:4 eps:1e-05 mode:1 (Subprocess aborted)
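
A rough sketch of how a failure summary like the one above could be tallied by status. The regular expression is an assumption based only on the line shape visible in this log (`[timestamp] number - name (status)`); the function name `parse_failures` is hypothetical, and lines that do not fit that shape, such as "x23 variations", are skipped.

```python
import re
from collections import Counter

# Matches "[timestamp]  <number> - <test name> (<status>)". The test name may
# itself contain parentheses, so the status is anchored at the end of the line.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<number>\d+) - (?P<name>.+?) "
    r"\((?P<status>Failed|Subprocess aborted)\)\s*$"
)


def parse_failures(log_text):
    """Return a list of dicts (number, name, status) for each matching line."""
    failures = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a per-test summary line
        failures.append(
            {
                "number": int(m.group("number")),
                "name": m.group("name"),
                "status": m.group("status"),
            }
        )
    return failures


if __name__ == "__main__":
    sample = """\
[2024-04-15T23:11:47.404Z]   1 - test_activation (Failed)
[2024-04-15T23:11:47.404Z]  16 - test_ctc (Failed)
x23 variations
[2024-04-15T23:11:47.405Z] 3751 - GroupNormTestSet/GroupNormTestFloat.GroupNormTestFw/ N:512 C:32 D:12 H:12 W:12 num_groups:4 eps:1e-05 mode:1 (Subprocess aborted)
"""
    found = parse_failures(sample)
    print(Counter(f["status"] for f in found))
```

Separating "Failed" from "Subprocess aborted" may help when deciding which tests hang or crash outright and which ones run to completion with a wrong result.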

cderb avatar Apr 16 '24 21:04 cderb

@junliume

> @cderb should we skip them?

No, these should work

atamazov avatar Apr 17 '24 16:04 atamazov

Issue appears to be MI200 specific. Same tests are passing on MI300.

cderb avatar Apr 26 '24 21:04 cderb

The cause appears to be that the GPU is asleep during the copy and not waking back up when it should. Changing the grub options allowed these tests to pass on my test machine. https://github.com/ROCm/ROCm/issues/2418#issuecomment-1702415574

This was using ROCm 6.1.0-82.

cderb avatar May 01 '24 22:05 cderb

xnack+ make check now passing on machine with modified grub.

cderb avatar May 07 '24 16:05 cderb

> xnack+ make check now passing on machine with modified grub.

Hopefully we can make it pass in the CI stage as well. Do we know why it is still failing? http://micimaster.amd.com/blue/organizations/jenkins/MLLIBS%2FMIOpen/detail/xnack/15/pipeline

junliume avatar May 07 '24 21:05 junliume