MIOpen
Experiment with XNACK+ on MI210
@junliume We need some info from @Kirpich30000 about configs that can be enabled for Winograd when XNACK is ON. There are some, at least. We discussed this topic with Ilya yesterday.
Related issue: #2865
@cderb it seems that somehow #2870 is not effective in this PR's CI?
I'll clean up here. Starting debug.
@cderb a few tests are still failing the stage
@junliume It appears that additional tests, beyond those addressed in #2870, hang when XNACK is enabled manually through the environment with HSA_XNACK=1.
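A minimal sketch of how a hang could be told apart from an ordinary failure when reproducing this locally: run the test command with HSA_XNACK=1 set only for the child process and enforce a wall-clock timeout. The helper name `run_with_xnack`, the "hung" label, and the default timeout are assumptions made for illustration; none of them come from MIOpen or its CI scripts.

```python
import os
import subprocess


def run_with_xnack(cmd, timeout_s=600.0):
    """Run cmd with HSA_XNACK=1 and classify the outcome.

    Returns "passed", "failed", or "hung" (no exit within timeout_s).
    This is a hypothetical helper, not part of MIOpen.
    """
    # Copy the current environment and enable XNACK for the child only.
    env = dict(os.environ, HSA_XNACK="1")
    try:
        proc = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising on timeout.
        return "hung"
    return "passed" if proc.returncode == 0 else "failed"
```

From a build directory, something like `run_with_xnack(["ctest", "-R", "test_activation"], timeout_s=300)` would then report "hung" instead of blocking indefinitely. Only the direct child is killed on timeout, so test processes spawned by it may need separate cleanup.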
@cderb should we skip them?
Looks like there are 14 tests affected. I'm not sure whether to skip them or not.
[2024-04-15T23:11:47.404Z] 1 - test_activation (Failed)
[2024-04-15T23:11:47.404Z] 5 - test_bn_peract_test (Failed)
[2024-04-15T23:11:47.404Z] 6 - test_bn_spatial_test (Failed)
[2024-04-15T23:11:47.404Z] 16 - test_ctc (Failed)
[2024-04-15T23:11:47.404Z] 33 - test_lrn_test (Failed)
[2024-04-15T23:11:47.404Z] 40 - test_pooling2d (Failed)
[2024-04-15T23:11:47.404Z] 41 - test_pooling3d (Failed)
[2024-04-15T23:11:47.404Z] 47 - test_soft_max (Failed)
[2024-04-15T23:11:47.404Z] 50 - test_sqlite_perfdb (Failed)
[2024-04-15T23:11:47.404Z] 561 - CBAFind2InferSolverTest/ConvBiasActivFind2InferTestFloat.ConvBinWinogradRxSFind2Fused/(3, (N: 64 C:256 H:56 W:56 k: 64 y:1 x:1 pad_y:0 pad_x:0 stride_y:1 stride_x:1 dilation_y:1 dilation_x:1 conv_mode:0 ), 0) (Subprocess aborted)
[2024-04-15T23:11:47.405Z] 594 - CBAFind2InferSolverTest/ConvBiasActivFind2InferTestFloat.ConvBinWinogradRxSf2x3g1Find2Fused/(3, (N: 64 C:64 H:56 W:56 k: 64 y:3 x:3 pad_y:1 pad_x:1 stride_y:1 stride_x:1 dilation_y:1 dilation_x:1 conv_mode:0 ), 0) (Failed)
x23 variations
[2024-04-15T23:11:47.405Z] 711 - CBAInferSolverTest/ConvBiasActivInferTestFloat.ConvBinWinogradRxSf2x3g1Fused/(3, (N: 64 C:64 H:56 W:56 k: 64 y:3 x:3 pad_y:1 pad_x:1 stride_y:1 stride_x:1 dilation_y:1 dilation_x:1 conv_mode:0 ), 0) (Failed)
x23 variations
[2024-04-15T23:11:47.405Z] 735 - CBAInferSolverTest/ConvBiasActivInferTestFloatFusionCompileStep.ConvBiasActivAsm1x1UFloat_testCompile/(3, (N: 1 C:64 H:56 W:56 k: 64 y:1 x:1 pad_y:0 pad_x:0 stride_y:1 stride_x:1 dilation_y:1 dilation_x:1 conv_mode:0 ), 0) (Failed)
[2024-04-15T23:11:47.405Z] 736 - CBAInferSolverTest/ConvBiasActivInferTestFloatFusionCompileStep.ConvBiasActivAsm1x1UFloat_testCompile/(3, (N: 1 C:64 H:56 W:56 k: 64 y:3 x:3 pad_y:1 pad_x:1 stride_y:1 stride_x:1 dilation_y:1 dilation_x:1 conv_mode:0 ), 0) (Failed)
[2024-04-15T23:11:47.405Z] 3751 - GroupNormTestSet/GroupNormTestFloat.GroupNormTestFw/ N:512 C:32 D:12 H:12 W:12 num_groups:4 eps:1e-05 mode:1 (Subprocess aborted)
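To help decide which tests to look at first, summary lines like the ones above can be grouped by status ("Failed" versus "Subprocess aborted"). The sketch below assumes the line shape seen in this log (optional timestamp, test number, " - ", test name, status in trailing parentheses); the function names are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as
# "[2024-04-15T23:11:47.404Z] 1 - test_activation (Failed)".
# The test name may itself contain parentheses, so the status is taken
# from the last parenthesised group on the line.
_LINE = re.compile(
    r"^(?:\[[^\]]*\]\s*)?(?P<num>\d+) - (?P<name>.+) \((?P<status>[^()]+)\)\s*$"
)


def parse_failures(log_text):
    """Return (test_number, test_name, status) for each summary line."""
    rows = []
    for line in log_text.splitlines():
        m = _LINE.match(line.strip())
        if m:  # lines like "x23 variations" are skipped
            rows.append((int(m["num"]), m["name"], m["status"]))
    return rows


def count_by_status(rows):
    """Count parsed rows per status string."""
    return Counter(status for _, _, status in rows)
```

Calling `count_by_status(parse_failures(log_text))` on the pasted log separates tests that reported a failure from those whose subprocess aborted.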
@junliume No, these tests should not be skipped; they should work.
The issue appears to be MI200-specific. The same tests pass on MI300.
The cause appears to be that the GPU is asleep during the copy and does not wake up when it should. Changing the GRUB options as described in https://github.com/ROCm/ROCm/issues/2418#issuecomment-1702415574 allowed these tests to pass on my test machine.
This was with ROCm 6.1.0-82.
xnack+ make check is now passing on the machine with the modified GRUB options.
Hopefully we can make it pass in the CI stage as well. Do we know why it is still failing? http://micimaster.amd.com/blue/organizations/jenkins/MLLIBS%2FMIOpen/detail/xnack/15/pipeline