Implement Diag Forward

Open cognaiger9 opened this issue 11 months ago • 0 comments

  • Added Diag Forward operations
  • Added driver test and gtest for Diag operations
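As a point of reference for what the forward operation computes on the 2-D inputs benchmarked below, here is a minimal pure-Python sketch. It assumes the `diagonal` argument follows `torch.diag` semantics (positive values select a diagonal above the main one, negative values below); the issue does not state this explicitly. The function name `diag_forward_2d` is hypothetical and is not part of MIOpen.

```python
def diag_forward_2d(matrix, diagonal=0):
    """Return the selected diagonal of a 2-D matrix (list of rows) as a list.

    Assumption: semantics mirror torch.diag for 2-D input, i.e.
    diagonal > 0 is above the main diagonal and diagonal < 0 is below it.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    if diagonal >= 0:
        # Element i of the output is matrix[i][i + diagonal].
        length = max(0, min(rows, cols - diagonal))
        return [matrix[i][i + diagonal] for i in range(length)]
    # For a negative offset, element i of the output is matrix[i - diagonal][i].
    length = max(0, min(rows + diagonal, cols))
    return [matrix[i - diagonal][i] for i in range(length)]


example = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12],
]
main_diag = diag_forward_2d(example)        # main diagonal
upper_diag = diag_forward_2d(example, 1)    # one above the main diagonal
lower_diag = diag_forward_2d(example, -1)   # one below the main diagonal
```

An offset that falls outside the matrix yields an empty list in this sketch; how the actual kernel treats that case is not described in the issue.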

The kernel is about 20% or more faster than ROCm only when the following constraints are met:

  • Number of tensor dimensions = 2
  • Number of elements in the input tensor > 4096576
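The two constraints above can be expressed as a simple shape check. The sketch below is illustrative only: the helper name `is_diag_forward_applicable` is hypothetical, and the actual applicability check in the MIOpen solver may be written differently. The threshold is the one quoted above, which happens to equal 2024 * 2024.

```python
import math

# Threshold quoted in the issue; equal to 2024 * 2024.
MIN_ELEMENTS = 4096576


def is_diag_forward_applicable(shape):
    """Hypothetical helper: True when the stated constraints hold.

    The tensor must be 2-D and have strictly more than MIN_ELEMENTS elements.
    """
    return len(shape) == 2 and math.prod(shape) > MIN_ELEMENTS


smallest_benchmarked = is_diag_forward_applicable((9016, 4048))
at_threshold = is_diag_forward_applicable((2024, 2024))
```

Every shape in the benchmark tables satisfies both conditions; a 2024 x 2024 tensor sits exactly at the threshold and is therefore excluded by the strict inequality.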

Detailed Benchmark

float16

| Ops name | dtype | size | contiguous | diagonal | direction | ROCm | MIOpen | Improvement |
|---|---|---|---|---|---|---|---|---|
| Diag | float16 | [9016 4048] | contiguous | -50 | fwd | 7808 | 6026 | 1.30 |
| Diag | float16 | [9016 4048] | noncontiguous | -50 | fwd | 8560 | 6026 | 1.42 |
| Diag | float16 | [9016 4048] | contiguous | 0 | fwd | 7280 | 6026 | 1.21 |
| Diag | float16 | [9016 4048] | noncontiguous | 0 | fwd | 8048 | 5991 | 1.34 |
| Diag | float16 | [9016 9016] | contiguous | -50 | fwd | 10112 | 6381 | 1.58 |
| Diag | float16 | [9016 9016] | noncontiguous | -50 | fwd | 10144 | 6470 | 1.57 |
| Diag | float16 | [9016 9016] | contiguous | 0 | fwd | 10464 | 6399 | 1.64 |
| Diag | float16 | [9016 9016] | noncontiguous | 0 | fwd | 10512 | 6452 | 1.63 |
| Diag | float16 | [18132 9016] | contiguous | -50 | fwd | 10608 | 6416 | 1.65 |
| Diag | float16 | [18132 9016] | noncontiguous | -50 | fwd | 12768 | 6452 | 1.98 |
| Diag | float16 | [18132 9016] | contiguous | 0 | fwd | 10368 | 6381 | 1.62 |
| Diag | float16 | [18132 9016] | noncontiguous | 0 | fwd | 12384 | 6363 | 1.95 |

float32

| Ops name | dtype | size | contiguous | diagonal | direction | ROCm | MIOpen | Improvement |
|---|---|---|---|---|---|---|---|---|
| Diag | float32 | [9016 4048] | contiguous | -50 | fwd | 8288 | 5937 | 1.40 |
| Diag | float32 | [9016 4048] | noncontiguous | -50 | fwd | 9888 | 5920 | 1.67 |
| Diag | float32 | [9016 4048] | contiguous | 0 | fwd | 7856 | 5991 | 1.31 |
| Diag | float32 | [9016 4048] | noncontiguous | 0 | fwd | 9728 | 5849 | 1.66 |
| Diag | float32 | [9016 9016] | contiguous | -50 | fwd | 13952 | 6523 | 2.14 |
| Diag | float32 | [9016 9016] | noncontiguous | -50 | fwd | 13280 | 6434 | 2.06 |
| Diag | float32 | [9016 9016] | contiguous | 0 | fwd | 14048 | 6666 | 2.11 |
| Diag | float32 | [9016 9016] | noncontiguous | 0 | fwd | 14064 | 6523 | 2.16 |
| Diag | float32 | [18132 9016] | contiguous | -50 | fwd | 14160 | 6523 | 2.17 |
| Diag | float32 | [18132 9016] | noncontiguous | -50 | fwd | 17184 | 6399 | 2.69 |
| Diag | float32 | [18132 9016] | contiguous | 0 | fwd | 13408 | 6541 | 2.05 |
| Diag | float32 | [18132 9016] | noncontiguous | 0 | fwd | 16576 | 6470 | 2.56 |
| Diag | float32 | [36264 18032] | contiguous | -50 | fwd | 19504 | 11057 | 1.76 |
| Diag | float32 | [36264 18032] | noncontiguous | -50 | fwd | 35632 | 13492 | 2.64 |
| Diag | float32 | [36264 18032] | contiguous | 0 | fwd | 19552 | 7484 | 2.61 |
| Diag | float32 | [36264 18032] | noncontiguous | 0 | fwd | 39248 | 13493 | 2.91 |

bfloat16

| Ops name | dtype | size | contiguous | diagonal | direction | ROCm | MIOpen | Improvement |
|---|---|---|---|---|---|---|---|---|
| Diag | bfloat16 | [9016 4048] | contiguous | 0 | fwd | 7040 | 6097 | 1.15 |
| Diag | bfloat16 | [9016 4048] | noncontiguous | 0 | fwd | 7904 | 6471 | 1.22 |
| Diag | bfloat16 | [9016 4048] | contiguous | 50 | fwd | 7136 | 5990 | 1.19 |
| Diag | bfloat16 | [9016 4048] | noncontiguous | 50 | fwd | 8064 | 5794 | 1.39 |
| Diag | bfloat16 | [9016 9016] | contiguous | 0 | fwd | 10320 | 6452 | 1.60 |
| Diag | bfloat16 | [9016 9016] | noncontiguous | 0 | fwd | 10208 | 6594 | 1.55 |
| Diag | bfloat16 | [9016 9016] | contiguous | 50 | fwd | 10384 | 6416 | 1.62 |
| Diag | bfloat16 | [9016 9016] | noncontiguous | 50 | fwd | 10272 | 6523 | 1.57 |
| Diag | bfloat16 | [18132 9016] | contiguous | 0 | fwd | 10416 | 6399 | 1.63 |
| Diag | bfloat16 | [18132 9016] | noncontiguous | 0 | fwd | 12784 | 6417 | 1.99 |
| Diag | bfloat16 | [18132 9016] | contiguous | 50 | fwd | 10608 | 6364 | 1.67 |
| Diag | bfloat16 | [18132 9016] | noncontiguous | 50 | fwd | 12304 | 6381 | 1.93 |
| Diag | bfloat16 | [36264 18032] | contiguous | 0 | fwd | 18048 | 7360 | 2.45 |
| Diag | bfloat16 | [36264 18032] | noncontiguous | 0 | fwd | 24224 | 7288 | 3.32 |
| Diag | bfloat16 | [36264 18032] | contiguous | 50 | fwd | 17248 | 7288 | 2.37 |
| Diag | bfloat16 | [36264 18032] | noncontiguous | 50 | fwd | 24416 | 7271 | 3.36 |

Average performance:

| dtype | fwd |
|---|---|
| float16 | 1.57 |
| float32 | 2.12 |
| bfloat16 | 1.88 |
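
The Improvement figures are consistent with dividing the ROCm value by the MIOpen value for each row, and the averages with a plain arithmetic mean of those per-row ratios. The sketch below reproduces the float16 numbers from the (ROCm, MIOpen) pairs in the float16 table; the variable names are illustrative, and the issue does not say which script produced the original figures.

```python
from statistics import mean

# (ROCm, MIOpen) pairs copied from the float16 rows of the benchmark, in order.
float16_rows = [
    (7808, 6026), (8560, 6026), (7280, 6026), (8048, 5991),
    (10112, 6381), (10144, 6470), (10464, 6399), (10512, 6452),
    (10608, 6416), (12768, 6452), (10368, 6381), (12384, 6363),
]

# Per-row improvement: ROCm figure divided by MIOpen figure, to two decimals.
float16_ratios = [round(rocm / miopen, 2) for rocm, miopen in float16_rows]

# Average improvement for the dtype: arithmetic mean of the per-row ratios.
float16_average = round(mean(float16_ratios), 2)
```

Under these assumptions `float16_average` comes out at 1.57, matching the reported float16 average.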

cognaiger9 • Feb 14 '25 02:02