MIOpen
Implement Diag Forward
- Added Diag Forward operations
- Added driver test and gtest for Diag operations
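For readers unfamiliar with the operation, here is a minimal CPU reference sketch of what a Diag forward pass computes. It assumes the semantics match `torch.diag` (a 2-D input yields the selected diagonal, a 1-D input yields a square matrix with the input on that diagonal); the PR does not state this explicitly. The function name `diag_forward` is hypothetical and is not part of the MIOpen API.

```python
def diag_forward(x, diagonal=0):
    """Reference sketch of Diag forward on nested Python lists.

    Assumed semantics (modelled on torch.diag, not taken from MIOpen):
      - 2-D input: return the elements on the requested diagonal.
      - 1-D input: return a square matrix with the input on that diagonal.
    diagonal > 0 selects a diagonal above the main one, < 0 below it.
    """
    if not x:
        return []

    # Starting row/column of the requested diagonal.
    row0 = max(-diagonal, 0)
    col0 = max(diagonal, 0)

    if isinstance(x[0], list):
        # 2-D case: walk the diagonal until a row or column runs out.
        rows, cols = len(x), len(x[0])
        length = max(min(rows - row0, cols - col0), 0)
        return [x[row0 + i][col0 + i] for i in range(length)]

    # 1-D case: build a zero matrix large enough to hold the shifted diagonal.
    size = len(x) + abs(diagonal)
    out = [[0] * size for _ in range(size)]
    for i, value in enumerate(x):
        out[row0 + i][col0 + i] = value
    return out
```

For example, `diag_forward([[1, 2, 3], [4, 5, 6], [7, 8, 9]], -1)` selects the diagonal just below the main one.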
The kernel is roughly 20% or more faster than ROCm only when the following constraints hold:
- tensor dim num = 2.
- number of elements in input tensor > 4096576
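The two constraints above can be expressed as a simple predicate. This is only an illustration of the stated conditions; the helper name `meets_speedup_constraints` is hypothetical and does not correspond to any function in MIOpen.

```python
from math import prod

# Threshold quoted in this PR: the input must have more elements than this.
MIN_NUMEL = 4096576


def meets_speedup_constraints(dims):
    """Return True if a tensor shape satisfies both constraints listed above.

    dims is a sequence of dimension lengths, e.g. [9016, 4048].
    """
    return len(dims) == 2 and prod(dims) > MIN_NUMEL
```

All benchmarked shapes below, starting from `[9016, 4048]`, satisfy this predicate.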
Detailed Benchmark
float16
| Ops name | dtype | size | contiguous | diagonal | direction | ROCm | MIOpen | Improvement |
|---|---|---|---|---|---|---|---|---|
| Diag | float16 | [9016 4048] | contiguous | -50 | fwd | 7808 | 6026 | 1.30 |
| Diag | float16 | [9016 4048] | noncontiguous | -50 | fwd | 8560 | 6026 | 1.42 |
| Diag | float16 | [9016 4048] | contiguous | 0 | fwd | 7280 | 6026 | 1.21 |
| Diag | float16 | [9016 4048] | noncontiguous | 0 | fwd | 8048 | 5991 | 1.34 |
| Diag | float16 | [9016 9016] | contiguous | -50 | fwd | 10112 | 6381 | 1.58 |
| Diag | float16 | [9016 9016] | noncontiguous | -50 | fwd | 10144 | 6470 | 1.57 |
| Diag | float16 | [9016 9016] | contiguous | 0 | fwd | 10464 | 6399 | 1.64 |
| Diag | float16 | [9016 9016] | noncontiguous | 0 | fwd | 10512 | 6452 | 1.63 |
| Diag | float16 | [18132 9016] | contiguous | -50 | fwd | 10608 | 6416 | 1.65 |
| Diag | float16 | [18132 9016] | noncontiguous | -50 | fwd | 12768 | 6452 | 1.98 |
| Diag | float16 | [18132 9016] | contiguous | 0 | fwd | 10368 | 6381 | 1.62 |
| Diag | float16 | [18132 9016] | noncontiguous | 0 | fwd | 12384 | 6363 | 1.95 |
float32
| Ops name | dtype | size | contiguous | diagonal | direction | ROCm | MIOpen | Improvement |
|---|---|---|---|---|---|---|---|---|
| Diag | float32 | [9016 4048] | contiguous | -50 | fwd | 8288 | 5937 | 1.40 |
| Diag | float32 | [9016 4048] | noncontiguous | -50 | fwd | 9888 | 5920 | 1.67 |
| Diag | float32 | [9016 4048] | contiguous | 0 | fwd | 7856 | 5991 | 1.31 |
| Diag | float32 | [9016 4048] | noncontiguous | 0 | fwd | 9728 | 5849 | 1.66 |
| Diag | float32 | [9016 9016] | contiguous | -50 | fwd | 13952 | 6523 | 2.14 |
| Diag | float32 | [9016 9016] | noncontiguous | -50 | fwd | 13280 | 6434 | 2.06 |
| Diag | float32 | [9016 9016] | contiguous | 0 | fwd | 14048 | 6666 | 2.11 |
| Diag | float32 | [9016 9016] | noncontiguous | 0 | fwd | 14064 | 6523 | 2.16 |
| Diag | float32 | [18132 9016] | contiguous | -50 | fwd | 14160 | 6523 | 2.17 |
| Diag | float32 | [18132 9016] | noncontiguous | -50 | fwd | 17184 | 6399 | 2.69 |
| Diag | float32 | [18132 9016] | contiguous | 0 | fwd | 13408 | 6541 | 2.05 |
| Diag | float32 | [18132 9016] | noncontiguous | 0 | fwd | 16576 | 6470 | 2.56 |
| Diag | float32 | [36264 18032] | contiguous | -50 | fwd | 19504 | 11057 | 1.76 |
| Diag | float32 | [36264 18032] | noncontiguous | -50 | fwd | 35632 | 13492 | 2.64 |
| Diag | float32 | [36264 18032] | contiguous | 0 | fwd | 19552 | 7484 | 2.61 |
| Diag | float32 | [36264 18032] | noncontiguous | 0 | fwd | 39248 | 13493 | 2.91 |
bfloat16
| Ops name | dtype | size | contiguous | diagonal | direction | ROCm | MIOpen | Improvement |
|---|---|---|---|---|---|---|---|---|
| Diag | bfloat16 | [9016 4048] | contiguous | 0 | fwd | 7040 | 6097 | 1.15 |
| Diag | bfloat16 | [9016 4048] | noncontiguous | 0 | fwd | 7904 | 6471 | 1.22 |
| Diag | bfloat16 | [9016 4048] | contiguous | 50 | fwd | 7136 | 5990 | 1.19 |
| Diag | bfloat16 | [9016 4048] | noncontiguous | 50 | fwd | 8064 | 5794 | 1.39 |
| Diag | bfloat16 | [9016 9016] | contiguous | 0 | fwd | 10320 | 6452 | 1.60 |
| Diag | bfloat16 | [9016 9016] | noncontiguous | 0 | fwd | 10208 | 6594 | 1.55 |
| Diag | bfloat16 | [9016 9016] | contiguous | 50 | fwd | 10384 | 6416 | 1.62 |
| Diag | bfloat16 | [9016 9016] | noncontiguous | 50 | fwd | 10272 | 6523 | 1.57 |
| Diag | bfloat16 | [18132 9016] | contiguous | 0 | fwd | 10416 | 6399 | 1.63 |
| Diag | bfloat16 | [18132 9016] | noncontiguous | 0 | fwd | 12784 | 6417 | 1.99 |
| Diag | bfloat16 | [18132 9016] | contiguous | 50 | fwd | 10608 | 6364 | 1.67 |
| Diag | bfloat16 | [18132 9016] | noncontiguous | 50 | fwd | 12304 | 6381 | 1.93 |
| Diag | bfloat16 | [36264 18032] | contiguous | 0 | fwd | 18048 | 7360 | 2.45 |
| Diag | bfloat16 | [36264 18032] | noncontiguous | 0 | fwd | 24224 | 7288 | 3.32 |
| Diag | bfloat16 | [36264 18032] | contiguous | 50 | fwd | 17248 | 7288 | 2.37 |
| Diag | bfloat16 | [36264 18032] | noncontiguous | 50 | fwd | 24416 | 7271 | 3.36 |
Average performance:
| dtype | fwd |
|---|---|
| float16 | 1.57 |
| float32 | 2.12 |
| bfloat16 | 1.88 |
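The Improvement column and the averages are consistent with a plain ratio of the ROCm figure to the MIOpen figure, averaged per dtype. The sketch below reproduces that arithmetic for the float16 rows; the function names are illustrative, and the assumption that the averages are arithmetic means of the rounded per-row ratios is inferred from the numbers, not stated in the PR.

```python
def improvement(rocm, miopen):
    """Speedup of MIOpen over ROCm, rounded to two decimals as in the tables."""
    return round(rocm / miopen, 2)


def average_improvement(ratios):
    """Arithmetic mean of per-row improvements, rounded to two decimals."""
    return round(sum(ratios) / len(ratios), 2)


# (ROCm, MIOpen) pairs copied from the float16 table above.
FLOAT16_ROWS = [
    (7808, 6026), (8560, 6026), (7280, 6026), (8048, 5991),
    (10112, 6381), (10144, 6470), (10464, 6399), (10512, 6452),
    (10608, 6416), (12768, 6452), (10368, 6381), (12384, 6363),
]

float16_ratios = [improvement(rocm, miopen) for rocm, miopen in FLOAT16_ROWS]
float16_average = average_improvement(float16_ratios)
```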