Op4dTensorGeneric kernel upgrade
This PR introduces a new, upgraded Op4dTensorGeneric kernel as part of porting kernels from OCL to HIP.
Below is a performance comparison (speed-ups and performance drops) between the new Op4dTensorGeneric kernel and the other OpTensor kernels used for 4D tensors.
This PR is opened as a draft for now. If everyone is OK with the new Op4dTensorGeneric kernel, I will update this PR to replace the old kernel with the new one.
Test cases were generated and run from the tensor_4d_generic_ocl_hip.cpp file; the largest tensor is 128MB.
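For readers unfamiliar with the B-tensor shape labels used in the sections below (1C11, N111, NC11, NCH1, 1111, NCHW), here is a minimal host-side sketch of what a generic 4D tensor op has to handle. It assumes that a `1` in the label means B has length 1 in that dimension and is broadcast against A. The function name `AddBroadcast4d` and the `Dims4` alias are hypothetical, only addition is shown, and this is a CPU reference for illustration, not the HIP kernel from this PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Dims4 = std::array<std::size_t, 4>; // lengths in N, C, H, W order

// Hypothetical CPU reference (not the HIP kernel in this PR):
// returns A + B, where B may have length 1 in any dimension.
std::vector<float> AddBroadcast4d(const std::vector<float>& a,
                                  const Dims4& a_dims,
                                  const std::vector<float>& b,
                                  const Dims4& b_dims)
{
    // Packed NCHW strides for B; a broadcast dimension (length 1) gets stride 0.
    Dims4 b_strides{};
    std::size_t stride = 1;
    for(int d = 3; d >= 0; --d)
    {
        assert(b_dims[d] == 1 || b_dims[d] == a_dims[d]);
        b_strides[d] = (b_dims[d] == 1) ? 0 : stride;
        stride *= b_dims[d];
    }
    assert(a.size() == a_dims[0] * a_dims[1] * a_dims[2] * a_dims[3]);
    assert(b.size() == stride);

    std::vector<float> out(a.size());
    std::size_t i = 0; // A and the output are assumed packed NCHW
    for(std::size_t n = 0; n < a_dims[0]; ++n)
        for(std::size_t ch = 0; ch < a_dims[1]; ++ch)
            for(std::size_t h = 0; h < a_dims[2]; ++h)
                for(std::size_t w = 0; w < a_dims[3]; ++w, ++i)
                {
                    const std::size_t bi = n * b_strides[0] + ch * b_strides[1] +
                                           h * b_strides[2] + w * b_strides[3];
                    out[i] = a[i] + b[bi];
                }
    return out;
}
```

With A of shape 2x3x1x2 and B of shape 1x3x1x1 (the 1C11 case), each of the three B values is added to every element of the matching channel of A.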
New Op4dTensorGeneric - Old OpTensorFwdBias (B - 1C11 case)
- 47502 test runs, float data type
- Average speed-up over the whole test set is x15.06

| Tensor size | Speed-up |
|---|---|
| size <= 32KB | 1.31 |
| 32KB < size <= 4MB | 8.5 |
| size > 4MB | 19.86 |

| Performance drop | % of test runs |
|---|---|
| more than 5% | 24.4 |
| more than 10% | 15.1 |
| more than 20% | 6.8 |
New Op4dTensorGeneric - Old OpTensorLeadingOnes (B - N111, NC11, NCH1, 1111)
- 190009 test runs, float data type
- Average speed-up over the whole test set is x26.12

| Tensor size | Speed-up |
|---|---|
| size <= 32KB | 1.39 |
| 32KB < size <= 4MB | 12.69 |
| size > 4MB | 35.49 |

| Performance drop | % of test runs |
|---|---|
| more than 5% | 12.1 |
| more than 10% | 9.3 |
| more than 20% | 5.3 |
New Op4dTensorGeneric - Old Op4dTensorLite (B - NCHW)
- Tried on 2750 and 7280 test runs, float data type
- Average speed-up over the whole test set is below 1 (~0.75), i.e. the new kernel is slower than Op4dTensorLite in this case
New Op4dTensorGeneric - Old Op4dTensorGeneric (B - all cases)
- 760032 test runs, float data type
- Average speed-up over the whole test set is x29.58

| Tensor size | Speed-up |
|---|---|
| size <= 32KB | 1.95 |
| 32KB < size <= 4MB | 15.94 |
| size > 4MB | 39.39 |

| Performance drop | % of test runs |
|---|---|
| more than 5% | 3.1 |
| more than 10% | 1.8 |
| more than 20% | 0.4 |
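
For reference, here is a rough sketch of how summary figures like the ones above could be derived from per-run timings. Everything here is an assumption made for illustration and not taken from the test file: the `Run` and `Summary` types, the `Summarize` function, speed-up defined as old time divided by new time, the average taken as an arithmetic mean of per-run speed-ups, 1KB taken as 1024 bytes, and a "performance drop of more than p%" read as a speed-up below 1 - p/100.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical record of one test run: tensor size and both kernel timings.
struct Run
{
    std::size_t bytes;
    double old_ms;
    double new_ms;
};

struct Summary
{
    double mean_speedup = 0.0;
    std::array<double, 3> bucket_speedup{}; // <= 32KB, <= 4MB, > 4MB
    std::array<double, 3> drop_pct{};       // % of runs with drop > 5%, 10%, 20%
};

Summary Summarize(const std::vector<Run>& runs)
{
    constexpr std::size_t kSmall  = 32 * 1024;       // assumes 1KB = 1024 bytes
    constexpr std::size_t kMedium = 4 * 1024 * 1024;
    const std::array<double, 3> thresholds{0.05, 0.10, 0.20};

    Summary s;
    if(runs.empty())
        return s;

    std::array<double, 3> sum{};
    std::array<std::size_t, 3> count{};
    std::array<std::size_t, 3> drops{};
    double total = 0.0;

    for(const Run& r : runs)
    {
        // Assumed definition: values above 1 mean the new kernel is faster.
        const double speedup = r.old_ms / r.new_ms;
        const std::size_t bucket = r.bytes <= kSmall ? 0 : (r.bytes <= kMedium ? 1 : 2);
        total += speedup;
        sum[bucket] += speedup;
        ++count[bucket];
        for(std::size_t t = 0; t < thresholds.size(); ++t)
            if(speedup < 1.0 - thresholds[t])
                ++drops[t];
    }

    s.mean_speedup = total / static_cast<double>(runs.size());
    for(std::size_t k = 0; k < 3; ++k)
    {
        if(count[k] > 0)
            s.bucket_speedup[k] = sum[k] / static_cast<double>(count[k]);
        s.drop_pct[k] = 100.0 * static_cast<double>(drops[k]) /
                        static_cast<double>(runs.size());
    }
    return s;
}
```

Under these assumptions a large average speed-up and a non-trivial share of regressions can coexist, because a few very large per-run ratios dominate an arithmetic mean.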