MIOpen

Op4dTensorGeneric kernel upgrade

Open novakovicdj opened this issue 1 year ago • 0 comments

This PR introduces a new, upgraded Op4dTensorGeneric kernel as part of porting kernels from OCL to HIP.

Below is a performance comparison (speed-ups and performance drops) between the new Op4dTensorGeneric kernel and the other OpTensor kernels used for 4D tensors.

This PR is opened as a draft for now. If everyone is OK with the new Op4dTensorGeneric kernel, I will update this PR and replace the old kernel with the new one.

Test cases were generated and run from the tensor_4d_generic_ocl_hip.cpp file; the largest tensor is 128MB.
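The B shapes named in the comparison headings (1C11, N111, NC11, NCH1, 1111, NCHW) describe how the second operand is broadcast against a full NCHW tensor. As a rough illustration only, here is a host-side C++ reference for that broadcast, assuming packed NCHW layout and plain addition. The names `Dims4`, `num_elements` and `add_broadcast_4d` are hypothetical and are not taken from MIOpen; the sketch leaves out the scaling factors and other operators that `miopenOpTensor` accepts.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not MIOpen code.
using Dims4 = std::array<std::size_t, 4>; // N, C, H, W

inline std::size_t num_elements(const Dims4& d) { return d[0] * d[1] * d[2] * d[3]; }

// Computes out = a + b for packed NCHW tensors, where every dimension of b
// is either 1 (broadcast) or equal to the matching dimension of a.
std::vector<float> add_broadcast_4d(const std::vector<float>& a, const Dims4& a_dims,
                                    const std::vector<float>& b, const Dims4& b_dims)
{
    assert(a.size() == num_elements(a_dims));
    assert(b.size() == num_elements(b_dims));
    for (int i = 0; i < 4; ++i)
        assert(b_dims[i] == 1 || b_dims[i] == a_dims[i]);

    // Packed strides of b, with stride 0 on broadcast dimensions so that
    // the same b element is reused along that axis.
    Dims4 b_strides{};
    std::size_t stride = 1;
    for (int i = 3; i >= 0; --i) {
        b_strides[i] = (b_dims[i] == 1) ? 0 : stride;
        stride *= b_dims[i];
    }

    std::vector<float> out(a.size());
    std::size_t idx = 0; // linear index into a and out
    for (std::size_t n = 0; n < a_dims[0]; ++n)
        for (std::size_t ch = 0; ch < a_dims[1]; ++ch)
            for (std::size_t h = 0; h < a_dims[2]; ++h)
                for (std::size_t w = 0; w < a_dims[3]; ++w, ++idx)
                    out[idx] = a[idx] + b[n * b_strides[0] + ch * b_strides[1] +
                                          h * b_strides[2] + w * b_strides[3]];
    return out;
}
```

With `b_dims = {1, C, 1, 1}` this is the 1C11 (bias) case: one value of B per channel is added to every element of that channel. With `b_dims` equal to `a_dims` it is the NCHW case with no broadcasting.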

New Op4dTensorGeneric - Old OpTensorFwdBias (B - 1C11 case)

  • 47502 test runs, float data type
  • On the whole test set, the average speed-up is x15.06
| Tensor size | Speed-up |
|---|---|
| size <= 32KB | 1.31 |
| 32KB < size <= 4MB | 8.5 |
| size > 4MB | 19.86 |

| Performance drop | % of test runs |
|---|---|
| more than 5% | 24.4 |
| more than 10% | 15.1 |
| more than 20% | 6.8 |
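The PR does not state how the two metrics in these tables were computed. The following is a minimal sketch under assumed definitions: speed-up is old time divided by new time, averaged over runs, and a "performance drop" is the relative increase of the new time over the old time. The `Timing` struct and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record of one test run (times in milliseconds).
struct Timing {
    double old_ms; // time of the old kernel
    double new_ms; // time of the new kernel
};

// Assumed definition: a value above 1 means the new kernel is faster.
inline double speedup(const Timing& t) { return t.old_ms / t.new_ms; }

// Arithmetic mean of the per-run speed-ups.
double average_speedup(const std::vector<Timing>& runs)
{
    assert(!runs.empty());
    double sum = 0.0;
    for (const Timing& t : runs)
        sum += speedup(t);
    return sum / static_cast<double>(runs.size());
}

// Percentage of runs in which the new kernel is slower than the old one
// by more than drop_percent (assumed to be measured relative to the old time).
double percent_runs_with_drop(const std::vector<Timing>& runs, double drop_percent)
{
    assert(!runs.empty());
    std::size_t slower = 0;
    for (const Timing& t : runs) {
        const double drop = (t.new_ms - t.old_ms) / t.old_ms * 100.0;
        if (drop > drop_percent)
            ++slower;
    }
    return 100.0 * static_cast<double>(slower) / static_cast<double>(runs.size());
}
```

Under these definitions a mean speed-up well above 1 can coexist with a noticeable share of slower runs, because a few large speed-ups dominate the mean.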

New Op4dTensorGeneric - Old OpTensorLeadingOnes (B - N111, NC11, NCH1, 1111)

  • 190009 test runs, float data type
  • On the whole test set, the average speed-up is x26.12
| Tensor size | Speed-up |
|---|---|
| size <= 32KB | 1.39 |
| 32KB < size <= 4MB | 12.69 |
| size > 4MB | 35.49 |

| Performance drop | % of test runs |
|---|---|
| more than 5% | 12.1 |
| more than 10% | 9.3 |
| more than 20% | 5.3 |

New Op4dTensorGeneric - Old Op4dTensorLite (B - NCHW)

  • Tried on 2750 and 7280 test runs, float data type
  • On the whole test set, the average speed-up is below 1 (~0.75), i.e. the new kernel is slower than Op4dTensorLite in this case

New Op4dTensorGeneric - Old Op4dTensorGeneric (B - all cases)

  • 760032 test runs, float data type
  • On the whole test set, the average speed-up is x29.58
| Tensor size | Speed-up |
|---|---|
| size <= 32KB | 1.95 |
| 32KB < size <= 4MB | 15.94 |
| size > 4MB | 39.39 |

| Performance drop | % of test runs |
|---|---|
| more than 5% | 3.1 |
| more than 10% | 1.8 |
| more than 20% | 0.4 |

novakovicdj, Jan 03 '25 14:01