Op4dTensorGeneric kernel upgrade

Open novakovicdj opened this issue 1 year ago • 0 comments

This PR introduces a new, upgraded Op4dTensorGeneric kernel as part of porting kernels from OCL to HIP.

Below is a performance comparison (speed-ups and performance drops) between the new Op4dTensorGeneric kernel and the other OpTensor kernels used for 4d tensors.

This PR is opened as a draft for now. If everyone is OK with the new Op4dTensorGeneric kernel, I will update this PR and replace the old kernel with the new one.

Test cases are generated and run from the tensor_4d_generic_ocl_hip.cpp file; the largest tensor is 128MB.
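For readers unfamiliar with the B-shape notation used in the comparisons (1C11, N111, NC11, NCH1, 1111, NCHW), here is a minimal CPU sketch of what a broadcast 4d tensor op computes. It assumes that a shape such as 1C11 means B has dimensions 1 x C x 1 x 1 and is broadcast against a packed NCHW tensor A, and it uses plain addition as the op. The names `Dims4`, `numel`, and `broadcast_add_4d` are hypothetical and not taken from this PR; this is not the kernel code itself.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: tensor lengths in N, C, H, W order.
using Dims4 = std::array<std::size_t, 4>;

// Number of elements in a packed 4d tensor.
inline std::size_t numel(const Dims4& d) { return d[0] * d[1] * d[2] * d[3]; }

// CPU reference for C = A + B, where every B dimension is either 1
// (broadcast) or equal to the matching A dimension. Assumes packed layout.
inline std::vector<float> broadcast_add_4d(const std::vector<float>& a, const Dims4& a_dims,
                                           const std::vector<float>& b, const Dims4& b_dims)
{
    assert(a.size() == numel(a_dims) && b.size() == numel(b_dims));
    for (std::size_t i = 0; i < 4; ++i)
        assert(b_dims[i] == 1 || b_dims[i] == a_dims[i]);

    // Packed strides of B. A stride of 0 on a size-1 dimension makes every
    // index along that dimension read the same B element (the broadcast).
    Dims4 bs{};
    std::size_t stride = 1;
    for (int i = 3; i >= 0; --i) {
        bs[i] = (b_dims[i] == 1) ? 0 : stride;
        stride *= b_dims[i];
    }

    std::vector<float> out(a.size());
    std::size_t idx = 0; // linear index into packed A and the output
    for (std::size_t n = 0; n < a_dims[0]; ++n)
        for (std::size_t c = 0; c < a_dims[1]; ++c)
            for (std::size_t h = 0; h < a_dims[2]; ++h)
                for (std::size_t w = 0; w < a_dims[3]; ++w, ++idx)
                    out[idx] = a[idx] + b[n * bs[0] + c * bs[1] + h * bs[2] + w * bs[3]];
    return out;
}
```

Under these assumptions, calling `broadcast_add_4d` with `b_dims = {1, C, 1, 1}` corresponds to the 1C11 (bias-like) case, and `b_dims` equal to `a_dims` corresponds to the NCHW case.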

New Op4dTensorGeneric - Old OpTensorFwdBias (B - 1C11 case)

47502 test runs, float data type
On the whole test set, the average speed-up is x15.06

Tensor size	Speed-up
size <= 32KB	1.31
32KB < size <= 4MB	8.5
size > 4MB	19.86

Performance drop	% of test runs
more than 5%	24.4
more than 10%	15.1
more than 20%	6.8
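As a rough illustration of how the numbers in these tables could be derived from per-test timings, here is a small sketch. It rests on two assumptions that the PR does not state: speed-up is the old kernel time divided by the new kernel time, averaged arithmetically over runs, and a "performance drop of more than X%" means the new kernel took more than X% longer than the old one. The `Timing` struct and both function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record of one test run: elapsed time of the old and new kernel.
struct Timing {
    double old_ms;
    double new_ms;
};

// Arithmetic mean of per-run speed-ups (old time / new time).
// Returns 0 for an empty set.
inline double mean_speedup(const std::vector<Timing>& runs)
{
    if (runs.empty()) return 0.0;
    double sum = 0.0;
    for (const Timing& t : runs)
        sum += t.old_ms / t.new_ms;
    return sum / static_cast<double>(runs.size());
}

// Percentage of runs in which the new kernel is slower than the old one
// by more than threshold_percent. Returns 0 for an empty set.
inline double percent_slower_than(const std::vector<Timing>& runs, double threshold_percent)
{
    if (runs.empty()) return 0.0;
    std::size_t count = 0;
    for (const Timing& t : runs)
        if (t.new_ms > t.old_ms * (1.0 + threshold_percent / 100.0))
            ++count;
    return 100.0 * static_cast<double>(count) / static_cast<double>(runs.size());
}
```

With a definition like this, a large average speed-up and a non-trivial share of regressions can coexist, because a few very large speed-ups on big tensors dominate an arithmetic mean.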

New Op4dTensorGeneric - Old OpTensorLeadingOnes (B - N111, NC11, NCH1, 1111)

190009 test runs, float data type
On the whole test set, the average speed-up is x26.12

Tensor size	Speed-up
size <= 32KB	1.39
32KB < size <= 4MB	12.69
size > 4MB	35.49

Performance drop	% of test runs
more than 5%	12.1
more than 10%	9.3
more than 20%	5.3

New Op4dTensorGeneric - Old Op4dTensorLite (B - NCHW)

Tried on 2750 and 7280 test runs, float data type
On the whole test set, the average speed-up is below 1 (~0.75), i.e. the new kernel is slower than Op4dTensorLite in this case

New Op4dTensorGeneric - Old Op4dTensorGeneric (B - all cases)

760032 test runs, float data type
On the whole test set, the average speed-up is x29.58

Tensor size	Speed-up
size <= 32KB	1.95
32KB < size <= 4MB	15.94
size > 4MB	39.39

Performance drop	% of test runs
more than 5%	3.1
more than 10%	1.8
more than 20%	0.4

Jan 03 '25 14:01 novakovicdj