
OpenCL version?

Open FuriouslyCurious opened this issue 7 years ago • 20 comments

Hi Fabian, I couldn't figure out which version of OpenCL libDNN supports. Is it 1.1 or 2.x?

Thanks!

PS: I hope AMD employees @fsword73 and @dagamayank can chip in some optimized code to make libDNN a fast replacement for cuDNN.

FuriouslyCurious avatar Jul 19 '16 07:07 FuriouslyCurious

@campbx It has OpenCL 1.1 support.

It is not fully optimized yet, but it's faster than any im2col/col2im (explicit GEMM) implementation.
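
If you want to check what your own platform/driver actually reports, here is a minimal host-side sketch (plain C against the standard OpenCL API, error handling omitted) that queries the device version string:

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
  cl_platform_id platform;
  cl_device_id device;
  char version[256];

  /* Grab the first platform and the first GPU device on it. */
  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

  /* Prints something like "OpenCL 1.2 AMD-APP (...)". */
  clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(version), version, NULL);
  printf("Device OpenCL version: %s\n", version);
  return 0;
}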

naibaf7 avatar Jul 19 '16 10:07 naibaf7

I don't know if AMD would be interested in adding a transparent backend for HSA like this. Currently libdnn relies on ViennaCL, and there was an official HSA backend initiative, but I don't think it was upstreamed.

bhack avatar Jul 19 '16 12:07 bhack

/cc @gstoner

bhack avatar Jul 24 '16 10:07 bhack

Post ROCm 1.3 release, we will be putting out a developer release of the OpenCL Language Runtime and Compiler on ROCm. This will be on our new native GCN ISA compiler. We are holding to our promise to make the stack open source. We had a lot of work to do around OpenCL to make this happen.

It is a big shift for us, since we are no longer leveraging our historical two-stage compiler architecture:

  • The first pass was a high-level LLVM-based compiler which did code generation to an IL (AMDIL or HSAIL); NVIDIA has PTX.
    • The second pass took the binary of this IL and compiled it via our proprietary shader compiler.

The LLVM native GCN ISA code generator has already been upstreamed: http://llvm.org/docs/AMDGPUUsage.html. We have also released the device libs at https://github.com/RadeonOpenCompute/ROCm-Device-Libs, where you will already find the math intrinsics for OpenCL.

You will also see we are active in Clang OpenCL development.

You will also find we have moved to standardize the code object loader and the API for the compiler: https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc

There are lots of pieces we had to pull back together and clean up for the community.

On the ViennaCL lib, there was a graduate-school port to the HSA Runtime with HSAIL code generation as the backend. We have not seen much progress on it.

gstoner avatar Sep 25 '16 17:09 gstoner

@gstoner Thanks for the update. I'll be looking into it. There is certainly interest in making LibDNN compatible with new platforms as well.

ViennaCL is no strict requirement; the CUDA backend for LibDNN bypasses ViennaCL and is used natively. The same would be possible with HSA.

naibaf7 avatar Sep 25 '16 17:09 naibaf7

We looked at it a lot; a ViennaCL HIP port might be another way to attach to the platform.

gstoner avatar Sep 25 '16 18:09 gstoner

@bhack @gstoner Just as a heads-up, I'm currently writing pooling kernels for LibDNN. I found that there is also a performance benefit (in the 10%-20% range for total AlexNet forwarding time) from using an optimized kernel for each tuple of pooling parameters instead of a single all-purpose dimension-iterative kernel, especially when going 3D.

If you look at the following kernel, it's obvious why (lots of branching, low ILP, no reuse of computed offsets over the feature maps and batch size, many registers lost to arrays, fixed to a maximum of 6 spatial dimensions, etc.):

__kernel void TEMPLATE(max_pool_forward_nd, Dtype)(const int_tp n,
                                                   const int_tp num_axes,
                                                   __global const Dtype* bottom_data,
                                                   const int_tp channels,
                                                   __global const int_tp* size,
                                                   __global const int_tp* pooled_size,
                                                   __global const int_tp* kernel_size,
                                                   __global const int_tp* ext_kernel_size,
                                                   __global const int_tp* stride,
                                                   __global const int_tp* dilation,
                                                   __global const int_tp* pad,
                                                   __global Dtype* top_data,
                                                   const int use_mask,
                                                   __global int_tp* mask, __global Dtype* top_mask) {
  int_tp d_idx[6];
  int_tp d_start[6];
  int_tp d_end[6];
  int_tp d_iter[6];
  int_tp i;

  for (int_tp index = get_global_id(0); index < n; index += get_global_size(0)) {
    int_tp offset = 1;
    int_tp num = index;

    bool do_continue = false;

    // Decompose the flat output index into per-axis window start/end coordinates.
    for (i = num_axes - 1; i >= 0; --i) {
      d_idx[i] = num % pooled_size[i];
      d_start[i] = d_idx[i] * stride[i] - pad[i];
      d_end[i] = min(d_start[i] + ext_kernel_size[i], size[i]);
      d_start[i] = max(d_start[i], (int_tp)0);
      num /= pooled_size[i];
      offset *= size[i];
      d_iter[i] = d_start[i];

      if (d_start[i] >= d_end[i]) {
        top_data[index] = -FLT_MAX;
        if (use_mask) {
          mask[index] = -1;
        } else {
          top_mask[index] = -1;
        }
        do_continue = true;
      }
    }

    if(do_continue) {
      continue;
    }

    int_tp chan = num % channels;
    num /= channels;
    offset *= (num * channels + chan);

    Dtype maxval = -FLT_MAX;
    int_tp maxidx = -1;
    int_tp final_offset = 0;

    bool incremented;
    // Walk the pooling window odometer-style across all spatial axes.
    do {
      final_offset = offset;
      int_tp size_prod = 1;
      for (i = num_axes - 1; i >= 0; --i) {
        final_offset += d_iter[i] * size_prod;
        size_prod *= size[i];
      }

      if (bottom_data[final_offset] > maxval) {
        maxidx = final_offset;
        maxval = bottom_data[maxidx];
      }

      incremented = false;
      for (i = num_axes - 1; i >= 0; --i) {
        if (d_iter[i] >= d_end[i] - dilation[i]) {
          d_iter[i] = d_start[i];
        } else {
          d_iter[i] += dilation[i];
          incremented = true;
          break;
        }
      }
    } while (incremented);

    top_data[index] = maxval;
    if (use_mask == 1) {
      mask[index] = maxidx;
    } else {
      top_mask[index] = maxidx;
    }
  }
}

(This is the current all-purpose ND max-pooling kernel in OpenCL Caffe.)
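
For contrast, here is a rough sketch of what a parameter-specialized kernel can look like. This is a simplified, hypothetical 2D float-only version (no dilation, mask always written), not the actual LibDNN-generated code: the pooling geometry is baked in as compile-time constants (e.g. via -D build options), so the parameter arrays, the per-axis loop, and most branching disappear, and the window loops get fixed trip counts the compiler can unroll:

#define KERNEL_H 3
#define KERNEL_W 3
#define STRIDE_H 2
#define STRIDE_W 2
#define PAD_H 0
#define PAD_W 0

__kernel void max_pool_forward_2d_spec(const int n,
                                       __global const float* bottom_data,
                                       const int channels,
                                       const int height, const int width,
                                       const int pooled_h, const int pooled_w,
                                       __global float* top_data,
                                       __global int* mask) {
  for (int index = get_global_id(0); index < n; index += get_global_size(0)) {
    // Decompose the flat output index (NCHW layout).
    const int pw = index % pooled_w;
    const int ph = (index / pooled_w) % pooled_h;
    const int c = (index / pooled_w / pooled_h) % channels;
    const int b = index / pooled_w / pooled_h / channels;

    __global const float* slice =
        bottom_data + (b * channels + c) * height * width;

    float maxval = -FLT_MAX;
    int maxidx = -1;
    // Fixed trip counts: the compiler can fully unroll both window loops.
    for (int kh = 0; kh < KERNEL_H; ++kh) {
      for (int kw = 0; kw < KERNEL_W; ++kw) {
        const int h = ph * STRIDE_H - PAD_H + kh;
        const int w = pw * STRIDE_W - PAD_W + kw;
        if (h >= 0 && h < height && w >= 0 && w < width) {
          const int idx = h * width + w;
          if (slice[idx] > maxval) {
            maxval = slice[idx];
            maxidx = idx;
          }
        }
      }
    }
    top_data[index] = maxval;
    mask[index] = maxidx;
  }
}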

Currently, with FGLRX (OpenCL 2.0), the W9100 can do about 700 images/second on AlexNet. A GTX 1080 can do up to 3000/second with cuDNN. So there is still a long way to go, but I hope we can reach 1200 images/second on the W9100/RX 480. A step back was the AMDGPU-PRO driver (OpenCL 1.2), which currently reduces the W9100's performance to 450 images/second.

Ideally, by mid-2017, a Vega card should do 2600 images/second in FP32 and 5200 images/second in FP16. But that is wishful speculation ;)

naibaf7 avatar Sep 26 '16 01:09 naibaf7

Are we stepping outside the libdnn/convolution dogma? ;)

bhack avatar Sep 26 '16 09:09 bhack

@gstoner, as we have already discussed with @naibaf7, this kind of HSAIL-kernel approach doesn't improve the competition against cuDNN that much.

bhack avatar Sep 26 '16 09:09 bhack

@bhack I add kernels that have a significant performance effect or are required for my other projects as I go along; it might violate the dogma ;)

naibaf7 avatar Sep 26 '16 10:09 naibaf7

It would be interesting to know whether AMD, with @GPUOpen-ProfessionalCompute-Libraries, will support the new neural OpenVX extension via https://github.com/GPUOpen-ProfessionalCompute-Libraries/amdovx-modules/

bhack avatar Sep 26 '16 10:09 bhack

First, we are working on an optimized deep learning solver for our hardware. This is the only way to close the gap with cuDNN. We have a dedicated team working on this, and some very interesting advisors helping us. We are also working on more optimized versions of Caffe, TensorFlow, and Torch7.

We have also been focusing ROCm on the needs of deep learning. There are a number of key capabilities we need to bring to our drivers to help make it easier to drive application optimization and scale:

  • Native Code Object Loader with AMDGPU ABI
  • Native Code Generator LLVM to GCN (HCC, HIP and OpenCL will use common backend)
  • GCN Assembler and Disassembler Support
  • Native Peer to Peer Support for Intra and Inter node (RDMA)
  • Better Clock Control via the SMI interface (ROCm 1.3 SMI will have even better control)

We are working on some new capabilities for performance tools and debugging beyond what you're seeing in public today.

It is a lot of moving parts, but SC 2016 will be the one-year anniversary since we announced the Boltzmann Initiative. In that time:

  • We released ROCm 1.0 in April with Fiji support,
  • ROCm 1.1 in June,
  • ROCm 1.2 with Hawaii support,
  • ROCm 1.3 goes alpha the first week of October.

We now have velocity and focus. Our deep learning solver will have the same intensity.

Welcome to the new Radeon Open Compute Program.

gstoner avatar Sep 26 '16 13:09 gstoner

On Float16: here is an article I put together, below. We are working hard to get full FP16/Int16 instruction support into the new GCN compiler. This is stage 1 of Float16 support.

https://radeonopencompute.github.io/GCN_Float16.html ("ROC ON: Float16 and Integer16 support in AMD GPUs")

It has been a secret for too long: AMD GPUs do support Float16 and Int16 instructions. The current GPUs execute them at the same speed as Float32.

  • Fiji family of hardware: Radeon R9 Nano, R9 Fury, R9 Fury X, FirePro S9300x2
  • Tonga family of hardware: R9 285, R9 380, R9 380X, FirePro S7150x2, S7150, W7100
  • Polaris family of hardware: RX 480, RX 470, RX 460

We will also expose our GCN 3 ISA via an assembler directly supported by the compiler. The new LLVM native GCN ISA compiler supports a disassembler, an assembler, and soon inline assembly, so you will be able to tune your code even further.

ROCm compilers will bring the full richness of Float16 and Int16 via HCC, HIP, and OpenCL.

You can find out more about Float16 and the other instructions in the GCN version 3 ISA manual.

    Here are examples of some of the instructions supported:

  • V_FREXP_EXP_I16_F16 Returns exponent of half precision float input, such that the original single float = significand * (2^exponent).
  • V_CVT_F16_F32 Float32 to Float16.
  • V_ADD_F16 D.f16 = S0.f16 + S1.f16. Supports denormals, round mode, exception flags, saturation.
  • V_SUB_F16 D.f16 = S0.f16 - S1.f16. Supports denormals, round mode, exception flags, saturation. SQ translates to V_ADD_F16.
  • V_MAC_F16 16-bit floating point multiply-accumulate.
  • V_FMA_F16 Fused half precision multiply-add.
  • V_MAD_F16 Floating point multiply-add (MAD). Gives same result as ADD after MUL_IEEE. Uses IEEE rules for 0*anything.
  • V_MADAK_F16 16-bit floating-point multiply-add with constant add operand.
  • V_MADMK_F16 16-bit floating-point multiply-add with multiply operand immediate.
  • V_COS_F16 Cosine function.
  • V_SIN_F16 Sine function.
  • V_EXP_F16 Base-2 exponent function.
  • V_LOG_F16 Base-2 log function.
  • V_SQRT_F16 if (S0.f16 == 1.0f) D.f16 = 1.0f; else D.f16 = ApproximateSqrt(S0.f16).
  • V_FRACT_F16 Floating point 'fractional' part of S0.f.
  • V_RCP_F16 if (S0.f16 == 1.0f) D.f16 = 1.0f; else D.f16 = ApproximateRecip(S0.f16).
  • V_RSQ_F16 if (S0.f16 == 1.0f) D.f16 = 1.0f; else D.f16 = ApproximateRecipSqrt(S0.f16).
  • V_RNDNE_F16 Floating-point round-to-nearest-even integer.
  • V_TRUNC_F16 Floating point 'integer' part of S0.f. D.f16 = trunc(S0.f16). Round-to-zero semantics.
  • V_LDEXP_F16
  • V_CEIL_F16 Floating point ceiling function.
  • V_FLOOR_F16 Floating-point floor function.
  • V_MAX_F16 D.f16 = max(S0.f16, S1.f16). IEEE compliant. Supports denormals, round mode, exception flags, saturation.
  • V_MIN_F16 D.f16 = min(S0.f16, S1.f16). IEEE compliant. Supports denormals, round mode, exception flags, saturation.
  • V_CVT_PKRTZ_F16_F32 Convert two float 32 numbers into a single register holding two packed 16-bit floats.
  • V_DIV_FIXUP_F16 Given a numerator, denominator, and quotient from a divide, this opcode detects and applies special-case numerics, modifying the quotient if necessary. It also generates invalid, denorm, and divide-by-zero exceptions caused by the division.
  • V_SUBREV_F16 D.f16 = S1.f16 - S0.f16. Supports denormals, round mode, exception flags, saturation. SQ translates to V_ADD_F16.

The GCN 3 architecture also supports 32-bit, 24-bit, and 16-bit integer math:

  • V_ADD_U16 D.u16 = S0.u16 + S1.u16. Supports saturation (unsigned 16-bit integer domain).
  • V_SUB_U16 D.u16 = S0.u16 - S1.u16. Supports saturation (unsigned 16-bit integer domain).
  • V_MAD_I16 Signed integer muladd.
  • V_MAD_U16 Unsigned integer muladd.
  • V_SAD_U16 Sum of absolute differences with accumulation.
  • V_MAX_I16 D.i[15:0] = max(S0.i[15:0], S1.i[15:0]).
  • V_MAX_U16 D.u[15:0] = max(S0.u[15:0], S1.u[15:0]).
  • V_MIN_I16 D.i[15:0] = min(S0.i[15:0], S1.i[15:0]).
  • V_MIN_U16 D.u[15:0] = min(S0.u[15:0], S1.u[15:0]).
  • V_MUL_LO_U16 D.u16 = S0.u16 * S1.u16. Supports saturation (unsigned 16-bit integer domain).
  • V_CVT_F16_U16 D.f16 = uint16_to_flt16(S.u16). Supports denormals, rounding, exception flags and saturation.
  • V_CVT_F16_I16 D.f16 = int16_to_flt16(S.i16). Supports denormals, rounding, exception flags and saturation.
  • V_SUBREV_U16 D.u16 = S1.u16 - S0.u16. Supports saturation (unsigned 16-bit integer domain). SQ translates this to V_SUB_U16 with reversed operands.
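
On the OpenCL side, reaching these instructions is mostly a matter of using the half type once a device exposes cl_khr_fp16. A minimal, hypothetical sketch (kernel name and shape are illustrative, not from any shipped library) whose arithmetic a GCN 3 compiler can map to instructions such as V_FMA_F16 / V_MAC_F16 from the list above:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void axpy_fp16(const int n,
                        const half alpha,
                        __global const half* x,
                        __global half* y) {
  const int i = (int)get_global_id(0);
  if (i < n) {
    // fma() on half operands can lower to the FP16 fused multiply-add family.
    y[i] = fma(alpha, x[i], y[i]);
  }
}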

gstoner avatar Sep 26 '16 13:09 gstoner

/cc @hughperkins I think he could be interested in the last comments of this thread.

bhack avatar Sep 26 '16 13:09 bhack

  • inline assembly sounds good
  • I would also be tentatively interested in being able to pass in LLVM IR directly somehow, ideally via the standard OpenCL API

hughperkins avatar Sep 26 '16 14:09 hughperkins

We had this with OpenCL 1.2: SPIR 1.2 with OpenCL. Three people in the world used it: my team, Continuum IO, and Codeplay. Honestly, this path is great for compiler prototypes, but when you get deep into your work you want more control over the compiler.

My team and Continuum IO moved away from it since it was an overly constrained solution and did not let you solve the key problems you face when you bring a language that is not like OpenCL onto the platform.

On ROCm we give you the full LLVM IR interface, since we have upstreamed the full source of the AMDGPU GCN compiler, and you have low-level access to the ROCr system runtime for when you really want to tune for performance.

  • https://github.com/RadeonOpenCompute/llvm
  • http://llvm.org/docs/AMDGPUUsage.html

You can extend the ROCm device-library compiler intrinsics with the now public Open Compute Math Library and Open Compute kernel language:

  • https://github.com/RadeonOpenCompute/ROCm-Device-Libs

We now have a standardized loader interface; here is the ABI documentation, which we plumb up via our language runtimes as well.

  • https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md

One big thing: the compiler is being developed so we can do true offline compilation, and it can be upgraded independently of the driver. Also, ROCr is a language-independent system runtime, so you can load the binary and the language runtime at execution time, just like you do with CPU-based software development. No more monolithic blob of stuff.
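
Purely as an illustration of that offline-compile / load-at-execution pattern (shown here from the standard OpenCL side; the ROCr loader API is lower level and separate), a host-side sketch of handing a precompiled code object to the runtime:

#include <CL/cl.h>

/* Hand a precompiled device binary to the runtime instead of source code.
 * Error handling is minimal; names and flow are illustrative only. */
cl_program load_prebuilt_program(cl_context ctx, cl_device_id dev,
                                 const unsigned char* binary, size_t size) {
  cl_int status, binary_status;
  cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size,
                                              &binary, &binary_status, &status);
  if (status != CL_SUCCESS || binary_status != CL_SUCCESS) {
    return NULL;
  }
  /* For a device binary this finalizes/links rather than recompiling. */
  clBuildProgram(prog, 1, &dev, "", NULL, NULL);
  return prog;
}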

gstoner avatar Sep 26 '16 14:09 gstoner

Intel Beignet: 2.0 done https://lists.freedesktop.org/archives/beignet/2017-January/008476.html

bhack avatar Jan 21 '17 14:01 bhack

Cool, will test next week :)

naibaf7 avatar Jan 21 '17 23:01 naibaf7

@naibaf7 Is there a possibility of having the Intel kernels upstreamed? Because I think MKL-DNN and MKL 2017 will cover only the CPU.

bhack avatar Jan 22 '17 14:01 bhack

Thanks @bhack for the link. I will see if I can use it on my laptop with Ubuntu 16.04. I have had some bad experiences installing Intel GPU drivers to support OpenCL on Ubuntu in the past, so I hope it has become a bit more user-friendly ;) ...

gfursin avatar Jan 23 '17 11:01 gfursin