
discussion: porting Halide to Nim


opening this issue to track ideas for how to port Halide to Nim. This could be done either via:

  • A1: a wrapper (e.g. using nimterop), or
  • A2: re-implementing some of the ideas directly in Nim, which, although a lot of work, would result in a much simpler solution given Nim's better DSL abilities

proposal: approach 1

use a compile-time plugin approach as done in nimterop (https://github.com/genotrance/nimterop/pull/66) to transparently compile (using Halide) a C++/Halide DSL, save it to a shared library and cache it; e.g.:

# this is in Nim

import halide_funs

cPluginHalide: """ // this is compiled on the fly and cached, using a hash of the string, eg saved to: /tmp/nimhalide/libplugin_12abf23.dylib
// this is C++ halide code
Func blur_3x3(Func input) { // example from https://en.wikipedia.org/wiki/Halide_(programming_language)
  Func blur_x, blur_y;
  Var x, y, xi, yi;

  // The algorithm - no storage or order
  blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y))/3;
  blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1))/3;

  // The schedule - defines order, locality; implies storage
  blur_y.tile(x, y, xi, yi, 256, 32)
        .vectorize(xi, 8).parallel(y);
  blur_x.compute_at(blur_y, x).vectorize(x, 8);

  return blur_y;
}"""

proc main() =
  let image = loadImage("img.jpg")
  let blurred = blur_3x3(image) # calls into the cached shared library

[EDIT] proposal: approach 2

using https://github.com/fragcolor-xyz/nimline could be another possibility, but it would probably need adaptation, as the Halide compiler needs to be run on the C++/Halide program

related

  • halide: https://en.wikipedia.org/wiki/Halide_(programming_language)

  • some discussion on gitter https://gitter.im/Arraymancer/Lobby?at=5c4f1b7d13a2814df6ded8f5 + a few hours before that

  • https://github.com/mratsim/Arraymancer/issues/203 briefly mentions halide but wasn't the focus of that issue

  • tutorial: http://halide-lang.org/tutorials/tutorial_lesson_16_rgb_generate.html

  • other language integrations:
    • nlvm: https://github.com/arnetheduck/nlvm

timotheecour (Jan 28 '19)

As discussed, this sounds like a very interesting idea and from a learning perspective I'd rather implement it from scratch, just like Arraymancer started as a way to learn NN from scratch instead of reusing Tensorflow or Torch.

Overview of Halide

Halide is a C++ domain specific language for computational image processing. It started with the idea of separating the algorithm (the what) from its efficient hardware implementation (the how). This is similar to a (functional programming) source file vs machine code when passed through an optimizing compiler.

The "how" can be hinted to assist the compiler in its threading and vectorization by specifying cache blocking parameters, unrolling factor and threading hints.

Furthermore, Halide supports multiple backends (x86, ARM, Vulkan, DX12, OpenGL, OpenCL, Cuda, ...) including a JIT backend that dispatches to the latest instruction set supported (for example AVX512).

The pipeline is: C++ DSL --> Halide Intermediate Representation (Computational Graph) --> LLVM (AOT or JIT) --> machine code.

Advantages

In general

  • Short code, much faster development of new efficient algorithms.
  • No need to fiddle with platform-specific optimisations like SIMD.
  • Portability.
  • Computation can be fused across functions and more parallelism can be exposed.
  • Much easier to test different parallel schedules.

For ML and DL

The main advantages are portability, SIMD optimisation opportunities, and computation fusion to avoid intermediate allocations and results.
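
To illustrate what fusion buys, here is a minimal Nim sketch (illustration only, not Halide code): evaluating `relu(a + b)` eagerly allocates an intermediate for `a + b` and traverses memory twice, while the fused version walks the data once with no temporary. Halide can derive such fused loops across function boundaries from the schedule.

import std/sequtils

proc unfusedReluAdd(a, b: seq[float32]): seq[float32] =
  let tmp = zip(a, b).mapIt(it[0] + it[1])  # intermediate allocation
  tmp.mapIt(max(it, 0'f32))                 # second pass over memory

proc fusedReluAdd(a, b: seq[float32]): seq[float32] =
  result = newSeq[float32](a.len)
  for i in 0 ..< a.len:                     # single pass, no temporary
    result[i] = max(a[i] + b[i], 0'f32)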

Parallelism in ML/DL is either straightforward (it's a single loop) or very complex and would need manual involvement:

  • ~~For example Halide is memory-bound on matrix multiplication (0.5-1 TFLOP/s at most while I can reach 1.8 TFLOP/s with OpenBLAS or MKL-DNN)~~ Perf reporting in Halide is confusing; I revamped it and it reaches MKL perf on single-socket CPUs.
  • Halide is unsuited for tree ensembles like random forests

Some DL layers that would benefit from Halide-like functionality (a pooling sketch follows the list):

  • Convolution (apparently Halide has a schedule that reaches MKL-DNN performance)
  • Locally-Connected Layer/Unshared Convolution (#255). Though this might be faster with a summed-area table/integral image and then a simple for loop.
  • Bilinear
  • Pooling layers (Max, Average, Region of Interest Pooling)
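
As a concrete example of the kind of kernel meant above, here is 2x2 average pooling written as plain Nim loops. In a Halide-like DSL the algorithm would be a one-line definition, and the loop order, tiling and vectorization would live in a separate schedule. Names and layout (row-major, single channel) are illustrative assumptions.

proc avgPool2x2(src: seq[float32], w, h: int): seq[float32] =
  ## 2x2 average pooling, stride 2, on a row-major single-channel image
  let ow = w div 2
  let oh = h div 2
  result = newSeq[float32](ow * oh)
  for y in 0 ..< oh:
    for x in 0 ..< ow:
      result[y*ow + x] = 0.25'f32 * (
        src[(2*y)*w + 2*x]   + src[(2*y)*w + 2*x + 1] +
        src[(2*y+1)*w + 2*x] + src[(2*y+1)*w + 2*x + 1])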

Disadvantages

  1. Halide is pretty much a static graph compiler, like Tensorflow XLA, which is not that great for research and debugging. It's a pain to use Tensorflow for NLP, RNNs and GANs, for example.

    The main issue is the lack of control flow like for loop and if/else branches.

Note that OpenAI Dali (not to be confused with Nvidia DALI) apparently is a compiler that plugs into dynamic graph frameworks.

  2. Another disadvantage is the LLVM dependency at runtime, but that only concerns JIT. LLVM is a huge library.

Halide papers

  • Jonathan Ragan-Kelley thesis: https://people.csail.mit.edu/jrk/jrkthesis.pdf
  • Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines: http://people.csail.mit.edu/jrk/halide-pldi13.pdf
  • Compiling high performance recursive filters: http://www-sop.inria.fr/reves/Basilic/2015/CRPDD15a/recfilter.pdf
  • CVPR slides - recursive filtering: https://halide-lang.org/assets/lectures/CVPR_Halide_RecursiveFiltering.pdf
  • Parallel Associative Reduction in Halide: https://andrew.adams.pub/par.pdf
  • Automatically Scheduling Halide Image Processing Pipelines: http://graphics.cs.cmu.edu/projects/halidesched/mullapudi16_halidesched.pdf

Beyond Halide

Halide has been expanded with autodiff capabilities in gradient-halide, which may greatly enhance autograd, a recurrent source of slowness in Arraymancer due to the need for a DAG, which doesn't play well with reference counting. However, while gradient-halide results are promising, TVM (a fork of Halide for DL) reported performance issues.

Related projects

  • Tensorflow XLA: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/g3doc/overview.md
  • Tensorflow MLIR: https://github.com/tensorflow/mlir
  • PyTorch Glow: https://github.com/pytorch/glow
  • Dask: https://dask.org
  • Gradient Halide: https://people.csail.mit.edu/tzumao/gradient_halide/
  • OpenAI Dali (can't find the repo): http://learningsys.org/nips18/assets/papers/100CameraReadySubmissionDali__Lazy_Compilation_of_Dynamic_Computation_Graphs.pdf
  • DMLC TVM: https://github.com/dmlc/tvm
  • Intel PlaidML: https://github.com/plaidml/plaidml
  • Intel Nervana Ngraph: https://github.com/NervanaSystems/ngraph/tree/master/python
  • The Tensor Algebra Compiler: https://github.com/tensor-compiler/taco
  • Numba: http://numba.pydata.org (less related as it doesn't compile Expression Graphs)
  • PyGPU (2006): An embedded DSL for high speed image processing
  • Stanford's Delite and OptiML (2011)
  • DLVM Deep Learning Virtual Machine (2017 - Authors migrated to Swift for Tensorflow)
    • https://arxiv.org/pdf/1711.03016.pdf
    • https://github.com/dlvm-team/dlvm-core
  • TriNNity:
    • https://www.scss.tcd.ie/~andersan/projects/live/triNNity.html
    • https://bitbucket.org/STG-TCD/trinnity/src/master/
  • Graphi, scheduling deep learning computation graphs on manycore CPU:
    • https://arxiv.org/abs/1807.09667
  • CGT: Computational Graph Toolkit (computation graph and autodiff on multidimensional arrays algorithm, with C and Python API):
    • https://github.com/joschu/cgt/
    • http://rll.berkeley.edu/cgt/
  • PolyMage: Automatic Optimization for image processing pipelines:
    • Note:
      • PolyMage uses a Python DSL that compiles to C++, including OpenMP and ivdep pragmas
      • It uses polyhedral optimization with several tiling algorithms (notably parallelogram and overlapped)
      • The author later refactored the Halide auto-scheduler
    • http://mcl.csa.iisc.ac.in/polymage/asplos235-mullapudi.pdf
    • http://mcl.csa.iisc.ernet.in/downloads/slides/PolyMage.pdf
    • http://mcl.csa.iisc.ac.in/polymage.html
    • https://bitbucket.org/udayb/polymage/
  • SMIL / FREIA: A Dynamic to Static DSL Compiler for Image Processing Applications
    • https://www.cri.ensmp.fr/classement/doc/E-398-slides.pdf
    • https://www.cri.mines-paristech.fr/classement/doc/A-640.pdf
    • https://www.cri.mines-paristech.fr/classement/doc/A-670.pdf
  • DeepDSL (Scala): A Compilation-based Domain-Specific Language for Deep Learning
    • https://github.com/deepdsl/deepdsl
    • https://openreview.net/pdf?id=Bks8cPcxe

Auto-scheduling - Polyhedral tiling

  • The Promises of Hybrid Hexagonal/Classical Tiling for GPU: https://hal.inria.fr/hal-00848691/document
  • Applying the Polyhedral Model to Tile Loops in Devito: https://pdfs.semanticscholar.org/d0bd/7c5f51c80c3236e99009298734291745e2e7.pdf
  • Jie Zhao:
    • A Combined Language and Polyhedral Approach for Heterogeneous Parallelism: https://tel.archives-ouvertes.fr/tel-01988073/file/thesis.pdf
    • A General Purpose Automatic Overlapped Tiling Technique in Polyhedral Frameworks: https://www.di.ens.fr/~zhaojie/cgo2018-poster-presentation
  • Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code.
    • Note: supports RNN
    • https://arxiv.org/pdf/1804.10694.pdf
    • https://andreask.cs.illinois.edu/cs598apk-f18/talks/njchris2.pdf
    • http://tiramisu-compiler.org
    • https://github.com/Tiramisu-Compiler/tiramisu

Approaches to porting

Wrapping

I'm not really a fan of wrapping unless we can do away with CMake and completely configure everything via Nim. This is the approach I took for the Arcade Learning Environment for reinforcement learning, with a Nim build script that basically replaces CMake; the DLL built can be called from another Nim program: https://github.com/numforge/agent-smith/blob/master/third_party/ale_wrap.nim. This preserves C++ name mangling, puts everything in nimcache and can be cleanly made into a nimble task.

The alternative is something like nimtorch, which requires either a build script (conda in NimTorch's case), potentially a Docker image, or building the library separately and then copying it.

Once wrapped, we need a way for Nim to generate valid Halide DSL code; this means removing all the fluff Nim adds to a C/C++ file. Interestingly, this is a very similar problem to generating a Cuda kernel as done in Cudanim, and it would also apply to the OpenCL code generator ideas from 2014, 2015 and 2016.
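
To make this concrete, here is a hedged sketch of one possible emission strategy: represent a small expression tree in Nim and render it to a Halide C++ fragment. `HExpr` and `render` are hypothetical names; a real implementation would build such trees from the Nim AST inside a macro instead of by hand.

import std/[sequtils, strutils]

type
  HExprKind = enum hVar, hCall, hAdd, hDiv
  HExpr = ref object
    case kind: HExprKind
    of hVar: name: string
    of hCall:
      fn: string
      args: seq[HExpr]
    of hAdd, hDiv:
      lhs, rhs: HExpr

proc render(e: HExpr): string =
  ## Renders the tree as a C++/Halide expression fragment
  case e.kind
  of hVar: e.name
  of hCall: e.fn & "(" & e.args.mapIt(it.render).join(", ") & ")"
  of hAdd: "(" & render(e.lhs) & " + " & render(e.rhs) & ")"
  of hDiv: "(" & render(e.lhs) & " / " & render(e.rhs) & ")"

let x = HExpr(kind: hVar, name: "x")
let y = HExpr(kind: hVar, name: "y")
let inputXY = HExpr(kind: hCall, fn: "input", args: @[x, y])
echo render(HExpr(kind: hDiv,
  lhs: HExpr(kind: hAdd, lhs: inputXY, rhs: inputXY),
  rhs: HExpr(kind: hVar, name: "3")))
# prints: ((input(x, y) + input(x, y)) / 3)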

Reimplementing from scratch

As said in the beginning, from a learning perspective it's the most interesting option, though the time-to-market and the testing (+ fuzzing) would probably be years away.

Interestingly, it seems like Halide started as FImage and had its own assembler. The IR already looked quite similar to the current one after only 3 months: https://github.com/halide/Halide/tree/5a2f90b7bcf4d9298e4068698f52731401e07dc8.

There are 3 big parts to Halide from an engineering perspective:

  • The frontend, which converts the DSL to Halide IR.
  • A lowering step from Halide IR to LLVM IR.
  • The backend which uses LLVM to generate code for the different targets.

Let's skip the front-end, which has to be implemented in any case, and discuss the backend.

Intermediate representation

For the IR there are several choices: Halide IR, Glow IR, TVM IR, SPIR-V or LLVM IR directly
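
Whichever we pick, at its core an IR is just a typed tree/graph. A minimal sketch (assumed names, vastly simplified compared to Halide IR) of what a home-grown IR could look like in Nim:

type
  IrKind = enum irConst, irLoad, irBinOp, irFor
  BinOp = enum opAdd, opMul, opDiv
  IrNode = ref object
    case kind: IrKind
    of irConst:
      value: float64
    of irLoad:               # read buffer[index]
      buffer: string
      index: IrNode
    of irBinOp:
      op: BinOp
      a, b: IrNode
    of irFor:                # for loopVar in lo ..< hi: body
      loopVar: string
      lo, hi: IrNode
      body: IrNode

Lowering would then be a recursive walk over `IrNode` emitting LLVM IR, SPIR-V or C source, and scheduling transformations (tiling, vectorizing) become rewrites of the `irFor` nests.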

LLVM backend

Assuming we want to support lots of frameworks, I don't see any alternative to LLVM for the backend, from a license, platform and maturity point of view. This comes with its own costs, notably compilation overhead and the space LLVM takes.

Regarding the API, I'm not yet sure which one Halide uses, but LLVM offers 2 (see the sketch after this list):

  • The stable MCJIT: https://llvm.org/docs/MCJITDesignAndImplementation.html
  • The new Orc JIT: https://llvm.org/docs/tutorial/BuildingAJIT1.html
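
With either JIT, the IR can be built fully in memory through the LLVM-C API, no textual IR involved. A hedged sketch from Nim (the dynlib name is an assumption and varies per platform/LLVM version; the entry points are standard LLVM-C ones):

const llvmLib = "libLLVM.so"  # assumption: adjust per platform / LLVM version

type
  ModuleRef = distinct pointer
  TypeRef = distinct pointer
  ValueRef = distinct pointer
  BasicBlockRef = distinct pointer
  BuilderRef = distinct pointer

{.pragma: llvm, importc, cdecl, dynlib: llvmLib.}

proc LLVMModuleCreateWithName(name: cstring): ModuleRef {.llvm.}
proc LLVMInt64Type(): TypeRef {.llvm.}
proc LLVMFunctionType(ret: TypeRef, params: ptr TypeRef,
                      count: cuint, isVarArg: cint): TypeRef {.llvm.}
proc LLVMAddFunction(m: ModuleRef, name: cstring, ty: TypeRef): ValueRef {.llvm.}
proc LLVMAppendBasicBlock(fn: ValueRef, name: cstring): BasicBlockRef {.llvm.}
proc LLVMCreateBuilder(): BuilderRef {.llvm.}
proc LLVMPositionBuilderAtEnd(b: BuilderRef, bb: BasicBlockRef) {.llvm.}
proc LLVMGetParam(fn: ValueRef, idx: cuint): ValueRef {.llvm.}
proc LLVMBuildAdd(b: BuilderRef, lhs, rhs: ValueRef, name: cstring): ValueRef {.llvm.}
proc LLVMBuildRet(b: BuilderRef, v: ValueRef): ValueRef {.llvm.}
proc LLVMPrintModuleToString(m: ModuleRef): cstring {.llvm.}

# Build `add(a, b: int64): int64 = a + b` as LLVM IR, in memory:
let module = LLVMModuleCreateWithName("jit_demo")
var params = [LLVMInt64Type(), LLVMInt64Type()]
let fnTy = LLVMFunctionType(LLVMInt64Type(), addr params[0], 2, 0)
let fn = LLVMAddFunction(module, "add", fnTy)
let builder = LLVMCreateBuilder()
LLVMPositionBuilderAtEnd(builder, LLVMAppendBasicBlock(fn, "entry"))
discard LLVMBuildRet(builder,
  LLVMBuildAdd(builder, LLVMGetParam(fn, 0), LLVMGetParam(fn, 1), "sum"))
echo LLVMPrintModuleToString(module)  # MCJIT or Orc would then execute `module`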
Written from scratch backends

JIT backend

Assuming we target a couple of specific platforms that share a lot of similarity in the programming model (for example x86, ARM, Cuda, ROCm), this is also possible. Furthermore, as you said, using Nim offers a tremendous metaprogramming advantage, as evidenced by the very clean way I could implement my own JIT assembler in Laser:

https://github.com/numforge/laser/blob/56df4643530ada0a01d29116feb626d93ac11379/laser/photon_jit/x86_64/x86_64_ops.nim

op_generator:
  op MOV: # MOV(dst, src) load/copy src into destination
    ## Copy 64-bit register content to another register
    [dst64, src64]: [rex(w=1), 0x89, modrm(Direct, reg = src64, rm = dst64)]
    ## Copy 32-bit register content to another register
    [dst32, src32]: [          0x89, modrm(Direct, reg = src32, rm = dst32)]
    ## Copy 16-bit register content to another register
    [dst16, src16]: [    0x66, 0x89, modrm(Direct, reg = src16, rm = dst16)]

    ## Copy  8-bit register content to another register
    [dst8,  src8]:  [          0x88, modrm(Direct, reg = src8, rm = dst8)]

    ## Copy 64-bit immediate value into register
    [dst64, imm64]: [rex(w=1), 0xB8 + dst64] & imm64
    ## Copy 32-bit immediate value into register
    [dst64, imm32]: [          0xB8 + dst64] & imm32
    ## Copy 16-bit immediate value into register
    [dst64, imm16]: [    0x66, 0xB8 + dst64] & imm16

    ## Copy 32-bit immediate value into register
    [dst32, imm32]: [          0xB8 + dst32] & imm32
    ## Copy 16-bit immediate value into register
    [dst32, imm16]: [    0x66, 0xB8 + dst32] & imm16

    ## Copy 16-bit immediate value into register
    [dst16, imm16]: [    0x66, 0xB8 + dst16] & imm16
    ## Copy  8-bit immediate value into register
    [dst8,  imm8]:  [          0xB0 + dst8, imm8]

  op LEA:
    ## Load effective address of the target label into a register
    [dst64, label]: [rex(w=1), 0x8D, modrm(Direct, reg = dst64, rm = rbp)]

  op CMP:
    ## Compare 32-bit immediate with 32-bit int at memory location stored in adr register
    [adr, imm64]: [ rex(w=1), 0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm64
    ## Compare 32-bit immediate with 32-bit int at memory location stored in adr register
    [adr, imm32]: [           0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm32
    ## Compare 16-bit immediate with 16-bit int at memory location stored in adr register
    [adr, imm16]: [     0x66, 0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm16
    ## Compare 8-bit immediate with byte at memory location stored in adr register
    [adr, imm8]:  [           0x80, modrm(Indirect, opcode_ext = 7, rm = adr[0]), imm8]

  op JZ:
    ## Jump to label if zero flag is set
    [label]: [0x0F, 0x84]
  op JNZ:
    ## Jump to label if zero flag is not set
    [label]: [0x0F, 0x85]

JIT is the approach taken for x86-only libs:

  • Intel libxsmm - JIT from scratch: https://github.com/hfp/libxsmm
  • MKLDNN - Xbyak JIT: https://github.com/intel/mkl-dnn
  • PyTorch FBGEMM - asmjit

There are significant challenges though:

  • Making it robust (needs fuzzing)
  • Caching (for reference, Intel and AMD still don't cache their JIT-ted OpenCL kernels)
  • Thread safety
  • Register allocation
  • Optimization

AOT backend

We can also do ahead-of-time compilation with runtime dispatch depending on the target. This is the approach I take with Laser GEMM, with a version for SSE, SSE2, SSE4.1, AVX, AVX+FMA, AVX2, AVX512, ..., and we can add Cuda, OpenCL, etc.
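
A minimal sketch of the dispatch mechanism (names are assumptions; `hasAvx2` stands in for real CPUID-based feature detection): each ISA variant is compiled ahead of time, and a function pointer is selected once at startup.

type KernelProc = proc(dst, a, b: ptr UncheckedArray[float32], n: int) {.nimcall.}

proc vecAddBaseline(dst, a, b: ptr UncheckedArray[float32], n: int) {.nimcall.} =
  for i in 0 ..< n:              # portable fallback
    dst[i] = a[i] + b[i]

proc vecAddAvx2(dst, a, b: ptr UncheckedArray[float32], n: int) {.nimcall.} =
  # in practice compiled in its own module with -mavx2
  for i in 0 ..< n:
    dst[i] = a[i] + b[i]

proc hasAvx2(): bool = false     # placeholder for CPUID feature detection

let vecAdd: KernelProc = (if hasAvx2(): vecAddAvx2 else: vecAddBaseline)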

mratsim (Jan 31 '19)

Note that we can generate LLVM IR from Nim (or from a DSL, if we have a macro mapping the DSL to Nim); this demo shows how: https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0127.nim

  • the DSL could be either pure Nim:
DSLtoLLVMIR:
  proc entryPoint() {.exportc.} =
    for i in 0..<10:
      echo i*i
  • or something custom for image processing, as you were referring to, e.g.:
DSLtoLLVMIR: # front-end that converts DSL to nim is needed here
  forEach x in a, y in b, z in c:
    z = x + y

timotheecour (Feb 01 '19)

I find the need to go through a text file quite clunky and slow in the long term.

Since LLVM has a bitcode representation of its IR, I was thinking of targeting that directly using bindings akin to LLVM-D.

This is also what Halide is doing.

From an interface point of view, I can probably re-use the JIT Assembler for x86-64 I created here, except that instead of assembling x86 I'd be assembling LLVM IR (Usage).

It can be restricted to the subset valid for Nvidia targets.

Alternatively, it seems like we may be able to directly JIT an AST representation without having to go through an LLVM IR phase.
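
One cheap variant of that idea, as a hedged sketch: compile the AST to nested Nim closures, one step short of emitting machine code, which already removes tree-walking dispatch from the inner loop. All names are illustrative.

type
  AstKind = enum akConst, akVar, akAdd, akMul
  Ast = ref object
    case kind: AstKind
    of akConst: value: float64
    of akVar: idx: int                       # index into the environment
    of akAdd, akMul: a, b: Ast

  Compiled = proc(env: seq[float64]): float64

proc compile(n: Ast): Compiled =
  case n.kind
  of akConst:
    let v = n.value
    result = proc(env: seq[float64]): float64 = v
  of akVar:
    let i = n.idx
    result = proc(env: seq[float64]): float64 = env[i]
  of akAdd:
    let (fa, fb) = (compile(n.a), compile(n.b))
    result = proc(env: seq[float64]): float64 = fa(env) + fb(env)
  of akMul:
    let (fa, fb) = (compile(n.a), compile(n.b))
    result = proc(env: seq[float64]): float64 = fa(env) * fb(env)

# (x + 2) * y with env = @[x, y]
let f = compile(Ast(kind: akMul,
  a: Ast(kind: akAdd, a: Ast(kind: akVar, idx: 0),
         b: Ast(kind: akConst, value: 2.0)),
  b: Ast(kind: akVar, idx: 1)))
echo f(@[3.0, 4.0])  # 20.0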

mratsim (Feb 01 '19)

Some advances in my research (will find a place to store that in the future, probably in Laser)

Code generation

  • Kai Nacke, maintainer of LDC (LLVM D Compiler), did a very nice talk about what he learned by implementing his own compiler from scratch at FOSDEM 2019. Slides.

Alternative HPC compiler

  • http://okmij.org/ftp/meta-programming/HPC.html and the Shonan challenge: several HPC problems with known hand-tuned solutions that would be nice to optimize automatically:
    • Paper - test cases: http://okmij.org/ftp/meta-programming/Shonan-challenge.pdf
    • Report: http://okmij.org/ftp/meta-programming/shonan2-report.pdf
    • 1D Conv optimisation in OCaml: http://okmij.org/ftp/meta-programming/lift/convolution.ml

Semantic pass and the expression problem

Halide and PyTorch Glow use the visitor pattern, which looks to me like a workaround for an insufficiently expressive language.

We need to be able to extend functions without touching the core ones and if possible data types as well.

For example we could have the functions Add and Mul defined on Tensor and CudaTensor today and a user library implement Div and also extend Add and Mul on VulkanTensor.

Data types are probably not that critical, i.e. we could use ADTs + Generics for Tensor[float32], VulkanTensor[float16], ...
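
For illustration, a hedged sketch of how Nim's open proc overloading sidesteps the visitor pattern (type and proc names are illustrative, not Arraymancer's actual API): a user library adds both a new backend and a new op without touching the core module.

type
  Tensor[T] = object            # core library type
    data: seq[T]
  VulkanTensor[T] = object      # defined in a user library
    handle: int

# core library
proc add[T](a, b: Tensor[T]): Tensor[T] =
  result.data = newSeq[T](a.data.len)
  for i in 0 ..< a.data.len:
    result.data[i] = a.data[i] + b.data[i]

# user library: extends `add` to a new backend, no core change needed
proc add[T](a, b: VulkanTensor[T]): VulkanTensor[T] =
  discard # would dispatch a Vulkan compute kernel here

# user library: a brand-new op on the core type
proc divide[T: SomeFloat](a, b: Tensor[T]): Tensor[T] =
  result.data = newSeq[T](a.data.len)
  for i in 0 ..< a.data.len:
    result.data[i] = a.data[i] / b.data[i]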

In research, there are several approaches to handle that:

mratsim (Feb 06 '19)