
discussion: porting Halide to Nim


opening this issue to track ideas for how to port Halide to Nim. This could be done either via:

  • A1: a wrapper (e.g. using nimterop), or
  • A2: re-implementing some of the ideas directly in Nim, which, although a lot of work, would result in a much simpler solution given Nim's better DSL abilities

proposal: approach 1

use a compile-time plugin approach as done in nimterop (https://github.com/genotrance/nimterop/pull/66) to transparently compile (using Halide) a C++/Halide DSL, save it to a shared library and cache it; e.g.:

# this is in Nim

import halide_funs

cPluginHalide: """ // this is compiled on the fly and cached, using a hash of the string, eg saved to: /tmp/nimhalide/libplugin_12abf23.dylib
// this is C++ halide code
Func blur_3x3(Func input) { // example from https://en.wikipedia.org/wiki/Halide_(programming_language)
  Func blur_x, blur_y;
  Var x, y, xi, yi;

  // The algorithm - no storage or order
  blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y))/3;
  blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1))/3;

  // The schedule - defines order, locality; implies storage
  blur_y.tile(x, y, xi, yi, 256, 32)
        .vectorize(xi, 8).parallel(y);
  blur_x.compute_at(blur_y, x).vectorize(x, 8);

  return blur_y;
}"""

proc main() =
  let image = loadImage("img.jpg")
  let blurred = blur_3x3(image) # calls into the cached shared library

[EDIT] proposal: approach 2

using https://github.com/fragcolor-xyz/nimline could be another possibility, but it would probably need adaptation, as the Halide compiler needs to be run on the C++/Halide program

related

  • halide: https://en.wikipedia.org/wiki/Halide_(programming_language)

  • some discussion on gitter https://gitter.im/Arraymancer/Lobby?at=5c4f1b7d13a2814df6ded8f5 + a few hours before that

  • https://github.com/mratsim/Arraymancer/issues/203 briefly mentions halide but wasn't the focus of that issue

  • tutorial: http://halide-lang.org/tutorials/tutorial_lesson_16_rgb_generate.html

  • other language integrations:
    • nlvm: https://github.com/arnetheduck/nlvm

timotheecour (Jan 28 '19)

As discussed, this sounds like a very interesting idea and from a learning perspective I'd rather implement it from scratch, just like Arraymancer started as a way to learn NN from scratch instead of reusing Tensorflow or Torch.

Overview of Halide

Halide is a C++ domain specific language for computational image processing. It started with the idea of separating the algorithm (the what) from its efficient hardware implementation (the how). This is similar to a (functional programming) source file vs machine code when passed through an optimizing compiler.

The "how" can be hinted to assist the compiler in its threading and vectorization by specifying cache blocking parameters, unrolling factor and threading hints.

Furthermore, Halide supports multiple backends (x86, ARM, Vulkan, DX12, OpenGL, OpenCL, Cuda, ...) including a JIT backend that dispatches to the latest instruction set supported (for example AVX512).

The pipeline is: C++ DSL --> Halide Intermediate Representation (Computational Graph) --> LLVM (AOT or JIT) --> machine code.

Advantages

In general

  • Short code, much faster development of new efficient algorithms.
  • No need to fiddle with platform-specific optimisations like SIMD.
  • Portability.
  • Computation can be fused across functions and more parallelism can be exposed.
  • Much easier to test different parallel schedules.

For ML and DL

The main advantages are portability, SIMD optimisation opportunities, and computation fusion to avoid intermediate allocations and results.
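
To illustrate what fusion buys, here is a minimal Nim sketch (illustration only, not Halide code): evaluating `relu(a + b)` eagerly allocates an intermediate for `a + b` and traverses memory twice, while the fused version walks the data once with no temporary. Halide can derive such fused loops across function boundaries from the schedule.

import std/sequtils

proc unfusedReluAdd(a, b: seq[float32]): seq[float32] =
  let tmp = zip(a, b).mapIt(it[0] + it[1])  # intermediate allocation
  tmp.mapIt(max(it, 0'f32))                 # second pass over memory

proc fusedReluAdd(a, b: seq[float32]): seq[float32] =
  result = newSeq[float32](a.len)
  for i in 0 ..< a.len:                     # single pass, no temporary
    result[i] = max(a[i] + b[i], 0'f32)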

Parallelism in ML/DL is either straightforward (it's a single loop) or very complex and would need manual involvement:

  • ~~For example Halide is memory-bound on matrix multiplication (0.5-1 TFLOP/s at most while I can reach 1.8 TFLOP/s with OpenBLAS or MKL-DNN)~~ Perf reporting in Halide is confusing; I revamped it and it reaches MKL perf on single-socket CPUs.
  • Halide is unsuited for tree ensembles like random forests

Some DL layers that would benefit from Halide-like functionality (a pooling sketch follows the list):

  • Convolution (apparently Halide has a schedule that reaches MKL-DNN performance)
  • Locally-Connected Layer/Unshared Convolution (#255). Though this might be faster with a summed-area table/integral image and then a simple for loop.
  • Bilinear
  • Pooling layers (Max, Average, Region of Interest Pooling)
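
As a concrete example of the kind of kernel meant above, here is 2x2 average pooling written as plain Nim loops. In a Halide-like DSL the algorithm would be a one-line definition, and the loop order, tiling and vectorization would live in a separate schedule. Names and layout (row-major, single channel) are illustrative assumptions.

proc avgPool2x2(src: seq[float32], w, h: int): seq[float32] =
  ## 2x2 average pooling, stride 2, on a row-major single-channel image
  let ow = w div 2
  let oh = h div 2
  result = newSeq[float32](ow * oh)
  for y in 0 ..< oh:
    for x in 0 ..< ow:
      result[y*ow + x] = 0.25'f32 * (
        src[(2*y)*w + 2*x]   + src[(2*y)*w + 2*x + 1] +
        src[(2*y+1)*w + 2*x] + src[(2*y+1)*w + 2*x + 1])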

Disadvantages

  1. Halide is pretty much a static graph compiler, like Tensorflow XLA, which is not that great for research and debugging. It's a pain to use Tensorflow for NLP, RNNs and GANs, for example.

    The main issue is the lack of control flow like for loop and if/else branches.

Note that OpenAI Dali (not to be confused with Nvidia DALI) apparently is a compiler that plugs into dynamic graph frameworks.

  2. Another disadvantage is the LLVM dependency at runtime, but that only concerns JIT. LLVM is a huge library.

Halide papers

  • Jonathan Ragan-Kelley thesis: https://people.csail.mit.edu/jrk/jrkthesis.pdf
  • Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines: http://people.csail.mit.edu/jrk/halide-pldi13.pdf
  • Compiling high performance recursive filters: http://www-sop.inria.fr/reves/Basilic/2015/CRPDD15a/recfilter.pdf
  • CVPR slides - recursive filtering: https://halide-lang.org/assets/lectures/CVPR_Halide_RecursiveFiltering.pdf
  • Parallel Associative Reduction in Halide: https://andrew.adams.pub/par.pdf
  • Automatically Scheduling Halide Image Processing Pipelines: http://graphics.cs.cmu.edu/projects/halidesched/mullapudi16_halidesched.pdf

Beyond Halide

Halide has been expanded with autodiff capabilities in gradient-halide, which may greatly enhance autograd, a recurrent source of slowness in Arraymancer due to the need for a DAG, which doesn't play well with reference counting. However, while gradient-halide results are promising, TVM (a fork of Halide for DL) reported performance issues.

Related projects

  • Tensorflow XLA: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/g3doc/overview.md
  • Tensorflow MLIR: https://github.com/tensorflow/mlir
  • PyTorch Glow: https://github.com/pytorch/glow
  • Dask: https://dask.org
  • Gradient Halide: https://people.csail.mit.edu/tzumao/gradient_halide/
  • OpenAI Dali (can't find the repo): http://learningsys.org/nips18/assets/papers/100CameraReadySubmissionDali__Lazy_Compilation_of_Dynamic_Computation_Graphs.pdf
  • DMLC TVM: https://github.com/dmlc/tvm
  • Intel PlaidML: https://github.com/plaidml/plaidml
  • Intel Nervana Ngraph: https://github.com/NervanaSystems/ngraph/tree/master/python
  • The Tensor Algebra Compiler: https://github.com/tensor-compiler/taco
  • Numba: http://numba.pydata.org (less related as it doesn't compile Expression Graphs)
  • PyGPU (2006): An embedded DSL for high speed image processing
  • Stanford's Delite and OptiML (2011)
  • DLVM Deep Learning Virtual Machine (2017 - Authors migrated to Swift for Tensorflow)
    • https://arxiv.org/pdf/1711.03016.pdf
    • https://github.com/dlvm-team/dlvm-core
  • TriNNity:
    • https://www.scss.tcd.ie/~andersan/projects/live/triNNity.html
    • https://bitbucket.org/STG-TCD/trinnity/src/master/
  • Graphi, scheduling deep learning computation graphs on manycore CPU:
    • https://arxiv.org/abs/1807.09667
  • CGT: Computational Graph Toolkit (computation graph and autodiff on multidimensional arrays algorithm, with C and Python API):
    • https://github.com/joschu/cgt/
    • http://rll.berkeley.edu/cgt/
  • PolyMage: Automatic Optimization for image processing pipelines:
    • Note:
      • PolyMage uses a Python DSL that compiles to C++, including OpenMP and ivdep pragmas
      • It uses polyhedral optimization with several tiling algorithms (notably parallelogram and overlapped)
      • The author later refactored the Halide auto-scheduler
    • http://mcl.csa.iisc.ac.in/polymage/asplos235-mullapudi.pdf
    • http://mcl.csa.iisc.ernet.in/downloads/slides/PolyMage.pdf
    • http://mcl.csa.iisc.ac.in/polymage.html
    • https://bitbucket.org/udayb/polymage/
  • SMIL / FREIA: A Dynamic to Static DSL Compiler for Image Processing Applications
    • https://www.cri.ensmp.fr/classement/doc/E-398-slides.pdf
    • https://www.cri.mines-paristech.fr/classement/doc/A-640.pdf
    • https://www.cri.mines-paristech.fr/classement/doc/A-670.pdf
  • DeepDSL (Scala): A Compilation-based Domain-Specific Language for Deep Learning
    • https://github.com/deepdsl/deepdsl
    • https://openreview.net/pdf?id=Bks8cPcxe

Auto-scheduling - Polyhedral tiling

  • The Promises of Hybrid Hexagonal/Classical Tiling for GPU: https://hal.inria.fr/hal-00848691/document
  • Applying the Polyhedral Model to Tile Loops in Devito: https://pdfs.semanticscholar.org/d0bd/7c5f51c80c3236e99009298734291745e2e7.pdf
  • Jie Zhao:
    • A Combined Language and Polyhedral Approach for Heterogeneous Parallelism: https://tel.archives-ouvertes.fr/tel-01988073/file/thesis.pdf
    • A General Purpose Automatic Overlapped Tiling Technique in Polyhedral Frameworks: https://www.di.ens.fr/~zhaojie/cgo2018-poster-presentation
  • Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code.
    • Note: supports RNN
    • https://arxiv.org/pdf/1804.10694.pdf
    • https://andreask.cs.illinois.edu/cs598apk-f18/talks/njchris2.pdf
    • http://tiramisu-compiler.org
    • https://github.com/Tiramisu-Compiler/tiramisu

Approaches to porting

Wrapping

I'm not really a fan of wrapping unless we can do away with CMake and completely configure everything via Nim. This is the approach I took for the Arcade Learning Environment for reinforcement learning, with a Nim build script that basically replaces CMake; the DLL built can be called from another Nim program: https://github.com/numforge/agent-smith/blob/master/third_party/ale_wrap.nim. This preserves C++ name mangling, puts everything in nimcache and can be cleanly made into a nimble task.

The alternative is something like nimtorch, which requires either a build script (conda in NimTorch's case), potentially a Docker image, or building the library separately and then copying it.

Once wrapped, we need a way for Nim to generate valid Halide DSL code; this means removing all the fluff Nim adds to a C/C++ file. Interestingly, this is a very similar problem to generating a Cuda kernel as done in Cudanim, and it would also apply to the OpenCL code generator ideas from 2014, 2015 and 2016.
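
To make this concrete, here is a hedged sketch of one possible emission strategy: represent a small expression tree in Nim and render it to a Halide C++ fragment. `HExpr` and `render` are hypothetical names; a real implementation would build such trees from the Nim AST inside a macro instead of by hand.

import std/[sequtils, strutils]

type
  HExprKind = enum hVar, hCall, hAdd, hDiv
  HExpr = ref object
    case kind: HExprKind
    of hVar: name: string
    of hCall:
      fn: string
      args: seq[HExpr]
    of hAdd, hDiv:
      lhs, rhs: HExpr

proc render(e: HExpr): string =
  ## Renders the tree as a C++/Halide expression fragment
  case e.kind
  of hVar: e.name
  of hCall: e.fn & "(" & e.args.mapIt(it.render).join(", ") & ")"
  of hAdd: "(" & render(e.lhs) & " + " & render(e.rhs) & ")"
  of hDiv: "(" & render(e.lhs) & " / " & render(e.rhs) & ")"

let x = HExpr(kind: hVar, name: "x")
let y = HExpr(kind: hVar, name: "y")
let inputXY = HExpr(kind: hCall, fn: "input", args: @[x, y])
echo render(HExpr(kind: hDiv,
  lhs: HExpr(kind: hAdd, lhs: inputXY, rhs: inputXY),
  rhs: HExpr(kind: hVar, name: "3")))
# prints: ((input(x, y) + input(x, y)) / 3)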

Reimplementing from scratch

As said in the beginning, from a learning perspective it's the most interesting option, though the time-to-market and the testing (+ fuzzing) would probably be years away.

Interestingly, it seems like Halide started as FImage and had its own assembler. The IR already looked quite similar to the current one after only 3 months: https://github.com/halide/Halide/tree/5a2f90b7bcf4d9298e4068698f52731401e07dc8.

There are 3 big parts to Halide from an engineering perspective:

  • The frontend, which converts the DSL to Halide IR.
  • A lowering step from Halide IR to LLVM IR.
  • The backend which uses LLVM to generate code for the different targets.

Let's skip the front-end, which has to be implemented in any case, and discuss the backend.

Intermediate representation

For the IR there are several choices: Halide IR, Glow IR, TVM IR, SPIR-V or LLVM IR directly
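
Whichever we pick, at its core an IR is just a typed tree/graph. A minimal sketch (assumed names, vastly simplified compared to Halide IR) of what a home-grown IR could look like in Nim:

type
  IrKind = enum irConst, irLoad, irBinOp, irFor
  BinOp = enum opAdd, opMul, opDiv
  IrNode = ref object
    case kind: IrKind
    of irConst:
      value: float64
    of irLoad:               # read buffer[index]
      buffer: string
      index: IrNode
    of irBinOp:
      op: BinOp
      a, b: IrNode
    of irFor:                # for loopVar in lo ..< hi: body
      loopVar: string
      lo, hi: IrNode
      body: IrNode

Lowering would then be a recursive walk over `IrNode` emitting LLVM IR, SPIR-V or C source, and scheduling transformations (tiling, vectorizing) become rewrites of the `irFor` nests.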

LLVM backend

Assuming we want to support lots of frameworks, I don't see any alternative to LLVM for the backend, from a license, platform and maturity point of view. This comes with its own costs, notably compilation overhead and the space LLVM takes.

Regarding the API, I'm not yet sure which one Halide uses, but LLVM offers 2 (see the sketch after this list):

  • The stable MCJIT: https://llvm.org/docs/MCJITDesignAndImplementation.html
  • The new Orc JIT: https://llvm.org/docs/tutorial/BuildingAJIT1.html
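
With either JIT, the IR can be built fully in memory through the LLVM-C API, no textual IR involved. A hedged sketch from Nim (the dynlib name is an assumption and varies per platform/LLVM version; the entry points are standard LLVM-C ones):

const llvmLib = "libLLVM.so"  # assumption: adjust per platform / LLVM version

type
  ModuleRef = distinct pointer
  TypeRef = distinct pointer
  ValueRef = distinct pointer
  BasicBlockRef = distinct pointer
  BuilderRef = distinct pointer

{.pragma: llvm, importc, cdecl, dynlib: llvmLib.}

proc LLVMModuleCreateWithName(name: cstring): ModuleRef {.llvm.}
proc LLVMInt64Type(): TypeRef {.llvm.}
proc LLVMFunctionType(ret: TypeRef, params: ptr TypeRef,
                      count: cuint, isVarArg: cint): TypeRef {.llvm.}
proc LLVMAddFunction(m: ModuleRef, name: cstring, ty: TypeRef): ValueRef {.llvm.}
proc LLVMAppendBasicBlock(fn: ValueRef, name: cstring): BasicBlockRef {.llvm.}
proc LLVMCreateBuilder(): BuilderRef {.llvm.}
proc LLVMPositionBuilderAtEnd(b: BuilderRef, bb: BasicBlockRef) {.llvm.}
proc LLVMGetParam(fn: ValueRef, idx: cuint): ValueRef {.llvm.}
proc LLVMBuildAdd(b: BuilderRef, lhs, rhs: ValueRef, name: cstring): ValueRef {.llvm.}
proc LLVMBuildRet(b: BuilderRef, v: ValueRef): ValueRef {.llvm.}
proc LLVMPrintModuleToString(m: ModuleRef): cstring {.llvm.}

# Build `add(a, b: int64): int64 = a + b` as LLVM IR, in memory:
let module = LLVMModuleCreateWithName("jit_demo")
var params = [LLVMInt64Type(), LLVMInt64Type()]
let fnTy = LLVMFunctionType(LLVMInt64Type(), addr params[0], 2, 0)
let fn = LLVMAddFunction(module, "add", fnTy)
let builder = LLVMCreateBuilder()
LLVMPositionBuilderAtEnd(builder, LLVMAppendBasicBlock(fn, "entry"))
discard LLVMBuildRet(builder,
  LLVMBuildAdd(builder, LLVMGetParam(fn, 0), LLVMGetParam(fn, 1), "sum"))
echo LLVMPrintModuleToString(module)  # MCJIT or Orc would then execute `module`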
Written from scratch backends

JIT backend

Assuming we target a couple of specific platforms that share a lot of similarity in the programming model (for example x86, ARM, Cuda, ROCm), this is also possible. Furthermore, as you said, using Nim offers a tremendous metaprogramming advantage, as evidenced by the very clean way I could implement my own JIT assembler in Laser:

https://github.com/numforge/laser/blob/56df4643530ada0a01d29116feb626d93ac11379/laser/photon_jit/x86_64/x86_64_ops.nim

op_generator:
  op MOV: # MOV(dst, src) load/copy src into destination
    ## Copy 64-bit register content to another register
    [dst64, src64]: [rex(w=1), 0x89, modrm(Direct, reg = src64, rm = dst64)]
    ## Copy 32-bit register content to another register
    [dst32, src32]: [          0x89, modrm(Direct, reg = src32, rm = dst32)]
    ## Copy 16-bit register content to another register
    [dst16, src16]: [    0x66, 0x89, modrm(Direct, reg = src16, rm = dst16)]

    ## Copy  8-bit register content to another register
    [dst8,  src8]:  [          0x88, modrm(Direct, reg = src8, rm = dst8)]

    ## Copy 64-bit immediate value into register
    [dst64, imm64]: [rex(w=1), 0xB8 + dst64] & imm64
    ## Copy 32-bit immediate value into register
    [dst64, imm32]: [          0xB8 + dst64] & imm32
    ## Copy 16-bit immediate value into register
    [dst64, imm16]: [    0x66, 0xB8 + dst64] & imm16

    ## Copy 32-bit immediate value into register
    [dst32, imm32]: [          0xB8 + dst32] & imm32
    ## Copy 16-bit immediate value into register
    [dst32, imm16]: [    0x66, 0xB8 + dst32] & imm16

    ## Copy 16-bit immediate value into register
    [dst16, imm16]: [    0x66, 0xB8 + dst16] & imm16
    ## Copy  8-bit immediate value into register
    [dst8,  imm8]:  [          0xB0 + dst8, imm8]

  op LEA:
    ## Load effective address of the target label into a register
    [dst64, label]: [rex(w=1), 0x8D, modrm(Direct, reg = dst64, rm = rbp)]

  op CMP:
    ## Compare 32-bit immediate with 32-bit int at memory location stored in adr register
    [adr, imm64]: [ rex(w=1), 0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm64
    ## Compare 32-bit immediate with 32-bit int at memory location stored in adr register
    [adr, imm32]: [           0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm32
    ## Compare 16-bit immediate with 16-bit int at memory location stored in adr register
    [adr, imm16]: [     0x66, 0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm16
    ## Compare 8-bit immediate with byte at memory location stored in adr register
    [adr, imm8]:  [           0x80, modrm(Indirect, opcode_ext = 7, rm = adr[0]), imm8]

  op JZ:
    ## Jump to label if zero flag is set
    [label]: [0x0F, 0x84]
  op JNZ:
    ## Jump to label if zero flag is not set
    [label]: [0x0F, 0x85]

JIT is the approach taken for x86-only libs:

  • Intel libxsmm - JIT from scratch: https://github.com/hfp/libxsmm
  • MKLDNN - Xbyak JIT: https://github.com/intel/mkl-dnn
  • PyTorch FBGEMM - asmjit

There are significant challenges though:

  • Making it robust (needs fuzzing)
  • Caching (for reference, Intel and AMD still don't cache their JIT-ted OpenCL kernels)
  • Thread safety
  • Register allocation
  • Optimization

AOT backend

We can also do ahead-of-time compilation with runtime dispatch depending on the target. This is the approach I take with Laser GEMM, with a version for SSE, SSE2, SSE4.1, AVX, AVX+FMA, AVX2, AVX512, ..., and we can add Cuda, OpenCL, etc.
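
A minimal sketch of the dispatch mechanism (names are assumptions; `hasAvx2` stands in for real CPUID-based feature detection): each ISA variant is compiled ahead of time, and a function pointer is selected once at startup.

type KernelProc = proc(dst, a, b: ptr UncheckedArray[float32], n: int) {.nimcall.}

proc vecAddBaseline(dst, a, b: ptr UncheckedArray[float32], n: int) {.nimcall.} =
  for i in 0 ..< n:              # portable fallback
    dst[i] = a[i] + b[i]

proc vecAddAvx2(dst, a, b: ptr UncheckedArray[float32], n: int) {.nimcall.} =
  # in practice compiled in its own module with -mavx2
  for i in 0 ..< n:
    dst[i] = a[i] + b[i]

proc hasAvx2(): bool = false     # placeholder for CPUID feature detection

let vecAdd: KernelProc = (if hasAvx2(): vecAddAvx2 else: vecAddBaseline)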

mratsim (Jan 31 '19)

Note that we can generate LLVM IR from Nim (or from a DSL, if we have a macro mapping the DSL to Nim); this demo shows how: https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0127.nim

  • the DSL could be either pure Nim:
DSLtoLLVMIR:
  proc entryPoint() {.exportc.} =
    for i in 0..<10:
      echo i*i
  • or something custom for image processing, as you were referring to, e.g.:
DSLtoLLVMIR: # front-end that converts DSL to nim is needed here
  forEach x in a, y in b, z in c:
    z = x + y

timotheecour (Feb 01 '19)

I find the need to go through a text file quite clunky and slow in the long term.

Since LLVM has a bitcode representation of its IR, I was thinking of targeting that directly using bindings akin to LLVM-D.

This is also what Halide is doing.

From an interface point of view, I can probably re-use the JIT Assembler for x86-64 I created here, except that instead of assembling x86 I'd be assembling LLVM IR (Usage).

It can be restricted to the subset valid for Nvidia targets.

Alternatively, it seems like we may be able to directly JIT an AST representation without having to go through an LLVM IR phase.
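
One cheap variant of that idea, as a hedged sketch: compile the AST to nested Nim closures, one step short of emitting machine code, which already removes tree-walking dispatch from the inner loop. All names are illustrative.

type
  AstKind = enum akConst, akVar, akAdd, akMul
  Ast = ref object
    case kind: AstKind
    of akConst: value: float64
    of akVar: idx: int                       # index into the environment
    of akAdd, akMul: a, b: Ast

  Compiled = proc(env: seq[float64]): float64

proc compile(n: Ast): Compiled =
  case n.kind
  of akConst:
    let v = n.value
    result = proc(env: seq[float64]): float64 = v
  of akVar:
    let i = n.idx
    result = proc(env: seq[float64]): float64 = env[i]
  of akAdd:
    let (fa, fb) = (compile(n.a), compile(n.b))
    result = proc(env: seq[float64]): float64 = fa(env) + fb(env)
  of akMul:
    let (fa, fb) = (compile(n.a), compile(n.b))
    result = proc(env: seq[float64]): float64 = fa(env) * fb(env)

# (x + 2) * y with env = @[x, y]
let f = compile(Ast(kind: akMul,
  a: Ast(kind: akAdd, a: Ast(kind: akVar, idx: 0),
         b: Ast(kind: akConst, value: 2.0)),
  b: Ast(kind: akVar, idx: 1)))
echo f(@[3.0, 4.0])  # 20.0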

mratsim (Feb 01 '19)

Some advances in my research (will find a place to store that in the future, probably in Laser)

Code generation

  • Kai Nacke, maintainer of LDC (LLVM D Compiler), did a very nice talk about what he learned by implementing his own compiler from scratch at FOSDEM 2019. Slides.

Alternative HPC compiler

  • http://okmij.org/ftp/meta-programming/HPC.html and the Shonan challenge: several HPC problems with known hand-tuned solutions that would be nice to optimize automatically:
    • Paper - test cases: http://okmij.org/ftp/meta-programming/Shonan-challenge.pdf
    • Report: http://okmij.org/ftp/meta-programming/shonan2-report.pdf
    • 1D Conv optimisation in OCaml: http://okmij.org/ftp/meta-programming/lift/convolution.ml

Semantic pass and the expression problem

Halide and PyTorch Glow use the visitor pattern, which looks to me like a workaround for an insufficiently expressive language.

We need to be able to extend functions without touching the core ones and if possible data types as well.

For example we could have the functions Add and Mul defined on Tensor and CudaTensor today and a user library implement Div and also extend Add and Mul on VulkanTensor.

Data types are probably not that critical, i.e. we could use ADTs + Generics for Tensor[float32], VulkanTensor[float16], ...
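
For illustration, a hedged sketch of how Nim's open proc overloading sidesteps the visitor pattern (type and proc names are illustrative, not Arraymancer's actual API): a user library adds both a new backend and a new op without touching the core module.

type
  Tensor[T] = object            # core library type
    data: seq[T]
  VulkanTensor[T] = object      # defined in a user library
    handle: int

# core library
proc add[T](a, b: Tensor[T]): Tensor[T] =
  result.data = newSeq[T](a.data.len)
  for i in 0 ..< a.data.len:
    result.data[i] = a.data[i] + b.data[i]

# user library: extends `add` to a new backend, no core change needed
proc add[T](a, b: VulkanTensor[T]): VulkanTensor[T] =
  discard # would dispatch a Vulkan compute kernel here

# user library: a brand-new op on the core type
proc divide[T: SomeFloat](a, b: Tensor[T]): Tensor[T] =
  result.data = newSeq[T](a.data.len)
  for i in 0 ..< a.data.len:
    result.data[i] = a.data[i] / b.data[i]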

In research, there are several approaches to handle that:

mratsim (Feb 06 '19)