Rename `Assignments.to_low_level` to `reference_compile`, and introduce `cpu_friendly_compile` (later also `cuda_friendly_compile`)
Implement as many optimizations as reasonable from these posts:
- Fast Multidimensional Matrix Multiplication on CPU from Scratch
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
The first belongs in `Assignments.cpu_friendly_compile`, the second in `Assignments.cuda_friendly_compile`.
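To make the target concrete, a minimal illustration of the kind of rewrite involved (assumed code, not taken from either post): reordering the naive i-j-k matmul loops to i-k-j so the innermost loop walks both `B` and `C` row-wise.

```python
import numpy as np

def matmul_ijk(A, B, C):
    # Naive order: the innermost k loop strides down a column of B,
    # touching a new cache line on every iteration.
    n, m, p = A.shape[0], A.shape[1], B.shape[1]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]

def matmul_ikj(A, B, C):
    # Reordered: the innermost j loop reads B[k, :] and writes C[i, :]
    # contiguously, while A[i, k] stays fixed.
    n, m, p = A.shape[0], A.shape[1], B.shape[1]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                C[i, j] += A[i, k] * B[k, j]

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
C = np.zeros((64, 64))
matmul_ikj(A, B, C)
assert np.allclose(C, A @ B)
```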
The optimization to start with is reordering the iteration (i.e. the nesting of the resulting for loops), for example to maximize this lexicographic preference: the number of arrays whose rightmost axis is indexed by the innermost iterator; then whose rightmost axis is indexed by the next-to-innermost iterator; then whose next-to-rightmost axis is indexed by the innermost iterator; then whose next-to-rightmost axis is indexed by the next-to-innermost iterator; and so on. A sketch of this scoring follows.
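A minimal sketch of that preference as a comparison key, assuming each array access can be summarized as a tuple of loop variables (leftmost to rightmost axis); `preference_key` and the access-tuple representation are assumptions, not existing code.

```python
from itertools import permutations

def preference_key(order: tuple[str, ...], accesses: list[tuple[str, ...]]) -> tuple[int, ...]:
    """Score a loop order; compare keys lexicographically (higher is better).

    key[0]: arrays whose rightmost axis is indexed by the innermost loop,
    key[1]: arrays whose rightmost axis is indexed by the next-to-innermost loop,
    key[2]: arrays whose next-to-rightmost axis is indexed by the innermost loop,
    ... walking axes from the right and loops from the inside out.
    """
    key = []
    for axis_from_right in range(max(len(a) for a in accesses)):
        for loop_from_inside in range(len(order)):
            loop_var = order[-1 - loop_from_inside]
            key.append(sum(
                1 for a in accesses
                if len(a) > axis_from_right and a[-1 - axis_from_right] == loop_var
            ))
    return tuple(key)

# C[i, j] += A[i, k] * B[k, j]: access tuples for C, A, B.
accesses = [("i", "j"), ("i", "k"), ("k", "j")]
best = max(permutations(("i", "j", "k")), key=lambda o: preference_key(o, accesses))
print(best)  # ('i', 'k', 'j'): j innermost indexes the rightmost axis of C and B
```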
Some optimizations will require knowing the properties of the `Ops.binary_op` (and `Ops.unary_op`) involved, e.g. associativity, commutativity, distributivity (one op distributes over another). The properties actually needed should be represented directly in `Ops`; see the sketch below.
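A minimal sketch of representing those properties as data alongside the ops; the enum members and table names here are hypothetical, not the existing `Ops` definitions.

```python
from enum import Enum, auto

class BinaryOp(Enum):
    ADD = auto()
    MUL = auto()
    MAX = auto()
    SUB = auto()

# Properties as queryable data, so optimization passes need no special cases.
ASSOCIATIVE = {BinaryOp.ADD, BinaryOp.MUL, BinaryOp.MAX}
COMMUTATIVE = {BinaryOp.ADD, BinaryOp.MUL, BinaryOp.MAX}
# DISTRIBUTES_OVER[op] = the ops that `op` distributes over:
# a*(b+c) == a*b + a*c, and a + max(b, c) == max(a+b, a+c).
DISTRIBUTES_OVER = {
    BinaryOp.MUL: {BinaryOp.ADD},
    BinaryOp.ADD: {BinaryOp.MAX},
}

def reduction_is_reorderable(op: BinaryOp) -> bool:
    # Tiling or splitting a reduction axis reorders the reduction,
    # which is only valid for associative (and here commutative) ops.
    return op in ASSOCIATIVE and op in COMMUTATIVE
```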
With the new nomenclature: `reference_lower`, `cpu_friendly_lower`, `cuda_friendly_lower`.
I don't think this is the right approach right now. There will be generic optimizations (reordering the loop nesting for data locality, tiling) that can start from an already lowered representation. If it does turn out to be easier to do everything in one pass, just pass a config to `to_low_level`. A sketch of both options follows.
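A minimal sketch of the two options, assuming a `LoopNest` stands in for the lowered representation; every name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class LoopNest:
    """Stand-in for the already lowered representation."""
    order: tuple[str, ...]                   # loop variables, outermost first
    tiles: dict[str, int] = field(default_factory=dict)

# Option A: generic optimizations as passes over the lowered representation.
Pass = Callable[[LoopNest], LoopNest]

def reorder_for_locality(nest: LoopNest) -> LoopNest:
    # Would apply the lexicographic preference sketched earlier; identity stub here.
    return nest

def make_tiler(sizes: dict[str, int]) -> Pass:
    def tiler(nest: LoopNest) -> LoopNest:
        # Only records tile sizes; the actual loop splitting is elided.
        return LoopNest(nest.order, {**nest.tiles, **sizes})
    return tiler

def cpu_friendly_lower(assignments) -> LoopNest:
    nest = assignments.reference_lower()     # start from the reference lowering
    for opt in (reorder_for_locality, make_tiler({"i": 64, "j": 64})):
        nest = opt(nest)
    return nest

# Option B, if a single pass turns out easier after all:
# thread a config through the existing lowering instead.
@dataclass
class LowerConfig:
    reorder_loops: bool = False
    tile_sizes: dict[str, int] = field(default_factory=dict)
```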