Rename `Assignments.to_low_level` to `reference_compile`, and introduce `cpu_friendly_compile` (later also `cuda_friendly_compile`)
Implement as many optimizations as reasonable from these posts:
- Fast Multidimensional Matrix Multiplication on CPU from Scratch
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
The first belongs in `Assignments.cpu_friendly_compile`, the second in `Assignments.cuda_friendly_compile`.
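To make the target concrete, a minimal illustration of the kind of rewrite involved (assumed code, not taken from either post): reordering the naive i-j-k matmul loops to i-k-j so the innermost loop walks both `B` and `C` row-wise.

```python
import numpy as np

def matmul_ijk(A, B, C):
    # Naive order: the innermost k loop strides down a column of B,
    # touching a new cache line on every iteration.
    n, m, p = A.shape[0], A.shape[1], B.shape[1]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]

def matmul_ikj(A, B, C):
    # Reordered: the innermost j loop reads B[k, :] and writes C[i, :]
    # contiguously, while A[i, k] stays fixed.
    n, m, p = A.shape[0], A.shape[1], B.shape[1]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                C[i, j] += A[i, k] * B[k, j]

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
C = np.zeros((64, 64))
matmul_ikj(A, B, C)
assert np.allclose(C, A @ B)
```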
The optimization to start with is reordering the iteration (i.e. the nesting of the resulting for loops), for example to maximize this lexicographic preference: the number of arrays whose rightmost axis is indexed by the innermost iterator; then whose rightmost axis is indexed by the next-to-innermost iterator; then whose next-to-rightmost axis is indexed by the innermost iterator; then whose next-to-rightmost axis is indexed by the next-to-innermost iterator; and so on. A sketch of this scoring follows.
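A minimal sketch of that preference as a comparison key, assuming each array access can be summarized as a tuple of loop variables (leftmost to rightmost axis); `preference_key` and the access-tuple representation are assumptions, not existing code.

```python
from itertools import permutations

def preference_key(order: tuple[str, ...], accesses: list[tuple[str, ...]]) -> tuple[int, ...]:
    """Score a loop order; compare keys lexicographically (higher is better).

    key[0]: arrays whose rightmost axis is indexed by the innermost loop,
    key[1]: arrays whose rightmost axis is indexed by the next-to-innermost loop,
    key[2]: arrays whose next-to-rightmost axis is indexed by the innermost loop,
    ... walking axes from the right and loops from the inside out.
    """
    key = []
    for axis_from_right in range(max(len(a) for a in accesses)):
        for loop_from_inside in range(len(order)):
            loop_var = order[-1 - loop_from_inside]
            key.append(sum(
                1 for a in accesses
                if len(a) > axis_from_right and a[-1 - axis_from_right] == loop_var
            ))
    return tuple(key)

# C[i, j] += A[i, k] * B[k, j]: access tuples for C, A, B.
accesses = [("i", "j"), ("i", "k"), ("k", "j")]
best = max(permutations(("i", "j", "k")), key=lambda o: preference_key(o, accesses))
print(best)  # ('i', 'k', 'j'): j innermost indexes the rightmost axis of C and B
```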
Some optimizations will require knowing the properties of the `Ops.binary_op` (and `Ops.unary_op`) involved, e.g. associativity, commutativity, distributivity (one op distributes over another). The properties actually needed should be represented directly in `Ops`; see the sketch below.
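A minimal sketch of representing those properties as data alongside the ops; the enum members and table names here are hypothetical, not the existing `Ops` definitions.

```python
from enum import Enum, auto

class BinaryOp(Enum):
    ADD = auto()
    MUL = auto()
    MAX = auto()
    SUB = auto()

# Properties as queryable data, so optimization passes need no special cases.
ASSOCIATIVE = {BinaryOp.ADD, BinaryOp.MUL, BinaryOp.MAX}
COMMUTATIVE = {BinaryOp.ADD, BinaryOp.MUL, BinaryOp.MAX}
# DISTRIBUTES_OVER[op] = the ops that `op` distributes over:
# a*(b+c) == a*b + a*c, and a + max(b, c) == max(a+b, a+c).
DISTRIBUTES_OVER = {
    BinaryOp.MUL: {BinaryOp.ADD},
    BinaryOp.ADD: {BinaryOp.MAX},
}

def reduction_is_reorderable(op: BinaryOp) -> bool:
    # Tiling or splitting a reduction axis reorders the reduction,
    # which is only valid for associative (and here commutative) ops.
    return op in ASSOCIATIVE and op in COMMUTATIVE
```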
With the new nomenclature: `reference_lower`, `cpu_friendly_lower`, `cuda_friendly_lower`.
I don't think this is the right approach right now. There will be generic optimizations (reordering the loop nesting for data locality, tiling) that can start from an already lowered representation. If it does turn out to be easier to do everything in one pass, just pass a config to `to_low_level`. A sketch of both options follows.
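A minimal sketch of the two options, assuming a `LoopNest` stands in for the lowered representation; every name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class LoopNest:
    """Stand-in for the already lowered representation."""
    order: tuple[str, ...]                   # loop variables, outermost first
    tiles: dict[str, int] = field(default_factory=dict)

# Option A: generic optimizations as passes over the lowered representation.
Pass = Callable[[LoopNest], LoopNest]

def reorder_for_locality(nest: LoopNest) -> LoopNest:
    # Would apply the lexicographic preference sketched earlier; identity stub here.
    return nest

def make_tiler(sizes: dict[str, int]) -> Pass:
    def tiler(nest: LoopNest) -> LoopNest:
        # Only records tile sizes; the actual loop splitting is elided.
        return LoopNest(nest.order, {**nest.tiles, **sizes})
    return tiler

def cpu_friendly_lower(assignments) -> LoopNest:
    nest = assignments.reference_lower()     # start from the reference lowering
    for opt in (reorder_for_locality, make_tiler({"i": 64, "j": 64})):
        nest = opt(nest)
    return nest

# Option B, if a single pass turns out easier after all:
# thread a config through the existing lowering instead.
@dataclass
class LowerConfig:
    reorder_loops: bool = False
    tile_sizes: dict[str, int] = field(default_factory=dict)
```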