AMDMIGraphX Reuse inner buffers when generating kernels for blockwise reductions

Due to the use of __syncthreads in the reduce methods, registers are not reused. We can reuse them directly by assigning into an existing buffer with r.inner([](auto& y, auto x) { y = f(x); })(inner_buffer, input) instead of creating a new buffer with auto inner_buffer = r.inner([](auto x) { return f(x); })(input).
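
As a rough illustration of the difference between the two forms, here is a host-side sketch. The names inner_buffer, inner_return and inner_assign are hypothetical stand-ins made up for this example (they are not the actual r.inner from the kernels), and std::array stands in for the per-thread registers:

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the per-thread values handled by r.inner.
template <class T, std::size_t N>
using inner_buffer = std::array<T, N>;

// Returning form: every call materializes a fresh buffer for its result.
template <class F, class T, std::size_t N>
inner_buffer<T, N> inner_return(F f, const inner_buffer<T, N>& input)
{
    inner_buffer<T, N> result{};
    for(std::size_t i = 0; i < N; i++)
        result[i] = f(input[i]);
    return result;
}

// Assigning form: the caller passes the destination, so an existing
// buffer can be overwritten instead of introducing a new one.
template <class F, class T, std::size_t N>
void inner_assign(F f, inner_buffer<T, N>& output, const inner_buffer<T, N>& input)
{
    for(std::size_t i = 0; i < N; i++)
        f(output[i], input[i]);
}
```

With the assigning form, a second pointwise step can write into the buffer produced by the first one, e.g. inner_assign([](auto& y, auto x) { y = x + 1; }, buffer, input), which is the reuse this issue is after.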

We will need to do some form of graph coloring to reuse the registers when possible. We could lower the pointwise ops to a special inner op that takes an output buffer to write to. We can insert an allocation for this output buffer, and then graph coloring can select the buffers to reuse.

I am not sure of the best way to do the codegen. We could use some kind of nullary inner method for the allocate instructions, but I am not sure how it would get the size. Another way would be to make allocate a no-op, and then, if the output buffer instruction is an allocation, just generate an inner method that returns instead of assigning, although graph coloring might point to an allocation instruction instead of the pointwise instruction.

Jun 13 '24 03:06 pfultz2

Thinking about this more, we can reuse our memory_coloring pass and then just do some post-processing.

So first we would lower the pointwise operators to an inner_pointwise that takes an allocate instruction. The size of the allocation can just be the shape of the pointwise instruction (the size is not really important as long as each one is the same size).

Then we run memory_coloring, which turns the allocations into loads. The first inner_pointwise that writes to a given load is replaced with a pointwise, and later inner_pointwise instructions that write to the same load are updated to reference that first pointwise:

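// Maps a load offset to the first pointwise that writes to it
std::unordered_map<std::size_t, instruction_ref> load2ins;
for(auto ins : iterator_for(m))
{
    if(ins->name() != "gpu::inner_pointwise")
        continue;
    auto inputs = ins->inputs();
    // The output buffer (a load after memory_coloring) is the last input
    auto out = inputs.back();
    // The offset is an attribute of the load operator, not of the instruction_ref
    auto offset = out->get_operator().to_value()["offset"].to<std::size_t>();
    if(contains(load2ins, offset))
    {
        // Buffer already in use: write into the first pointwise's result
        auto i = load2ins[offset];
        inputs.back() = i;
        m.replace_instruction(ins, ins->get_operator(), inputs, ins->module_inputs());
    }
    else
    {
        // First use of this buffer: drop the output argument and return a new value
        load2ins[offset] = ins;
        inputs.pop_back();
        m.replace_instruction(ins, make_op("pointwise"), inputs, ins->module_inputs());
    }
}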

We will need to update the codegen to handle aliased variables. When an instruction aliases another, instead of generating a return variable (i.e., auto zn = f(...)), the codegen would generate the statement standalone (i.e., f(x)) and then update the mapping to refer to the original variable that is aliased.
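
To make the aliasing rule concrete, here is a toy sketch of the name mapping. It does not use the real MIGraphX codegen: toy_ins and generate are hypothetical names invented for this example, and instructions are modeled as plain indices into a vector:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified instruction; not a MIGraphX type.
struct toy_ins
{
    std::string function;            // name of the function to call
    std::vector<std::size_t> inputs; // indices of earlier instructions
    int alias = -1;                  // index of the aliased instruction, or -1 if none
};

// Emits one line of code per instruction.
std::vector<std::string> generate(const std::vector<toy_ins>& prog)
{
    std::vector<std::string> lines;
    std::unordered_map<std::size_t, std::string> names; // instruction index -> variable
    std::size_t counter = 0;
    for(std::size_t i = 0; i < prog.size(); i++)
    {
        const auto& ins = prog[i];
        std::string args;
        for(auto input : ins.inputs)
            args += (args.empty() ? "" : ", ") + names.at(input);
        if(ins.alias < 0)
        {
            // Normal case: the result gets a fresh return variable.
            names[i] = "z" + std::to_string(counter++);
            lines.push_back("auto " + names[i] + " = " + ins.function + "(" + args + ");");
        }
        else
        {
            // Aliasing case: emit the call standalone and point this
            // instruction at the variable it aliases.
            lines.push_back(ins.function + "(" + args + ");");
            names[i] = names.at(static_cast<std::size_t>(ins.alias));
        }
    }
    return lines;
}
```

Any later instruction that consumes the aliasing instruction is then generated against the original variable, so no new variable is introduced for the reused buffer.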

There is still one issue that would need to be solved. If two inner_pointwise instructions use different data types, then we shouldn't reuse the buffers. We could possibly fix this by running memory_coloring multiple times, once for each type.
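
A toy sketch of the per-type idea, assuming each buffer is described by its element type and a live range. This is a simple greedy slot assignment chosen for illustration, not the algorithm used by memory_coloring, and toy_buffer and color_per_type are hypothetical names:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of a buffer; not a MIGraphX type.
struct toy_buffer
{
    std::string type;  // element type name, e.g. "float"
    std::size_t start; // index of the instruction that writes it
    std::size_t end;   // index of its last use
};

// Assigns a slot id to each buffer. Buffers are assumed to be sorted by
// start. A slot is only reused by a buffer of the same type whose live
// range begins after the previous occupant's last use.
std::vector<std::size_t> color_per_type(const std::vector<toy_buffer>& buffers)
{
    struct slot_state
    {
        std::size_t id;
        std::size_t busy_until;
    };
    std::map<std::string, std::vector<slot_state>> slots; // one pool per type
    std::vector<std::size_t> result(buffers.size());
    std::size_t next_id = 0;
    for(std::size_t i = 0; i < buffers.size(); i++)
    {
        const auto& b    = buffers[i];
        auto& candidates = slots[b.type];
        bool reused      = false;
        for(auto& s : candidates)
        {
            if(s.busy_until < b.start)
            {
                s.busy_until = b.end;
                result[i]    = s.id;
                reused       = true;
                break;
            }
        }
        if(!reused)
        {
            candidates.push_back({next_id, b.end});
            result[i] = next_id++;
        }
    }
    return result;
}
```

Keeping a separate pool per type is the toy equivalent of running the coloring once for each type: a half buffer never lands in a slot that was first created for a float buffer.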

Jun 13 '24 15:06 pfultz2

Paul to check if this is still needed

Jul 25 '24 16:07 causten