QuantumClifford.jl
Kernel Fusion for apply_batch!
On the GPU there is an overhead every time we launch a new kernel, and in the `pftrajectories` function we launch a kernel for every gate. In that per-gate GPU kernel, each thread takes care of one row of the Pauli frame, which suggests that fusing all those kernels into a single kernel could give a real improvement.
The idea is to have an `apply_row!` method implemented for each gate, so that the fused kernel can look like this:
```julia
function batch_kernel(xzs::AbstractMatrix{Tme}, phases::AbstractArray{Tmz}, gate_batch, rows::Unsigned, compute_phases::Bool) where {Tme <: Unsigned, Tmz <: Unsigned}
    r = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if r > rows
        return nothing
    end
    for gate in gate_batch
        apply_row!(xzs, phases, r, gate, compute_phases)
    end
    return nothing
end
```
The initial version is implemented here
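For context, here is roughly how such a fused kernel might be launched from the host with CUDA.jl. The wrapper name `apply_batch!` and the 256 threads-per-block choice are illustrative assumptions, not the actual implementation:

```julia
using CUDA

# Hedged sketch: launch the fused kernel with one thread per Pauli-frame row,
# passing the gates as a tuple so the compiler can specialize on them.
function apply_batch!(xzs::CuMatrix, phases::CuArray, gate_batch, rows; compute_phases=true)
    threads = 256
    blocks = cld(rows, threads)
    @cuda threads=threads blocks=blocks batch_kernel(xzs, phases, gate_batch, UInt(rows), compute_phases)
    return nothing
end
```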
Potential improvement
- It potentially improves the running time, since we launch a single kernel instead of `num_gates` kernels.
- `apply_row!` provides an abstraction over the CPU and GPU implementations: if we refactor the code in the future and have `apply_row!` for all gates (and make them `@inline`), we can use it on both CPU and GPU, so we do not have to duplicate code for every single gate in the GPU support. A hedged sketch of such a method is given right after this list.
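For illustration, this is what such an `apply_row!` method could look like for a single-qubit gate. The frame layout assumed here (one 64-bit chunk per frame, with `xzs[1, r]` holding the packed X bits and `xzs[2, r]` the packed Z bits of row `r`) is a simplification for the sketch, not the actual QuantumClifford.jl bit-packing:

```julia
# Hypothetical sketch of apply_row! for the Hadamard gate under a simplified
# single-chunk frame layout; the real memory layout is chunked differently.
@inline function apply_row!(xzs, phases, r, g::sHadamard, compute_phases)
    mask = one(eltype(xzs)) << (g.q - 1)  # bit selecting qubit g.q
    x = xzs[1, r] & mask
    z = xzs[2, r] & mask
    # Hadamard conjugation swaps the X and Z components on qubit g.q ...
    xzs[1, r] = (xzs[1, r] & ~mask) | z
    xzs[2, r] = (xzs[2, r] & ~mask) | x
    # ... and flips the sign when the qubit carries a Y (both bits set)
    if compute_phases && x != 0 && z != 0
        phases[r] ⊻= 0x2
    end
    return nothing
end
```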
Issue 1
The kernel has to be recompiled for every distinct type of `gate_batch`, and the number of such types grows exponentially with the batch size (roughly `gate_types ^ batch_length`). We will resolve this issue using SumTypes in the future.
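To see why the number of specializations explodes, note that every distinct sequence of gate types in the batch is a distinct tuple type:

```julia
using QuantumClifford  # for the sHadamard and sCNOT gate types

batch_a = (sHadamard(1), sCNOT(1, 2))
batch_b = (sCNOT(1, 2), sHadamard(1))
# Different tuple types, hence two separate kernel compilations:
typeof(batch_a) === typeof(batch_b)  # false
```

With `k` gate types and batches of length `n`, up to `k^n` such tuple types can occur; converting all gates to one concrete (sum) type makes every batch share a single type, so the kernel is compiled only once.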
Issue 2
GPU kernels do not allow dynamic indexing into, or iteration over, a heterogeneous tuple, so `for gate in gate_batch` produces the error below:
```
ERROR: InvalidIRError: compiling MethodInstance for batch_kernel resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
```
To work around this, we tried passing the gates as a nested tuple `Tuple{G1, Tuple{G2, Tuple{G3, ...}}}` and iterating over it like a linked list. This seems to work when the number of gates is less than 4, but beyond that it still produces the same error as above (`call to ijl_get_nth_field_checked`).
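For reference, the linked-list style iteration over the nested tuple can be written with recursive dispatch; `apply_list!` here is a hypothetical helper name, shown only to illustrate the approach that was tried:

```julia
# Hedged sketch: walk the nested tuple (g1, (g2, (g3, ()))) recursively so
# that no dynamic tuple indexing happens inside the kernel.
@inline apply_list!(xzs, phases, r, ::Tuple{}, compute_phases) = nothing
@inline function apply_list!(xzs, phases, r, gates::Tuple, compute_phases)
    gate, rest = gates  # head of the list and the remaining tail
    apply_row!(xzs, phases, r, gate, compute_phases)
    apply_list!(xzs, phases, r, rest, compute_phases)
    return nothing
end
```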
Currently, I am doing some benchmarking by creating a kernel for a big batch (say, 100 gates) that simply unrolls the per-gate calls, to see how much improvement it shows:
```julia
apply_row!(xzs, phases, r, gate_batch[1], compute_phases)
apply_row!(xzs, phases, r, gate_batch[2], compute_phases)
apply_row!(xzs, phases, r, gate_batch[3], compute_phases)
apply_row!(xzs, phases, r, gate_batch[4], compute_phases)
# ...
```
If that works, we can use generated functions to achieve the same unrolled result automatically.
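A hedged sketch of what such a generated function could look like (`apply_batch_rows!` is a hypothetical name; it emits exactly the unrolled calls shown above, with the number of calls read from the tuple type):

```julia
# Unroll one apply_row! call per gate in the (possibly heterogeneous) tuple.
@generated function apply_batch_rows!(xzs, phases, r, gate_batch::Tuple, compute_phases)
    n = length(gate_batch.parameters)  # number of gates, known from the tuple type
    calls = [:(apply_row!(xzs, phases, r, gate_batch[$i], compute_phases)) for i in 1:n]
    return quote
        $(calls...)
        return nothing
    end
end
```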
After the experiment, it turns out that the overhead of calling the `apply_row!` function is much larger than I expected.
The code is available here (the main function is `profile_batched_kernel`).
Note: I had to set the batch size to 200, because increasing it further grows the kernel (due to the unrolled code) and there is a limit on how large a kernel can be.
- Each batched kernel (a batch of 200 gates) takes about 7 milliseconds.
- Each single-gate kernel takes about 30 microseconds.
--
FIX
After adding `@inline` properly, the overhead of calling the `apply_row!` function disappeared entirely. Code here.
Now, calling `apply_single!` (which uses `apply_row!`) is effectively as fast as a normal kernel call, and batching makes it much faster: using batches of 200 improved the performance of the kernel by roughly a factor of 2.
- batched case: 110 ms
- normal case: 257 ms
Note: We have not used SumTypes (CompactifiedGates) here. Thus, for each batch of 200 gates a new kernel has to be compiled, which is terrible. For the next step, we should use SumTypes and compactified gates so that the kernel is compiled only once.
However, note that calling `apply_row!(state, compactified_gate)` constructs the actual gate and makes another `apply_row!` call. To avoid the overhead of that extra function call, both methods need to be inlined.
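For illustration, a hedged sketch of that two-level dispatch; the `CompactGate` struct, its tag values, and its field names are made up here and do not match the actual compactified gate representation:

```julia
# Hypothetical compact gate: a single concrete type encoding which gate to
# apply, so a Vector/NTuple of these has one fixed element type.
struct CompactGate
    tag::UInt8   # which gate this entry encodes
    q1::Int32    # first qubit
    q2::Int32    # second qubit (ignored for single-qubit gates)
end

# The outer method reconstructs the concrete gate and forwards to its
# specialized apply_row!; with both calls @inline the forwarding cost vanishes.
@inline function apply_row!(xzs, phases, r, g::CompactGate, compute_phases)
    if g.tag == 0x01
        apply_row!(xzs, phases, r, sHadamard(Int(g.q1)), compute_phases)
    elseif g.tag == 0x02
        apply_row!(xzs, phases, r, sCNOT(Int(g.q1), Int(g.q2)), compute_phases)
    end
    return nothing
end
```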