QuantumClifford.jl
Kernel Fusion for apply_batch!
On the GPU there is an overhead every time we launch a new kernel, and in the `pftrajectories` function we launch a kernel for every gate. In that per-gate GPU kernel, each thread takes care of one row of the Pauli frame, which suggests that fusing all those kernels into a single kernel could give a real improvement.
The idea is to have an `apply_row!` method implemented for each gate, so that the fused kernel can look like this:
```julia
function batch_kernel(xzs::AbstractMatrix{Tme}, phases::AbstractArray{Tmz}, gate_batch, rows::Unsigned, compute_phases::Bool) where {Tme <: Unsigned, Tmz <: Unsigned}
    r = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if r > rows
        return nothing
    end
    for gate in gate_batch
        apply_row!(xzs, phases, r, gate, compute_phases)
    end
    return nothing
end
```
The initial version is implemented here
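For context, here is roughly how such a fused kernel might be launched from the host with CUDA.jl. The wrapper name `apply_batch!` and the 256 threads-per-block choice are illustrative assumptions, not the actual implementation:

```julia
using CUDA

# Hedged sketch: launch the fused kernel with one thread per Pauli-frame row,
# passing the gates as a tuple so the compiler can specialize on them.
function apply_batch!(xzs::CuMatrix, phases::CuArray, gate_batch, rows; compute_phases=true)
    threads = 256
    blocks = cld(rows, threads)
    @cuda threads=threads blocks=blocks batch_kernel(xzs, phases, gate_batch, UInt(rows), compute_phases)
    return nothing
end
```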
Potential improvement
- It potentially improves the running time, since we launch a single kernel instead of `num_gates` kernels.
- `apply_row!` provides an abstraction over the CPU and GPU implementations: if we refactor the code in the future and have `apply_row!` for all gates (and make them `@inline`), we can use it on both CPU and GPU, so we do not have to duplicate code for every single gate in the GPU support. A hedged sketch of such a method is given right after this list.
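For illustration, this is what such an `apply_row!` method could look like for a single-qubit gate. The frame layout assumed here (one 64-bit chunk per frame, with `xzs[1, r]` holding the packed X bits and `xzs[2, r]` the packed Z bits of row `r`) is a simplification for the sketch, not the actual QuantumClifford.jl bit-packing:

```julia
# Hypothetical sketch of apply_row! for the Hadamard gate under a simplified
# single-chunk frame layout; the real memory layout is chunked differently.
@inline function apply_row!(xzs, phases, r, g::sHadamard, compute_phases)
    mask = one(eltype(xzs)) << (g.q - 1)  # bit selecting qubit g.q
    x = xzs[1, r] & mask
    z = xzs[2, r] & mask
    # Hadamard conjugation swaps the X and Z components on qubit g.q ...
    xzs[1, r] = (xzs[1, r] & ~mask) | z
    xzs[2, r] = (xzs[2, r] & ~mask) | x
    # ... and flips the sign when the qubit carries a Y (both bits set)
    if compute_phases && x != 0 && z != 0
        phases[r] ⊻= 0x2
    end
    return nothing
end
```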
Issue 1
The kernel has to be recompiled for every distinct type of `gate_batch`, and the number of such types grows exponentially with the batch size (roughly `gate_types ^ batch_length`). We will resolve this issue using SumTypes in the future.
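To see why the number of specializations explodes, note that every distinct sequence of gate types in the batch is a distinct tuple type:

```julia
using QuantumClifford  # for the sHadamard and sCNOT gate types

batch_a = (sHadamard(1), sCNOT(1, 2))
batch_b = (sCNOT(1, 2), sHadamard(1))
# Different tuple types, hence two separate kernel compilations:
typeof(batch_a) === typeof(batch_b)  # false
```

With `k` gate types and batches of length `n`, up to `k^n` such tuple types can occur; converting all gates to one concrete (sum) type makes every batch share a single type, so the kernel is compiled only once.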
Issue 2
GPU kernels do not allow dynamic indexing into, or iteration over, a heterogeneous tuple, so `for gate in gate_batch` produces the error below:
```
ERROR: InvalidIRError: compiling MethodInstance for batch_kernel resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
```
To work around this, we tried passing the gates as a nested tuple `Tuple{G1, Tuple{G2, Tuple{G3, ...}}}` and iterating over it like a linked list. This seems to work when the number of gates is less than 4, but beyond that it still produces the same error as above (`call to ijl_get_nth_field_checked`).
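For reference, the linked-list style iteration over the nested tuple can be written with recursive dispatch; `apply_list!` here is a hypothetical helper name, shown only to illustrate the approach that was tried:

```julia
# Hedged sketch: walk the nested tuple (g1, (g2, (g3, ()))) recursively so
# that no dynamic tuple indexing happens inside the kernel.
@inline apply_list!(xzs, phases, r, ::Tuple{}, compute_phases) = nothing
@inline function apply_list!(xzs, phases, r, gates::Tuple, compute_phases)
    gate, rest = gates  # head of the list and the remaining tail
    apply_row!(xzs, phases, r, gate, compute_phases)
    apply_list!(xzs, phases, r, rest, compute_phases)
    return nothing
end
```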
Currently, I am doing some benchmarking by creating a kernel for a big batch (say, 100 gates) that simply unrolls the per-gate calls, to see how much improvement it shows:
```julia
apply_row!(xzs, phases, r, gate_batch[1], compute_phases)
apply_row!(xzs, phases, r, gate_batch[2], compute_phases)
apply_row!(xzs, phases, r, gate_batch[3], compute_phases)
apply_row!(xzs, phases, r, gate_batch[4], compute_phases)
# ...
```
If that works, we can use generated functions to achieve the same unrolled result automatically.
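A hedged sketch of what such a generated function could look like (`apply_batch_rows!` is a hypothetical name; it emits exactly the unrolled calls shown above, with the number of calls read from the tuple type):

```julia
# Unroll one apply_row! call per gate in the (possibly heterogeneous) tuple.
@generated function apply_batch_rows!(xzs, phases, r, gate_batch::Tuple, compute_phases)
    n = length(gate_batch.parameters)  # number of gates, known from the tuple type
    calls = [:(apply_row!(xzs, phases, r, gate_batch[$i], compute_phases)) for i in 1:n]
    return quote
        $(calls...)
        return nothing
    end
end
```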
After the experiment, it turns out that the overhead of calling the `apply_row!` function is much larger than I expected.
The code is available here (the main function is `profile_batched_kernel`).
Note: I had to set the batch size to 200, because increasing it further grows the kernel (due to the unrolled code) and there is a limit on how large a kernel can be.
- Each batched kernel (a batch of 200 gates) takes about 7 milliseconds.
- Each single-gate kernel takes about 30 microseconds.
--
FIX
After adding `@inline` properly, the overhead of calling the `apply_row!` function disappeared entirely. Code here.
Now, calling `apply_single!` (which uses `apply_row!`) is effectively as fast as a normal kernel call, and batching makes it much faster: using batches of 200 improved the performance of the kernel by roughly a factor of 2.
- batched case: 110 ms
- normal case: 257 ms
Note: We have not used SumTypes (CompactifiedGates) here. Thus, for each batch of 200 gates a new kernel has to be compiled, which is terrible. For the next step, we should use SumTypes and compactified gates so that the kernel is compiled only once.
However, note that calling `apply_row!(state, compactified_gate)` constructs the actual gate and makes another `apply_row!` call. To avoid the overhead of that extra function call, both methods need to be inlined.
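For illustration, a hedged sketch of that two-level dispatch; the `CompactGate` struct, its tag values, and its field names are made up here and do not match the actual compactified gate representation:

```julia
# Hypothetical compact gate: a single concrete type encoding which gate to
# apply, so a Vector/NTuple of these has one fixed element type.
struct CompactGate
    tag::UInt8   # which gate this entry encodes
    q1::Int32    # first qubit
    q2::Int32    # second qubit (ignored for single-qubit gates)
end

# The outer method reconstructs the concrete gate and forwards to its
# specialized apply_row!; with both calls @inline the forwarding cost vanishes.
@inline function apply_row!(xzs, phases, r, g::CompactGate, compute_phases)
    if g.tag == 0x01
        apply_row!(xzs, phases, r, sHadamard(Int(g.q1)), compute_phases)
    elseif g.tag == 0x02
        apply_row!(xzs, phases, r, sCNOT(Int(g.q1), Int(g.q2)), compute_phases)
    end
    return nothing
end
```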