
Parallel execution of gates

Shayan-P opened this issue 11 months ago

idea

Assume our GPU is capable of executing gate kernels in parallel (running each on a different stream), or that we have access to multiple GPUs whose computational power we want to utilize.

The limiting factor is that we currently run the gates of our circuit one by one, which rules out any parallelism across gates. However, if gate1 and gate2 do not touch any common memory, we can safely run them in parallel.

Here I suggest an algorithm to partition our circuit (consisting of $U_1, U_2, \ldots, U_n$) into $k$ batches such that the gates within each batch do not touch any shared memory, so all the gates in a batch can run in parallel. Effectively, applying $U_1, \ldots, U_n$ is equivalent to applying $Batch_1, Batch_2, \ldots, Batch_k$ one after another (with each batch executed in parallel).

algorithm

Consider the directed graph with an edge from $i$ to $j$ (for $i < j$) whenever $U_i$ and $U_j$ affect some common memory (in our case, each gate acts on some columns of the tableau, so we have to check whether they share a column).
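Concretely, the conflict check boils down to intersecting the sets of packed-column indices the two gates touch. A quick sketch (the helper names `columns` and `conflicts` are hypothetical; it reuses the package's `affectedqubits` and the internal `getbigindex` helper used below):

using QuantumClifford

# sketch: two gates conflict iff they touch a common packed column of the tableau
columns(::Type{T}, g) where {T<:Unsigned} =
    Set(QuantumClifford.getbigindex(T, q) for q in affectedqubits(g))
conflicts(::Type{T}, g1, g2) where {T<:Unsigned} =
    !isempty(intersect(columns(T, g1), columns(T, g2)))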

This directed graph is already topologically sorted, since every edge goes from a smaller index to a larger one. We just need to partition it into independent sets, which can trivially be done in $O(n)$ (see the sketch after the list):

  • While there is any node remaining:
      • Put the nodes that have no incoming edge into one batch, then delete them (together with their outgoing edges).
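As a standalone sketch of this peeling on an explicit graph (a hypothetical `peel_batches` helper, not part of the package; the actual implementation below avoids materializing the graph entirely):

# sketch: repeatedly peel off the nodes with no incoming edges.
# adj[i] lists the successors of node i; edges always go from a
# smaller to a larger index, so the loop terminates on a DAG.
function peel_batches(n::Int, adj::Vector{Vector{Int}})
    indeg = zeros(Int, n)
    for i in 1:n, j in adj[i]
        indeg[j] += 1
    end
    remaining = Set(1:n)
    batches = Vector{Vector{Int}}()
    while !isempty(remaining)
        batch = [i for i in remaining if indeg[i] == 0]
        for i in batch, j in adj[i]
            indeg[j] -= 1  # delete the batch's outgoing edges
        end
        setdiff!(remaining, batch)
        push!(batches, batch)
    end
    return batches
end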

The code below implements this idea. Note that it never builds the graph explicitly; instead, it tracks, per column, the batch of the last gate that touched it:

function partition_circuit(::Type{T}, circ) where {T <: Unsigned}
    # time complexity: O(gates * qubits/64) for UInt64

    # make sure we are working with compactified gates
    circ = if eltype(circ) <: QuantumClifford.CompactifiedGate
        circ
    else
        compactify_circuit(circ)
    end

    qmax = maximum((maximum(affectedqubits(g)) for g in circ))
    columns_cnt = QuantumClifford._div(T, qmax) + 1

    partitions = []
    partition_idx = []
    last_column_occupier = [0 for _ in 1:columns_cnt] # last_column_occupier[i] = batch id of the last gate that occupies column i

    for g in circ
        column_indices = [QuantumClifford.getbigindex(T, qbit) for qbit in affectedqubits(g)]
        # the gate must go in the batch right after the latest batch touching its columns
        my_partition_idx = maximum(last_column_occupier[column_indices]) + 1
        push!(partition_idx, my_partition_idx)
        if my_partition_idx > length(partitions)
            push!(partitions, [g])
        else
            push!(partitions[my_partition_idx], g)
        end
        last_column_occupier[column_indices] .= my_partition_idx
    end
    partitions
end
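For instance (a hypothetical invocation, using the symbolic gates QuantumClifford exports), qubits 1 and 100 fall in different 64-bit columns, so the two Hadamards share a batch while the CNOT, touching both columns, gets a batch of its own:

using QuantumClifford

circ = [sHadamard(1), sHadamard(100), sCNOT(1, 100)]
parts = partition_circuit(UInt64, circ)
@assert length(parts) == 2  # [H(1), H(100)] then [CNOT(1,100)]

Note the granularity: conflicts are detected per packed column, not per qubit, so with UInt64 gates on qubits 1 and 2 would be serialized as well, since they write to the same 64-bit words.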

And then we would use this partitioning like this:

function pftrajectories(state::PauliFrameGPU{T}, circuit) where {T <: Unsigned}
    for gates in partition_circuit(T, circuit)
        @sync begin
            for gate in gates
                @CUDA.async apply!(state, gate)
            end
        end
    end
    return state
end
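One caveat: for the kernels to actually overlap, each task generally needs its own non-blocking stream. A minimal sketch of that pattern, assuming CUDA.jl's task-local stream API and that apply! launches its kernel on the current task's stream:

using CUDA

function pftrajectories_streams(state, gates)
    @sync for gate in gates
        @async begin
            # give this task its own non-blocking stream so kernels can overlap
            CUDA.stream!(CuStream(; flags=CUDA.STREAM_NON_BLOCKING))
            apply!(state, gate)
            CUDA.synchronize()  # drain this task's stream before @sync returns
        end
    end
    return state
end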

issue

My GPU was not able to run two tasks in parallel (it would initiate a different stream for each async call, but eventually they would get run sequentially), so this showed no improvement.

Shayan-P, Aug 27 '23