Improve detector computation
Is your feature request related to a problem? Please describe.
Right now, detector computation is by far the most costly part of the circuit generation pipeline. The image below summarizes the time taken by the different parts of the tqec library to generate the circuit for a logical CNOT at k=4, using the following benchmark script:
```python
import argparse
from pathlib import Path

import stim

from tqec.circuit.detectors.construction import annotate_detectors_automatically
from tqec.compile.compile import CompiledGraph, compile_block_graph
from tqec.compile.specs.library.css import CSS_SPEC_RULES
from tqec.compile.substitute import DEFAULT_SUBSTITUTION_RULES
from tqec.computation.block_graph import BlockGraph
from tqec.computation.zx_graph import ZXGraph
from tqec.noise_models import NoiseModel
from tqec.position import Position3D

BENCHMARK_FOLDER = Path(__file__).resolve().parent
TQEC_FOLDER = BENCHMARK_FOLDER.parent
ASSETS_FOLDER = TQEC_FOLDER / "assets"
CNOT_DAE_FILE = ASSETS_FOLDER / "logical_cnot.dae"


def create_block_graph(from_scratch: bool = False) -> BlockGraph:
    if not from_scratch:
        return BlockGraph.from_dae_file(CNOT_DAE_FILE)

    cnot_zx = ZXGraph("Logical CNOT ZX Graph")

    cnot_zx.add_z_node(Position3D(0, 0, 0))
    cnot_zx.add_x_node(Position3D(0, 0, 1))
    cnot_zx.add_z_node(Position3D(0, 0, 2))
    cnot_zx.add_z_node(Position3D(0, 0, 3))
    cnot_zx.add_x_node(Position3D(0, 1, 1))
    cnot_zx.add_z_node(Position3D(0, 1, 2))
    cnot_zx.add_z_node(Position3D(1, 1, 0))
    cnot_zx.add_z_node(Position3D(1, 1, 1))
    cnot_zx.add_z_node(Position3D(1, 1, 2))
    cnot_zx.add_z_node(Position3D(1, 1, 3))

    cnot_zx.add_edge(Position3D(0, 0, 0), Position3D(0, 0, 1))
    cnot_zx.add_edge(Position3D(0, 0, 1), Position3D(0, 0, 2))
    cnot_zx.add_edge(Position3D(0, 0, 2), Position3D(0, 0, 3))
    cnot_zx.add_edge(Position3D(0, 0, 1), Position3D(0, 1, 1))
    cnot_zx.add_edge(Position3D(0, 1, 1), Position3D(0, 1, 2))
    cnot_zx.add_edge(Position3D(0, 1, 2), Position3D(1, 1, 2))
    cnot_zx.add_edge(Position3D(1, 1, 0), Position3D(1, 1, 1))
    cnot_zx.add_edge(Position3D(1, 1, 1), Position3D(1, 1, 2))
    cnot_zx.add_edge(Position3D(1, 1, 2), Position3D(1, 1, 3))

    return cnot_zx.to_block_graph("Logical CNOT Block Graph")


def generate_stim_circuit(
    compiled_graph: CompiledGraph, k: int, p: float
) -> stim.Circuit:
    circuit_without_detectors = compiled_graph.generate_stim_circuit(
        k, noise_model=NoiseModel.uniform_depolarizing(p)
    )
    # For now, we annotate the detectors as a post-processing step.
    return annotate_detectors_automatically(circuit_without_detectors)


def generate_cnot_circuits(*ks: int):
    # 1. Create the `BlockGraph` representing the computation.
    block_graph = create_block_graph(from_scratch=False)

    # 2. (Optional) Find and choose the logical observables.
    observables = block_graph.get_abstract_observables()

    # 3. Compile the `BlockGraph`.
    compiled_graph = compile_block_graph(
        block_graph,
        spec_rules=CSS_SPEC_RULES,
        substitute_rules=DEFAULT_SUBSTITUTION_RULES,
        observables=[observables[1]],
    )

    for k in ks:
        _ = generate_stim_circuit(compiled_graph, k, 0.001)


def main():
    parser = argparse.ArgumentParser(description="")
    parser.add_argument(
        "-k",
        help="The scale factors applied to the circuits.",
        nargs="+",
        type=int,
        required=True,
    )
    args = parser.parse_args()
    generate_cnot_circuits(*args.k)


if __name__ == "__main__":
    main()
```
Automatic detector annotation takes nearly 90% of the total time, and it scales poorly with `k`, so it ends up accounting for even more than 90% for larger values of `k`.
While this might be partially due to inefficiencies in the detector computation itself (e.g., using our own Python-based `PauliString` implementation instead of a more efficient one such as `stim.PauliString`), the main reason for this poor performance is how we actually call the detector computation routine: we call it on the final, fully generated circuit.
The main drawback of using the final circuit is that detector computation is performed on all the qubits used, which in turn implies that very large `stim.Tableau` instances are generated. Even though this data structure is efficient both in theory (operations that scale at most polynomially in the number of qubits) and in practice (an implementation written with high performance in mind), detector computation does not scale linearly with the number of qubits: storing an n-qubit stabilizer state as a `stim.Tableau` already requires at least 4n² + O(n) bits of memory (see section 2.3 of http://arxiv.org/abs/2103.02202).
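As a rough illustration of that quadratic cost (back-of-the-envelope arithmetic only, using the ~4n² bits figure from the paper above):

```python
# Back-of-the-envelope illustration of the quadratic memory cost of storing an
# n-qubit stabilizer state as a tableau (~4 * n**2 bits, see section 2.3 of
# http://arxiv.org/abs/2103.02202). Keeping the circuits local keeps n small.
for n in (100, 1_000, 10_000, 100_000):
    tableau_mib = 4 * n**2 / 8 / 1024**2
    print(f"n = {n:>7} qubits -> ~{tableau_mib:,.2f} MiB per tableau")
```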
In practice, we do not have to use the full quantum circuit. In the QEC codes that we consider for the moment, detectors are local both in time and in space.
Describe the solution you'd like
The solution I have in mind (and that will have to be debated and improved here) is the following.
Being able to compute detectors locally
I think that we should try to call the detector computation procedures on the smallest `stim.Circuit` possible. To reach this goal, I think that we should extend the existing sub-template approach, in a few steps (a rough sketch of the intended local computation is given right after the list below):
Make it easier to compute detectors on whole `Template` instances through sub-templates. Most of the code is already present in the code base:
- [x] split any `Template` instance into sub-templates (done, see `tqec.templates.subtemplates`)
- [ ] #352
- [ ] #353
- [ ] #354
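To make the intended local computation concrete, here is a purely illustrative sketch (none of this is existing tqec code): the mapping of sub-templates and the circuit-building callable are assumptions, and only `annotate_detectors_automatically` and the stim calls are real.

```python
# Illustrative sketch only, NOT existing tqec code. `build_circuit` stands for
# a (currently hypothetical) helper that builds the small stim.Circuit
# corresponding to one sub-template and its plaquettes.
from typing import Callable, Mapping

import stim

from tqec.circuit.detectors.construction import annotate_detectors_automatically


def detectors_per_situation(
    subtemplates: Mapping[int, object],
    build_circuit: Callable[[object], stim.Circuit],
) -> dict[int, list[stim.CircuitInstruction]]:
    """Compute detectors once per unique sub-template ("situation")."""
    result: dict[int, list[stim.CircuitInstruction]] = {}
    for index, subtemplate in subtemplates.items():
        # The circuit here only spans the local neighbourhood of one
        # measurement, so the stabilizer tableau built internally stays small
        # whatever the value of k.
        annotated = annotate_detectors_automatically(build_circuit(subtemplate))
        # Keep only the DETECTOR annotations; they still need to be filtered
        # and translated to their actual position(s) in the full circuit.
        result[index] = [
            instruction
            for instruction in annotated.flattened()
            if instruction.name == "DETECTOR"
        ]
    return result
```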
About the third point, having a kind of `DetectorDatabase`: I really think this is an important idea that we should eventually have in the code base. Having such a data structure would bring several non-negligible benefits, in my opinion:
- being able to visualize in a concise manner all the different kinds of detectors present in the circuit,
- letting a user provide their own `DetectorDatabase` with detectors that might be better suited to the decoder they plan to use,
- letting a user provide their own `DetectorDatabase` to be sure that the detectors inserted in the circuit are deterministic (and so ensure replicability).
Such a data structure would be a mapping from a "situation" to the detector(s) corresponding to this "situation". Here, a "situation" is the local neighbourhood of a measurement and can be represented by:
- a list of `Template` instantiations (most likely sub-template instantiations, as the goal is to keep the entries small here), each entry representing the spatial neighbourhood at one time step,
- a `Plaquettes` instance (or a list of `Plaquettes`, one for each time step) representing the different plaquettes used.
Note that the `tqec.templates.subtemplates` module already has the notion of "situation": different integers in `UniqueSubTemplates.subtemplate_indices` represent different "situations".
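As a purely illustrative sketch of the proposed mapping (none of these names exist in tqec yet; a detector is represented here by its stim DETECTOR instruction and plaquettes by plain names, only to keep the example self-contained and hashable):

```python
# Rough sketch of the proposed DetectorDatabase, not an existing tqec class.
from __future__ import annotations

from dataclasses import dataclass, field

import stim


@dataclass(frozen=True)
class Situation:
    """Local space-time neighbourhood of a measurement.

    Each entry of `subtemplate_instantiations` is the flattened spatial
    neighbourhood at one time step; `plaquette_names` lists, for each of these
    time steps, the plaquettes that are used.
    """

    subtemplate_instantiations: tuple[tuple[int, ...], ...]
    plaquette_names: tuple[tuple[str, ...], ...]


@dataclass
class DetectorDatabase:
    """Maps each already-encountered situation to its pre-computed detectors."""

    _detectors: dict[Situation, tuple[stim.CircuitInstruction, ...]] = field(
        default_factory=dict
    )

    def get(self, situation: Situation) -> tuple[stim.CircuitInstruction, ...] | None:
        """Return the cached detectors, or None if the situation is new."""
        return self._detectors.get(situation)

    def add(
        self, situation: Situation, detectors: tuple[stim.CircuitInstruction, ...]
    ) -> None:
        self._detectors[situation] = detectors
```

A user-provided database would then simply be an instance of such a class pre-filled with the detectors they want, with a cache miss falling back to the automatic computation.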
Refactor the circuit creation pipeline
We also need to refactor the circuit creation process. The code used to transform a BlockGraph instance first to a CompiledGraph and then to a stim.Circuit should, in my opinion, use the spatial information we have on blocks to try to split the circuit generation process into the smallest possible chunks.
I do not have a definite idea of how to improve that for the moment, but I will likely end up trying things until I have a satisfactory idea or someone else does.
> compute detectors on each of the different sub-templates (requires filtering out some detectors that do not involve the main measurements we are interested in, which is not done yet)
Don't we need to combine different sub-templates in contiguous time steps to compute detectors?
> being able to visualize in a concise manner all the different kinds of detectors present in the circuit
I like this point. Ideally, this could be some subcircuit annotated with the closed stabilizer flow representing the detectors. (Could it be opened in Crumble?)
> We also need to refactor the circuit creation process. The code used to transform a `BlockGraph` instance first to a `CompiledGraph` and then to a `stim.Circuit` should, in my opinion, use the spatial information we have on blocks to try to split the circuit generation process into the smallest possible chunks.
I think this point is not so important for now, and I am not sure whether it is necessary as `k` scales up.
> Don't we need to combine different sub-templates in contiguous time steps to compute detectors?
We might, but that should not be an issue to implement.
> I like this point. Ideally, this could be some subcircuit annotated with the closed stabilizer flow representing the detectors. (Could it be opened in Crumble?)
I will soon submit a tentative PR for a DetectorDatabase implementation. Visualization is quite trivial thanks to stim, so yes.
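For reference, a stand-alone stim example of that kind of visualization (plain stim only, no tqec code; the generated surface-code circuit is just a stand-in):

```python
# Plain-stim illustration: the "detslice-with-ops" diagram overlays, for each
# detector, the stabilizer region it closes on top of the circuit operations.
import stim

circuit = stim.Circuit.generated(
    "surface_code:rotated_memory_z", distance=3, rounds=3
)
with open("detectors.svg", "w") as f:
    f.write(str(circuit.diagram("detslice-with-ops-svg")))
```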
> I think this point is not so important for now, and I am not sure whether it is necessary as `k` scales up.
I agree this is not the priority; benchmarks will tell us about scaling, provided that we find a way to efficiently compute the detectors with the current generation process.