Improve performance for MacOS users

Open nelimee opened this issue 10 months ago • 27 comments

Describe the bug

It seems that MacOS users are experiencing poor performance when building circuits with TQEC. It would be nice to be able to measure that objectively.

Steps to reproduce the behavior

With main.py being

from tqec import Basis, NoiseModel, compile_block_graph
from tqec.gallery.cnot import cnot

block_graph = cnot(Basis.Z)
correlation_surfaces = block_graph.find_correlation_surfaces()

compiled_computation = compile_block_graph(
    block_graph, observables=[correlation_surfaces[1]]
)

circuit = compiled_computation.generate_stim_circuit(
    k=2,
    noise_model=NoiseModel.uniform_depolarizing(0.001),
)

do the following

python -m pip install "tqec[bench]"
python -m pyinstrument -o benchmark.html -r html main.py

while reducing the parallel load on your computer as much as possible (if you can, close all other applications, do nothing on your computer during the benchmark, ...).

Then, share the following information:

  • the benchmark.html file that has been generated (I have vague memories of GitHub not accepting such files as attachments, if that is still the case I'll open a discussion on the Google group),
  • as many details about your computer as you can (OS, processor, amount of RAM, Python version, output of python -m pip freeze, ...).
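To make the machine details easier to collect, here is a minimal sketch that uses only the Python standard library. The function name collect_system_info is made up for this example. The amount of RAM is not included because the standard library has no portable way to query it, so that still has to be reported by hand.

```python
import json
import os
import platform


def collect_system_info() -> dict[str, str]:
    """Gather basic details about the operating system and the interpreter."""
    return {
        "os": platform.platform(),
        "machine": platform.machine() or "<unknown>",
        "processor": platform.processor() or "<unknown>",
        "logical_cpus": str(os.cpu_count()),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
    }


if __name__ == "__main__":
    # Print the details as JSON so they can be pasted into a comment.
    print(json.dumps(collect_system_info(), indent=2))
```

The output of python -m pip freeze can be pasted next to it.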

For laptop users only:

  • first, do the benchmark with your regular setup (i.e., without touching anything related to power),
  • if you have the time to do so, it would also be interesting to re-do the benchmark with your laptop plugged in and charging,
  • if you have even more time and willingness it would be interesting to try to disable power saving options and re-do the benchmark.

For reference, on my computer:

  • python main.py takes ~14.5s,
  • python -m pyinstrument -o benchmark.html -r html main.py takes ~22.5s.

nelimee avatar Mar 05 '25 17:03 nelimee

I can confirm that this issue seems to affect mostly Mac users.

main.py takes approximately 11-12 s on my ThinkPad.

  • OS: Linux Mint 21.3 Cinnamon 6.0.4
  • Processor: 13th Gen Intel Core i5-1335U x 10
  • RAM: 16 GB
  • Python version: 3.13.2

Link to html file: https://drive.google.com/file/d/1cmDfktq1KtC6ZVfJVsPPWRLW7YNMogT0/view?usp=sharing

purva-thakre avatar Mar 05 '25 18:03 purva-thakre

  • OS: Arch Linux
  • Processor: Intel i5-14600KF (20) @ 5.30 GHz
  • RAM: 32GB
  • Python: 3.12.6

  • python main.py: ~10s
  • python -m pyinstrument -o benchmark.html -r html main.py: ~14s

inmzhang avatar Mar 06 '25 03:03 inmzhang

After a discussion and live testing with Ángela:

At first glance, the problem seems to be independent of any particular tqec module: it seems like every function call is slowed down. I will dig more into this later.

As a first "solution" for MacOS users (everyone will benefit from this, but MacOS users will likely see a huge improvement), you can try to use DetectorDatabase.

from pathlib import Path

from tqec import Basis, NoiseModel, compile_block_graph
from tqec.compile.detectors.database import DetectorDatabase
from tqec.gallery.cnot import cnot

block_graph = cnot(Basis.Z)
correlation_surfaces = block_graph.find_correlation_surfaces()

compiled_computation = compile_block_graph(
    block_graph, observables=[correlation_surfaces[1]]
)

database_path = Path("./database_cnot.pkl")
if database_path.exists():
    database = DetectorDatabase.from_file(database_path)
else:
    database = DetectorDatabase()

circuit = compiled_computation.generate_stim_circuit(
    k=2,
    noise_model=NoiseModel.uniform_depolarizing(0.001),
    detector_database=database,
)

database.to_file(database_path)

A few subtleties that should be noted:

  • the first run will only see a modest improvement: the database still needs to be populated, but some computations can already be avoided,
  • the second run (and any subsequent run with a populated database) will see a huge boost in performance,
  • the database should be valid whatever the computation / value of k: you can re-use the same database, over and over again, even when changing the computation or value of k. Note that in the code above, the database is unconditionally saved, overwriting the existing one.
  • there is a plateau phenomenon on k: for small values of k (something like [1, 5] but that depends on the computation), increasing k also increases the time it takes to generate the circuit. As soon as the plateau is reached, increasing k should have a negligible impact on performance. In other words, generating with a populated database for k=20 and for k=30 should take a similar time.
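The first-run versus second-run behaviour described above can be illustrated with a generic disk-backed cache. This is only a sketch of the idea, not how DetectorDatabase is implemented: PersistentCache, expensive and run are hypothetical names, and the "computation" is a placeholder.

```python
import pickle
from pathlib import Path
from typing import Callable


class PersistentCache:
    """Hypothetical stand-in for a disk-backed lookup table (not tqec's API)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.hits = 0
        self.misses = 0
        self._entries: dict[str, int] = {}
        # Re-use entries saved by a previous run, if any.
        if path.exists():
            with path.open("rb") as f:
                self._entries = pickle.load(f)

    def get_or_compute(self, key: str, compute: Callable[[str], int]) -> int:
        if key in self._entries:
            self.hits += 1
        else:
            self.misses += 1
            self._entries[key] = compute(key)
        return self._entries[key]

    def save(self) -> None:
        # Unconditionally overwrites the file, like the snippet above.
        with self.path.open("wb") as f:
            pickle.dump(self._entries, f)


def expensive(key: str) -> int:
    """Placeholder for a costly computation."""
    return sum(ord(c) for c in key)


def run(path: Path, keys: list[str]) -> tuple[int, int]:
    """Process all keys through the cache and return (hits, misses)."""
    cache = PersistentCache(path)
    for key in keys:
        cache.get_or_compute(key, expensive)
    cache.save()
    return cache.hits, cache.misses
```

On a first run with an empty file, only keys repeated within that run are served from the cache, which matches the modest first-run gain. On later runs every key is a hit.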

For reference, for the CNOT with k=2, on my computer:

  • Without database: ~16s.
  • First run with the database: ~10s.
  • Second and subsequent run with the database fully populated: 2s.

From memory, on Ángela's M1 Mac:

  • Without database: ~180s.
  • First run with the database: ~80s.
  • Second and subsequent run with the database fully populated: 8s.

Note that, for my computer, I made a few benchmarks when introducing the DetectorDatabase. You can find them in this comment.

nelimee avatar Mar 06 '25 19:03 nelimee

I'm running Windows on quite an old ThinkPad and this made a big difference to me too:

  • Without database: 94s.
  • First run with database: 50s.
  • Second run with database: 10s.

Computer specs:

  • OS: Windows 10
  • Processor: Intel i5-3320M @ 2.60 GHz
  • RAM: 8GB
  • Python: 3.12.6

BSchelpe avatar Mar 07 '25 19:03 BSchelpe

Computer specs:

  • Asus Laptop. x64.
  • OS: Windows 11.
  • Processor: Intel i7-1065G7 @ 1.30GHz, 4 cores.
  • RAM: 16GB
  • Video: Intel Iris Plus Graphics.
  • Python: 3.12.6 (running from venv).

Times:

  • main.py: 29s (plugged), 32s (unplugged and/or with external screen attached).
  • python -m pyinstrument -o benchmark.html -r html main.py: 40s (plugged), 44s (unplugged and/or with external screen attached).

jbolns avatar Mar 12 '25 10:03 jbolns

Looking at the benchmarks everyone sent, it seems like there might be an issue with the following line:

https://github.com/tqec/tqec/blob/503c7dca6297a4fc7f79a46480389d7b0dcf299f/src/tqec/circuit/moment.py#L281

From the provided benchmarks, at one place in the code, the above line takes:

  • 3.4% of the total execution time on my computer (Linux),
  • 7.8% of the total execution time on J's computer (Windows 11),
  • 17.1% of the total execution time on Ángela's computer (MacOS with M2 chip),
  • 33.6% of the total execution time on Kabir's computer (MacOS with M3 chip).

I tried to replicate the workload with the following code:

import time
from typing import Iterator

import stim


def iterate_flat_circuit(circuit: stim.Circuit) -> Iterator[stim.CircuitInstruction]:
    yield from circuit  # type: ignore


for rounds in [10, 100, 1000, 10000, 100000]:
    circuit = stim.Circuit.generated(
        "surface_code:rotated_memory_z", distance=11, rounds=rounds
    ).flattened()

    start = time.time_ns()
    instructions_count = sum(1 for _ in iterate_flat_circuit(circuit))
    end = time.time_ns()
    print(
        f"{rounds:>6} rounds, {instructions_count:>8} instructions done "
        f"in {(end - start) / 10**6:.2f}ms."
    )

You do not have to use pyinstrument anymore, just run the code and copy-paste the output here.

For reference, on my computer, here are the results:

    10 rounds,     1584 instructions done in 1.25ms.
   100 rounds,    13644 instructions done in 9.59ms.
  1000 rounds,   134244 instructions done in 91.45ms.
 10000 rounds,  1340244 instructions done in 907.99ms.
100000 rounds, 13400244 instructions done in 9076.81ms.

For MacOS users, you do not have to finish the benchmark. If it takes too much time on your machine, stop the execution and report only what has been benchmarked. Note that the time scales linearly on my machine, which is exactly what is expected, so having the first 3 points should already be sufficient to have a good enough idea of the performance.

If MacOS users are experiencing slowdowns, then that may be due to pre-compiled binaries of stim not being as well optimised on MacOS as on Linux. More investigations will have to be performed once we have the benchmark results of everyone.

nelimee avatar Mar 12 '25 11:03 nelimee

Note that the last message was a call to action. I do not have any MacOS-based machine on hand, so I cannot benchmark things myself. If you have a MacOS-based machine and would like TQEC to be more efficient, please answer here with the timings returned by the code above.

nelimee avatar Mar 18 '25 12:03 nelimee

I can run the benchmark on my M1 Pro Mac tomorrow.

inmzhang avatar Mar 18 '25 13:03 inmzhang

10 rounds,     1584 instructions done in 2.26ms.
100 rounds,    13644 instructions done in 11.19ms.
1000 rounds,   134244 instructions done in 111.59ms.
10000 rounds,  1340244 instructions done in 1094.66ms.
100000 rounds, 13400244 instructions done in 11764.16ms.

Ran this on a 16 GB Apple M3 macOS 14.6.1. I wrote some more specs in the benchmarks thread on the Google group with subject "Sharing benchmarks" (major differences are that my laptop was charging and I was running more apps). Thanks, Adrien!
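To compare the two sets of timings posted so far, the elapsed times can be converted to a cost per instruction. The sketch below only re-uses the numbers reported in this thread. The helper name ns_per_instruction and the labels LINUX and APPLE_M3 are chosen for this example.

```python
# (rounds, instructions, elapsed_ms) copied from the two reports in this thread.
LINUX = [
    (10, 1584, 1.25),
    (100, 13644, 9.59),
    (1000, 134244, 91.45),
    (10000, 1340244, 907.99),
    (100000, 13400244, 9076.81),
]
APPLE_M3 = [
    (10, 1584, 2.26),
    (100, 13644, 11.19),
    (1000, 134244, 111.59),
    (10000, 1340244, 1094.66),
    (100000, 13400244, 11764.16),
]


def ns_per_instruction(instructions: int, elapsed_ms: float) -> float:
    """Convert a total time in milliseconds to nanoseconds per instruction."""
    return elapsed_ms * 1e6 / instructions


for (rounds, count, linux_ms), (_, _, m3_ms) in zip(LINUX, APPLE_M3):
    linux_ns = ns_per_instruction(count, linux_ms)
    m3_ns = ns_per_instruction(count, m3_ms)
    print(
        f"{rounds:>6} rounds: Linux {linux_ns:7.1f} ns/instr, "
        f"M3 {m3_ns:7.1f} ns/instr, ratio {m3_ns / linux_ns:.2f}x"
    )
```

For every row the M3 takes less than twice as long as the Linux machine, so iterating over a flat stim circuit does not seem to account, by itself, for the much larger gap seen on the full main.py benchmark.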

KabirDubey avatar Mar 18 '25 13:03 KabirDubey

Humm, that's not what I expected. Could you please re-run the original benchmark?

main.py

from tqec import Basis, NoiseModel, compile_block_graph
from tqec.gallery.cnot import cnot

block_graph = cnot(Basis.Z)
correlation_surfaces = block_graph.find_correlation_surfaces()

compiled_computation = compile_block_graph(
    block_graph, observables=[correlation_surfaces[1]]
)

circuit = compiled_computation.generate_stim_circuit(
    k=2,
    noise_model=NoiseModel.uniform_depolarizing(0.001),
)

and

python -m pyinstrument -o benchmark.html -r html main.py

and share the resulting .html file to the shared Drive folder linked here: https://groups.google.com/g/tqec-design-automation/c/fUvzugEbNyY ? Make that benchmark with your laptop charging if possible, and do not overwrite your benchmark with your laptop on battery as I would like to compare.

nelimee avatar Mar 18 '25 14:03 nelimee

Done, see file titled kabir_laptop_charging

KabirDubey avatar Mar 18 '25 16:03 KabirDubey

Ok, let's change the profiling library to get a different granularity (anyone with a Mac is encouraged to do so, the more data we get, the quicker we might be able to spot the performance problem).

First, install the line_profiler package with python -m pip install line_profiler.

In src/tqec/circuit/moment.py add the following lines:

from line_profiler import profile

# Code from Moment class ...

@profile 
# def with_mapped_qubit_indices(self, qubit_index_map: dict[int, int]) -> Moment:

In other words, add the profile decorator from the line_profiler package to the with_mapped_qubit_indices method from the Moment class in src/tqec/circuit/moment.py.
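If you want the modified files to keep working in an environment where line_profiler is not installed, one option is a guarded import with a no-op fallback. This is a sketch, not something the tqec code base does. The function double_all is a stand-in for the method being profiled.

```python
try:
    # Provided by recent line_profiler releases. Profiling is only enabled
    # when requested, e.g. with the LINE_PROFILE=1 environment variable.
    from line_profiler import profile
except ImportError:

    def profile(func):
        """No-op fallback used when line_profiler is not installed."""
        return func


@profile
def double_all(values: list[int]) -> list[int]:
    """Stand-in for the method being profiled."""
    return [2 * v for v in values]


if __name__ == "__main__":
    print(double_all([1, 2, 3]))
```

Either way the decorated function behaves the same. The decorator only adds timing collection when profiling is switched on.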

In src/tqec/circuit/schedule/manipulation.py do the same:

from line_profiler import profile

# Code for several functions 

@profile 
# def merge_scheduled_circuits(
#     circuits: list[ScheduledCircuit],
#     global_qubit_map: QubitMap,
#     mergeable_instructions: Iterable[str] = (),
# ) -> ScheduledCircuit:

Then run the original main.py:

from tqec import Basis, NoiseModel, compile_block_graph
from tqec.gallery.cnot import cnot

block_graph = cnot(Basis.Z)
correlation_surfaces = block_graph.find_correlation_surfaces()

compiled_computation = compile_block_graph(
    block_graph, observables=[correlation_surfaces[1]]
)

circuit = compiled_computation.generate_stim_circuit(
    k=2,
    noise_model=NoiseModel.uniform_depolarizing(0.001),
)

by using the following

LINE_PROFILE=1 python main.py

This should output a message like

Timer unit: 1e-09 s

  5.81 seconds - /.../tqec/src/tqec/circuit/moment.py:335 - with_mapped_qubit_indices
Wrote profile results to profile_output.txt
Wrote profile results to profile_output_2025-03-18T165611.txt
Wrote profile results to profile_output.lprof
To view details run:
python -m line_profiler -rtmz profile_output.lprof

Share here the profile_output.txt file (you can remove information from the paths if you do not want your name to appear here).

As a reference, here is what I get:

Timer unit: 1e-09 s

Total time: 4.72901 s
File: /workspaces/tqec/src/tqec/circuit/schedule/manipulation.py
Function: merge_scheduled_circuits at line 244

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   244                                           @profile
   245                                           def merge_scheduled_circuits(
   246                                               circuits: list[ScheduledCircuit],
   247                                               global_qubit_map: QubitMap,
   248                                               mergeable_instructions: Iterable[str] = (),
   249                                           ) -> ScheduledCircuit:
   250                                               """Merge several :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit`
   251                                               instances into one instance.
   252                                           
   253                                               This function takes several **compatible** scheduled circuits as input and
   254                                               merge them, respecting their schedules, into a unique
   255                                               :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit` instance that will
   256                                               then be returned to the caller.
   257                                           
   258                                               The provided circuits should be compatible between each other. Compatible
   259                                               circuits are circuits that can all be described with a unique global qubit
   260                                               map. In other words, if two circuits from the list of compatible circuits
   261                                               use the same qubit index, that should mean that they use the same qubit.
   262                                               You can obtain compatible circuits by using
   263                                               :func:`relabel_circuits_qubit_indices`.
   264                                           
   265                                               Args:
   266                                                   circuits: **compatible** circuits to merge.
   267                                                   qubit_map: global qubit map for all the provided ``circuits``.
   268                                                   mergeable_instructions: a list of instruction names that are considered
   269                                                       mergeable. Duplicate instructions with a name in this list will be
   270                                                       merged into a single instruction.
   271                                           
   272                                               Returns:
   273                                                   a circuit representing the merged scheduled circuits given as input.
   274                                               """
   275      1449   38175632.0  26346.2      0.8      scheduled_circuits = _ScheduledCircuits(circuits, global_qubit_map)
   276                                           
   277      1449     296846.0    204.9      0.0      all_moments: list[Moment] = []
   278      1449    5727508.0   3952.7      0.1      all_schedules = Schedule()
   279     54222   50714062.0    935.3      1.1      global_i2q = QubitMap({i: q for q, i in scheduled_circuits.q2i.items()})
   280                                           
   281     10143   47266490.0   4660.0      1.0      while scheduled_circuits.has_pending_moment():
   282      8694  341585799.0  39289.8      7.2          schedule, moments = scheduled_circuits.collect_moments_at_minimum_schedule()
   283                                                   # Flatten the moments into a list of operations to perform some modifications
   284     17388  888520376.0  51099.6     18.8          instructions: list[stim.CircuitInstruction] = sum(
   285      8694    3587129.0    412.6      0.1              (list(moment.instructions) for moment in moments), start=[]
   286                                                   )
   287                                                   # Avoid duplicated operations. Any operation that have the Plaquette.get_mergeable_tag() tag
   288                                                   # is considered mergeable, and can be removed if another operation in the list
   289                                                   # is considered equal (and has the mergeable tag).
   290     17388 1543090526.0  88744.6     32.6          deduplicated_instructions = remove_duplicate_instructions(
   291      8694    1256366.0    144.5      0.0              instructions,
   292      8694    4002590.0    460.4      0.1              mergeable_instruction_names=frozenset(mergeable_instructions),
   293                                                   )
   294      8694  269700961.0  31021.5      5.7          merged_instructions = merge_instructions(deduplicated_instructions)
   295      8694    8874976.0   1020.8      0.2          circuit = stim.Circuit()
   296     20286    7542743.0    371.8      0.2          for inst in merged_instructions:
   297     23184  148588707.0   6409.1      3.1              circuit.append(
   298     11592    6126590.0    528.5      0.1                  inst.name,
   299     11592  319458009.0  27558.5      6.8                  sum(_sort_target_groups(inst.target_groups()), start=[]),
   300     11592    6857295.0    591.6      0.1                  inst.gate_args_copy(),
   301                                                       )
   302      8694  969022955.0 111458.8     20.5          all_moments.append(Moment(circuit))
   303      8694   35410726.0   4073.0      0.7          all_schedules.append(schedule)
   304                                           
   305      1449   33204071.0  22915.2      0.7      return ScheduledCircuit(all_moments, all_schedules, global_i2q, _avoid_checks=True)

Total time: 5.77779 s
File: /workspaces/tqec/src/tqec/circuit/moment.py
Function: with_mapped_qubit_indices at line 335

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   335                                               @profile
   336                                               def with_mapped_qubit_indices(self, qubit_index_map: dict[int, int]) -> Moment:
   337                                                   """Map the qubits **indices** the :class:`Moment` instance is applied
   338                                                   on.
   339                                           
   340                                                   Note:
   341                                                       This method has to iterate over all the instructions in ``self`` and
   342                                                       change the gate target they are applied on.
   343                                           
   344                                                   Args:
   345                                                       qubit_index_map: the map used to modify the qubit targets.
   346                                           
   347                                                   Returns:
   348                                                       a modified copy of ``self`` with the qubit gate targets mapped according
   349                                                       to the provided ``qubit_index_map``.
   350                                                   """
   351    135756  155407303.0   1144.8      2.7          circuit = stim.Circuit()
   352    276974 1095316149.0   3954.6     19.0          for instr in self.instructions:
   353    141218   27182953.0    192.5      0.5              mapped_targets: list[stim.GateTarget] = []
   354    576297  312885511.0    542.9      5.4              for target in instr.targets_copy():
   355                                                           # Non qubit targets are left untouched.
   356    435079  198306040.0    455.8      3.4                  if not target.is_qubit_target:
   357                                                               mapped_targets.append(target)
   358                                                               continue
   359                                                           # Qubit targets are mapped using `qubit_index_map`
   360    435079  341286586.0    784.4      5.9                  target_qubit = cast(int, target.qubit_value)
   361    870158  157767122.0    181.3      2.7                  mapped_targets.append(
   362    435079 1419714503.0   3263.1     24.6                      stim.GateTarget(qubit_index_map[target_qubit])
   363    435079  193712045.0    445.2      3.4                      if not target.is_inverted_result_target
   364                                                               else stim.GateTarget(-qubit_index_map[target_qubit])
   365                                                           )
   366    141218 1370112174.0   9702.1     23.7              circuit.append(instr.name, mapped_targets, instr.gate_args_copy())
   367    271512  284257189.0   1046.9      4.9          return Moment(
   368    135756   20835792.0    153.5      0.4              circuit,
   369    570835  183084783.0    320.7      3.2              used_qubits={qubit_index_map[q] for q in self._used_qubits},
   370    135756   17916959.0    132.0      0.3              _avoid_checks=True,
   371                                                   )

  4.73 seconds - /workspaces/tqec/src/tqec/circuit/schedule/manipulation.py:244 - merge_scheduled_circuits
  5.78 seconds - /workspaces/tqec/src/tqec/circuit/moment.py:335 - with_mapped_qubit_indices
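To compare reports from different machines without reading every row, the text output can be reduced to its most expensive lines. Below is a rough sketch that assumes the column layout shown above (line number, hits, time, per hit, % time, source). The regular expression and the hottest_lines helper are written for this example and are not part of line_profiler. A docstring line that happens to start with numbers could be misread as a data row.

```python
import re

# One data row of the text report:
# line number, hits, total time, time per hit, % time, source text.
ROW = re.compile(
    r"^\s*(?P<lineno>\d+)\s+(?P<hits>\d+)\s+(?P<time>[\d.e+]+)\s+"
    r"(?P<per_hit>[\d.e+]+)\s+(?P<percent>\d+\.\d+)\s+(?P<source>\S.*)$"
)


def hottest_lines(report: str, top: int = 3) -> list[tuple[int, float, str]]:
    """Return (line number, % time, source) for the most expensive rows."""
    rows = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match:  # rows without timing columns (comments, docstrings) are skipped
            rows.append(
                (int(match["lineno"]), float(match["percent"]), match["source"].strip())
            )
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


# A few rows copied from the with_mapped_qubit_indices report above.
SAMPLE = """\
   351    135756  155407303.0   1144.8      2.7          circuit = stim.Circuit()
   352    276974 1095316149.0   3954.6     19.0          for instr in self.instructions:
   355                                                           # Non qubit targets are left untouched.
   362    435079 1419714503.0   3263.1     24.6                      stim.GateTarget(qubit_index_map[target_qubit])
   366    141218 1370112174.0   9702.1     23.7              circuit.append(instr.name, mapped_targets, instr.gate_args_copy())
"""

for lineno, percent, source in hottest_lines(SAMPLE):
    print(f"{percent:5.1f}%  line {lineno}: {source}")
```

In practice the report string would be read from profile_output.txt instead of the embedded sample.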

nelimee avatar Mar 18 '25 17:03 nelimee

The command LINE_PROFILE=1 python main.py takes ~10 mins to complete and outputs neither a message nor a profile_output.txt file. Here's my git diff on main.

diff --git a/src/tqec/circuit/moment.py b/src/tqec/circuit/moment.py
index 9f162866..4b5ccbf1 100644
--- a/src/tqec/circuit/moment.py
+++ b/src/tqec/circuit/moment.py
@@ -18,6 +18,7 @@ from tqec.circuit.qubit import count_qubit_accesses, get_used_qubit_indices
 from tqec.utils.exceptions import TQECException
 from tqec.utils.instructions import is_annotation_instruction
 
+from line_profiler import profile 
 
 class Moment:
     """A collection of instructions that can be executed in parallel.
@@ -330,7 +331,8 @@ class Moment:
             used_qubits=self._used_qubits,
             _avoid_checks=True,
         )
-
+    
+    @profile 
     def with_mapped_qubit_indices(self, qubit_index_map: dict[int, int]) -> Moment:
         """Map the qubits **indices** the :class:`Moment` instance is applied
         on.
diff --git a/src/tqec/circuit/schedule/manipulation.py b/src/tqec/circuit/schedule/manipulation.py
index c8795415..1ede3463 100644
--- a/src/tqec/circuit/schedule/manipulation.py
+++ b/src/tqec/circuit/schedule/manipulation.py
@@ -30,6 +30,8 @@ from tqec.circuit.schedule.circuit import ScheduledCircuit
 from tqec.circuit.schedule.schedule import Schedule
 from tqec.utils.exceptions import TQECException, TQECWarning
 
+from line_profiler import profile
+
 
 class _ScheduledCircuits:
     def __init__(
@@ -239,7 +241,7 @@ def merge_instructions(
         for (name, args), targets in instructions_merger.items()
     ]
 
-
+@profile
 def merge_scheduled_circuits(
     circuits: list[ScheduledCircuit],
     global_qubit_map: QubitMap,

I can try to debug, but I've never used the line_profiler library, so it may save some time if you take a look first.

KabirDubey avatar Mar 18 '25 17:03 KabirDubey

Maybe to check:

import os

from tqec import Basis, NoiseModel, compile_block_graph
from tqec.gallery.cnot import cnot

ENV_VAR_NAME = "LINE_PROFILE"
print(f"{ENV_VAR_NAME}:", os.environ.get(ENV_VAR_NAME, "<not set>"))

block_graph = cnot(Basis.Z)
correlation_surfaces = block_graph.find_correlation_surfaces()

compiled_computation = compile_block_graph(
    block_graph, observables=[correlation_surfaces[1]]
)

circuit = compiled_computation.generate_stim_circuit(
    k=2,
    noise_model=NoiseModel.uniform_depolarizing(0.001),
)

If it prints "LINE_PROFILE: <not set>" instead of "LINE_PROFILE: 1", then maybe follow this guide? If it prints "LINE_PROFILE: 1" but profiling still does not work, we will have to investigate more in-depth.

nelimee avatar Mar 18 '25 18:03 nelimee

I did not have any problem following Adrien's instructions when benchmarking on my M1 Pro laptop. Here's the output:

Timer unit: 1e-09 s

Total time: 60.1848 s
File: /Users/inm/open-source-project/tqec/src/tqec/circuit/schedule/manipulation.py
Function: merge_scheduled_circuits at line 243

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   243                                           @profile
   244                                           def merge_scheduled_circuits(
   245                                               circuits: list[ScheduledCircuit],
   246                                               global_qubit_map: QubitMap,
   247                                               mergeable_instructions: Iterable[str] = (),
   248                                           ) -> ScheduledCircuit:
   249                                               """Merge several :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit`
   250                                               instances into one instance.
   251                                           
   252                                               This function takes several **compatible** scheduled circuits as input and
   253                                               merge them, respecting their schedules, into a unique
   254                                               :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit` instance that will
   255                                               then be returned to the caller.
   256                                           
   257                                               The provided circuits should be compatible between each other. Compatible
   258                                               circuits are circuits that can all be described with a unique global qubit
   259                                               map. In other words, if two circuits from the list of compatible circuits
   260                                               use the same qubit index, that should mean that they use the same qubit.
   261                                               You can obtain compatible circuits by using
   262                                               :func:`relabel_circuits_qubit_indices`.
   263                                           
   264                                               Args:
   265                                                   circuits: **compatible** circuits to merge.
   266                                                   qubit_map: global qubit map for all the provided ``circuits``.
   267                                                   mergeable_instructions: a list of instruction names that are considered
   268                                                       mergeable. Duplicate instructions with a name in this list will be
   269                                                       merged into a single instruction.
   270                                           
   271                                               Returns:
   272                                                   a circuit representing the merged scheduled circuits given as input.
   273                                               """
   274      1449   21752000.0  15011.7      0.0      scheduled_circuits = _ScheduledCircuits(circuits, global_qubit_map)
   275                                           
   276      1449     169000.0    116.6      0.0      all_moments: list[Moment] = []
   277      1449    4528000.0   3124.9      0.0      all_schedules = Schedule()
   278     54222   33353000.0    615.1      0.1      global_i2q = QubitMap({i: q for q, i in scheduled_circuits.q2i.items()})
   279                                           
   280     10143   25087000.0   2473.3      0.0      while scheduled_circuits.has_pending_moment():
   281      8694  203874000.0  23450.0      0.3          schedule, moments = scheduled_circuits.collect_moments_at_minimum_schedule()
   282                                                   # Flatten the moments into a list of operations to perform some modifications
   283     17388        4e+10    2e+06     66.3          instructions: list[stim.CircuitInstruction] = sum(
   284      8694    1900000.0    218.5      0.0              (list(moment.instructions) for moment in moments), start=[]
   285                                                   )
   286                                                   # Avoid duplicated operations. Any operation that have the Plaquette.get_mergeable_tag() tag
   287                                                   # is considered mergeable, and can be removed if another operation in the list
   288                                                   # is considered equal (and has the mergeable tag).
   289     17388        1e+10 582507.5     16.8          deduplicated_instructions = remove_duplicate_instructions(
   290      8694     892000.0    102.6      0.0              instructions,
   291      8694    2752000.0    316.5      0.0              mergeable_instruction_names=frozenset(mergeable_instructions),
   292                                                   )
   293      8694  162534000.0  18695.0      0.3          merged_instructions = merge_instructions(deduplicated_instructions)
   294      8694    4210000.0    484.2      0.0          circuit = stim.Circuit()
   295     20286    4753000.0    234.3      0.0          for inst in merged_instructions:
   296     23184  796792000.0  34368.2      1.3              circuit.append(
   297     11592    3253000.0    280.6      0.0                  inst.name,
   298     11592  178081000.0  15362.4      0.3                  sum(_sort_target_groups(inst.target_groups()), start=[]),
   299     11592    3410000.0    294.2      0.0                  inst.gate_args_copy(),
   300                                                       )
   301      8694 8691765000.0 999742.9     14.4          all_moments.append(Moment(circuit))
   302      8694   19745000.0   2271.1      0.0          all_schedules.append(schedule)
   303                                           
   304      1449   22587000.0  15588.0      0.0      return ScheduledCircuit(all_moments, all_schedules, global_i2q, _avoid_checks=True)

Total time: 61.4382 s
File: /Users/inm/open-source-project/tqec/src/tqec/circuit/moment.py
Function: with_mapped_qubit_indices at line 335

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   335                                               @profile
   336                                               def with_mapped_qubit_indices(self, qubit_index_map: dict[int, int]) -> Moment:
   337                                                   """Map the qubits **indices** the :class:`Moment` instance is applied
   338                                                   on.
   339                                           
   340                                                   Note:
   341                                                       This method has to iterate over all the instructions in ``self`` and
   342                                                       change the gate target they are applied on.
   343                                           
   344                                                   Args:
   345                                                       qubit_index_map: the map used to modify the qubit targets.
   346                                           
   347                                                   Returns:
   348                                                       a modified copy of ``self`` with the qubit gate targets mapped according
   349                                                       to the provided ``qubit_index_map``.
   350                                                   """
   351    135756   82302000.0    606.2      0.1          circuit = stim.Circuit()
   352    276974        4e+10 155142.2     69.9          for instr in self.instructions:
   353    141218   16041000.0    113.6      0.0              mapped_targets: list[stim.GateTarget] = []
   354    576297  203630000.0    353.3      0.3              for target in instr.targets_copy():
   355                                                           # Non qubit targets are left untouched.
   356    435079  103588000.0    238.1      0.2                  if not target.is_qubit_target:
   357                                                               mapped_targets.append(target)
   358                                                               continue
   359                                                           # Qubit targets are mapped using `qubit_index_map`
   360    435079  183755000.0    422.3      0.3                  target_qubit = cast(int, target.qubit_value)
   361    870158  107407000.0    123.4      0.2                  mapped_targets.append(
   362    435079 8025480000.0  18446.0     13.1                      stim.GateTarget(qubit_index_map[target_qubit])
   363    435079  106704000.0    245.3      0.2                      if not target.is_inverted_result_target
   364                                                               else stim.GateTarget(-qubit_index_map[target_qubit])
   365                                                           )
   366    141218 9348318000.0  66197.8     15.2              circuit.append(instr.name, mapped_targets, instr.gate_args_copy())
   367    271512  150417000.0    554.0      0.2          return Moment(
   368    135756   12844000.0     94.6      0.0              circuit,
   369    570835  114218000.0    200.1      0.2              used_qubits={qubit_index_map[q] for q in self._used_qubits},
   370    135756   13182000.0     97.1      0.0              _avoid_checks=True,
   371                                                   )

 60.18 seconds - /Users/inm/open-source-project/tqec/src/tqec/circuit/schedule/manipulation.py:243 - merge_scheduled_circuits
 61.44 seconds - /Users/inm/open-source-project/tqec/src/tqec/circuit/moment.py:335 - with_mapped_qubit_indices

inmzhang avatar Mar 19 '25 04:03 inmzhang

From @inmzhang's benchmarks, it seems that the Moment.instructions lines are more costly on MacOS than on GNU/Linux. The code I shared should replicate that behaviour, but fails to do so. Let's get even closer to the actual code with a new benchmark:

import time
from typing import Iterator

import stim

class FakeMoment:
    def __init__(self, circuit: stim.Circuit) -> None:
        self._circuit = circuit

    @property
    def instructions(self) -> Iterator[stim.CircuitInstruction]:
        yield from self._circuit

for rounds in [10, 100, 1000, 10000, 100000]:
    circuit = stim.Circuit.generated(
        "surface_code:rotated_memory_z", distance=11, rounds=rounds
    ).flattened()
    moment = FakeMoment(circuit)

    start = time.time_ns()
    instructions_count = sum(1 for _ in moment.instructions)
    end = time.time_ns()
    print(
        f"{rounds:>6} rounds, {instructions_count:>8} instructions done "
        f"in {(end - start) / 10**6:.2f}ms."
    )

For reference, on my computer and on the main branch:

    10 rounds,     1584 instructions done in 1.24ms.
   100 rounds,    13644 instructions done in 9.45ms.
  1000 rounds,   134244 instructions done in 90.41ms.
 10000 rounds,  1340244 instructions done in 884.91ms.
100000 rounds, 13400244 instructions done in 8936.75ms.

nelimee avatar Mar 19 '25 09:03 nelimee

@nelimee, I left my laptop at the office and will update the benchmark tomorrow.

inmzhang avatar Mar 19 '25 11:03 inmzhang

On the M1Pro Mac:

    10 rounds,     1584 instructions done in 1.02ms.
   100 rounds,    13644 instructions done in 7.75ms.
  1000 rounds,   134244 instructions done in 76.34ms.
 10000 rounds,  1340244 instructions done in 755.06ms.
100000 rounds, 13400244 instructions done in 7646.37ms.

inmzhang avatar Mar 20 '25 03:03 inmzhang

So to summarise:

  • Both line_profiler and pyinstrument show that Moment.instructions takes a large portion of the time on MacOS,
  • None of my replication trials are able to replicate the issue.

Let's try one more profiler:

python -m cProfile -o benchmark.cprofile main.py

and share the benchmark.cprofile file please.
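
For anyone who prefers to inspect the file from a script, the standard `pstats` module can read it back. The following is only a minimal sketch: `workload` is a hypothetical stand-in for `main.py`, and the profile is written to a temporary directory so that nothing in the repository is overwritten.

```python
import cProfile
import os
import pstats
import tempfile


def workload() -> int:
    # Hypothetical stand-in for main.py: any callable can be profiled this way.
    return sum(i * i for i in range(100_000))


# Write the profile to a temporary directory (an assumption of this sketch).
profile_path = os.path.join(tempfile.mkdtemp(), "benchmark.cprofile")

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()
profiler.dump_stats(profile_path)

# Load the dump and print the 20 entries with the largest internal time.
stats = pstats.Stats(profile_path)
stats.sort_stats("tottime").print_stats(20)
```

Sorting by `tottime` ranks functions by the time spent in their own body, excluding callees, which is the quantity of interest when looking for a single hot spot.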

nelimee avatar Mar 20 '25 09:03 nelimee

See https://drive.google.com/file/d/1k__NFlI1kDSV-wf9Gcf3ipnP85PUW6me/view?usp=drive_link.

inmzhang avatar Mar 20 '25 09:03 inmzhang

Thanks a lot for the quick answer! Let's summarise.

pyinstrument

Note that the following screenshots do not show all the places where Moment.instructions is used, but show enough to highlight the issue.

From @KabirDubey with a M3 Apple chip on MacOS:

Image

From myself on a Ryzen 9 5950X on Archlinux:

Image

Conclusion: the relative time taken by Moment.instructions is one order of magnitude higher on Kabir's laptop, which hints at an issue. Note that the absolute time is not relevant here, because the benchmark settings are very different. It is expected that an M3 chip is slower (because it is a laptop chip that is optimised for energy consumption), but it should be slower everywhere, not just on one part.

line_profiler

This is less visible but the lines involving Moment.instructions take:

  • 60% to 70% of the benchmarked function time on Yiming's laptop (M1 Pro),
  • ~20% of the benchmarked function time on my computer.

So even though this is not an order of magnitude, there is still a large discrepancy.

cProfile

Profiling the exact same main.py with

>>> python -m cProfile -o benchmark.cprofile main.py

and analysing the results with

>>> python -m pstats benchmark.cprofile
Welcome to the profile statistics browser.
benchmark.cprofile% sort tottime
benchmark.cprofile% stats 20

outputs the following for Yiming's (M1 Pro) laptop:

Thu Mar 20 12:52:21 2025    Yiming_M1Pro.cprofile

         35610013 function calls (35274253 primitive calls) in 112.453 seconds

   Ordered by: internal time
   List reduced from 8098 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   534414   83.650    0.000   83.650    0.000 tqec/src/tqec/circuit/moment.py:270(instructions)
21630/1344    7.295    0.000    0.429    0.000 tqec/src/tqec/circuit/qubit.py:107(count_qubit_accesses)
    41916    5.757    0.000    5.760    0.000 tqec/src/tqec/circuit/moment.py:122(<genexpr>)
    18681    1.832    0.000    1.833    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:201(<genexpr>)
    49214    0.964    0.000    0.979    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:75(to_stim_pauli_string)
    10059    0.916    0.000    0.917    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:287(<genexpr>)
      102    0.785    0.008    0.788    0.008 {built-in method _imp.create_dynamic}
  2043366    0.704    0.000    1.003    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:127(anticommutes)
  7376816    0.543    0.000    0.543    0.000 {method 'keys' of 'dict' objects}
    20118    0.472    0.000    0.474    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:115(<genexpr>)
     1437    0.460    0.000    0.469    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:165(has_only_reset_or_is_virtual)
     7185    0.459    0.000    0.463    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:162(<genexpr>)
     7185    0.459    0.000    0.460    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:127(<genexpr>)
     1437    0.458    0.000    0.462    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:130(has_only_measurement_or_is_virtual)
   1341/0    0.384    0.000    0.000          tqec/src/tqec/circuit/moment.py:334(with_mapped_qubit_indices)
    49214    0.376    0.000    0.489    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:61(from_stim_pauli_string)
    68774    0.324    0.000    1.434    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:135(collapse_by)
   894240    0.293    0.000    0.423    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:177(overlaps)
   186112    0.229    0.000    0.351    0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:33(__init__)
     1372    0.212    0.000    0.212    0.000 tqec/src/tqec/circuit/moment.py:153(<genexpr>)

and the following on my computer:

Thu Mar 20 09:03:13 2025    cprofile.txt

         35542180 function calls (35208252 primitive calls) in 16.457 seconds

   Ordered by: internal time
   List reduced from 7959 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   534414    1.606    0.000    1.606    0.000 /workspaces/tqec/src/tqec/circuit/moment.py:270(instructions)
  2043538    1.257    0.000    1.862    0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:127(anticommutes)
  7377154    1.099    0.000    1.099    0.000 {method 'keys' of 'dict' objects}
21630/1344    0.777    0.000    0.014    0.000 /workspaces/tqec/src/tqec/circuit/qubit.py:107(count_qubit_accesses)
    49214    0.699    0.000    0.915    0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:61(from_stim_pauli_string)
   894240    0.531    0.000    0.795    0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:177(overlaps)
    68774    0.474    0.000    2.426    0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:135(collapse_by)
   1341/0    0.429    0.000    0.000          /workspaces/tqec/src/tqec/circuit/moment.py:334(with_mapped_qubit_indices)
   186112    0.408    0.000    0.647    0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:33(__init__)
1293686/1215862    0.266    0.000    0.281    0.000 {built-in method builtins.len}
    49214    0.263    0.000    0.293    0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:75(to_stim_pauli_string)
  1347765    0.256    0.000    0.459    0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:48(qubits)
  1230621    0.225    0.000    0.225    0.000 {method 'append' of 'list' objects}
    74439    0.224    0.000    0.695    0.000 /workspaces/tqec/src/tqec/circuit/schedule/circuit.py:29(__init__)
  1205691    0.206    0.000    1.294    0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:123(commutes)
  1119357    0.203    0.000    0.324    0.000 /workspaces/tqec/src/tqec/circuit/qubit.py:62(__hash__)
216608/216605    0.198    0.000    0.525    0.000 {built-in method builtins.sorted}
     1495    0.181    0.000    0.181    0.000 {method 'read' of '_io.BufferedReader' objects}
    51832    0.174    0.000    1.262    0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/boundary.py:13(__init__)
   889679    0.170    0.000    0.943    0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/boundary.py:48(<genexpr>)

This confirms what the other two profiling approaches led us to think: Moment.instructions is one of the major contributors to the performance issues on MacOS.
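
To put rough numbers on that, the share of the total run time and the average cost per call can be computed from the `instructions` rows and the totals of the two tables above. This is only a back-of-the-envelope sketch; the dictionary keys are labels chosen for illustration.

```python
# Figures copied from the `instructions` rows and the totals of the two
# cProfile tables above. The dictionary keys are illustrative labels only.
profiles = {
    "Yiming (M1 Pro, MacOS)": {"tottime": 83.650, "total": 112.453, "ncalls": 534414},
    "nelimee (GNU/Linux)": {"tottime": 1.606, "total": 16.457, "ncalls": 534414},
}

shares = {}
per_call_us = {}
for name, p in profiles.items():
    # Fraction of the whole run spent inside Moment.instructions itself.
    shares[name] = p["tottime"] / p["total"]
    # Average internal time per call, in microseconds.
    per_call_us[name] = p["tottime"] / p["ncalls"] * 1e6
    print(f"{name}: {shares[name]:.1%} of the run, {per_call_us[name]:.1f} us per call")
```

Both machines make the same number of calls, so the gap comes from the cost of each call: roughly three quarters of the run on the M1 Pro against about a tenth on my computer.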

Replication

Now, the goal is to replicate that performance issue with a small, analysable piece of code. My last two attempts failed to do so, showing similar results between MacOS and my computer. Does anyone have any idea how to replicate it? I am open to ideas :)

nelimee avatar Mar 20 '25 13:03 nelimee

I can take a closer look and compare the Mac and Linux machines I have, but I can only do that next week.

inmzhang avatar Mar 20 '25 15:03 inmzhang

@nelimee @inmzhang I am also a Mac user. I can perform the benchmarks too, but I wanted to look at the code to see what might be going on.

As you pointed out previously, casting the stim.Circuit into an iterator (or, in one case, a list) seems to be the bottleneck. Taking a glance at the implementation, their implementation is in C++ and involves a lot of memory operations (understandably, given how important this class is). It may be worth reaching out to the quantumlib team to inquire if any of their testing/optimization had been done on Macs or other arm processors. Likewise, they may have a recommendation on how to better extract the instructions. Just an idea I had :-)

smburdick avatar Mar 22 '25 03:03 smburdick

Taking a glance at the implementation, their implementation is in C++ and involves a lot of memory operations (understandably, given how important this class is).

For reference, because I did not find the correct information directly:

Python "iterable" is defined as

[...] objects of any classes you define with an __iter__() method or with a __getitem__() method that implements sequence semantics.

in the glossary.

It turns out that stim.Circuit does not implement the __iter__ method (see reference and implementation), so the iteration is done using __getitem__ that is implemented with circuit_get_item.
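
The fallback itself is easy to observe in pure Python. The sketch below uses a hypothetical `IndexOnly` class as a stand-in (it is not stim's code): the class defines no `__iter__`, yet it can be iterated, and each step of the iteration goes through `__getitem__` with increasing indices until an `IndexError` is raised.

```python
class IndexOnly:
    """Hypothetical container exposing __len__ and __getitem__ but no __iter__."""

    def __init__(self, items):
        self._items = list(items)
        self.getitem_calls = []  # records every index requested

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        self.getitem_calls.append(index)
        # Raises IndexError past the end, which terminates the iteration.
        return self._items[index]


container = IndexOnly(["H", "CX", "M"])

# Python falls back to the sequence protocol: __getitem__(0), (1), (2), ...
collected = [item for item in container]

print(hasattr(IndexOnly, "__iter__"))
print(collected)
print(container.getitem_calls)
```

With this protocol every yielded element costs one `__getitem__` call, plus one final call that raises `IndexError`, so any per-call overhead in `__getitem__` is paid once per instruction.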

Also, the stim CI builds on MacOS.

It may be worth reaching out to the quantumlib team to inquire if any of their testing/optimization had been done on Macs or other arm processors.

Why not, but the fact that we are not able to replicate the issue on a small example hints that the issue is at least not entirely due to stim.

nelimee avatar Mar 22 '25 07:03 nelimee

@nelimee Following up on this for the sake of completeness. The source of my delay was a missing -e flag.

Results from the `line_profiler` test
Timer unit: 1e-09 s

Total time: 47.714 s
File: /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/schedule/manipulation.py
Function: merge_scheduled_circuits at line 244

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   244                                           @profile
   245                                           def merge_scheduled_circuits(
   246                                               circuits: list[ScheduledCircuit],
   247                                               global_qubit_map: QubitMap,
   248                                               mergeable_instructions: Iterable[str] = (),
   249                                           ) -> ScheduledCircuit:
   250                                               """Merge several :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit`
   251                                               instances into one instance.
   252                                           
   253                                               This function takes several **compatible** scheduled circuits as input and
   254                                               merge them, respecting their schedules, into a unique
   255                                               :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit` instance that will
   256                                               then be returned to the caller.
   257                                           
   258                                               The provided circuits should be compatible between each other. Compatible
   259                                               circuits are circuits that can all be described with a unique global qubit
   260                                               map. In other words, if two circuits from the list of compatible circuits
   261                                               use the same qubit index, that should mean that they use the same qubit.
   262                                               You can obtain compatible circuits by using
   263                                               :func:`relabel_circuits_qubit_indices`.
   264                                           
   265                                               Args:
   266                                                   circuits: **compatible** circuits to merge.
   267                                                   qubit_map: global qubit map for all the provided ``circuits``.
   268                                                   mergeable_instructions: a list of instruction names that are considered
   269                                                       mergeable. Duplicate instructions with a name in this list will be
   270                                                       merged into a single instruction.
   271                                           
   272                                               Returns:
   273                                                   a circuit representing the merged scheduled circuits given as input.
   274                                               """
   275      1449   25431000.0  17550.7      0.1      scheduled_circuits = _ScheduledCircuits(circuits, global_qubit_map)
   276                                           
   277      1449     291000.0    200.8      0.0      all_moments: list[Moment] = []
   278      1449    4693000.0   3238.8      0.0      all_schedules = Schedule()
   279     54222   36757000.0    677.9      0.1      global_i2q = QubitMap({i: q for q, i in scheduled_circuits.q2i.items()})
   280                                           
   281     10143   34618000.0   3413.0      0.1      while scheduled_circuits.has_pending_moment():
   282      8694  237349000.0  27300.3      0.5          schedule, moments = scheduled_circuits.collect_moments_at_minimum_schedule()
   283                                                   # Flatten the moments into a list of operations to perform some modifications
   284     17388        3e+10    2e+06     64.0          instructions: list[stim.CircuitInstruction] = sum(
   285      8694    2723000.0    313.2      0.0              (list(moment.instructions) for moment in moments), start=[]
   286                                                   )
   287                                                   # Avoid duplicated operations. Any operation that have the Plaquette.get_mergeable_tag() tag
   288                                                   # is considered mergeable, and can be removed if another operation in the list
   289                                                   # is considered equal (and has the mergeable tag).
   290     17388 8728432000.0 501980.2     18.3          deduplicated_instructions = remove_duplicate_instructions(
   291      8694    1382000.0    159.0      0.0              instructions,
   292      8694    4887000.0    562.1      0.0              mergeable_instruction_names=frozenset(mergeable_instructions),
   293                                                   )
   294      8694  222691000.0  25614.3      0.5          merged_instructions = merge_instructions(deduplicated_instructions)
   295      8694    5720000.0    657.9      0.0          circuit = stim.Circuit()
   296     20286    7168000.0    353.3      0.0          for inst in merged_instructions:
   297     23184  759537000.0  32761.3      1.6              circuit.append(
   298     11592    3960000.0    341.6      0.0                  inst.name,
   299     11592  219231000.0  18912.3      0.5                  sum(_sort_target_groups(inst.target_groups()), start=[]),
   300     11592    4471000.0    385.7      0.0                  inst.gate_args_copy(),
   301                                                       )
   302      8694 6837622000.0 786476.0     14.3          all_moments.append(Moment(circuit))
   303      8694   28379000.0   3264.2      0.1          all_schedules.append(schedule)
   304                                           
   305      1449   25137000.0  17347.8      0.1      return ScheduledCircuit(all_moments, all_schedules, global_i2q, _avoid_checks=True)

Total time: 49.9701 s
File: /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/moment.py
Function: with_mapped_qubit_indices at line 346

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   346                                               @profile
   347                                               def with_mapped_qubit_indices(self, qubit_index_map: dict[int, int]) -> Moment:
   348                                                   """Map the qubits **indices** the :class:`Moment` instance is applied
   349                                                   on.
   350                                           
   351                                                   Note:
   352                                                       This method has to iterate over all the instructions in ``self`` and
   353                                                       change the gate target they are applied on.
   354                                           
   355                                                   Args:
   356                                                       qubit_index_map: the map used to modify the qubit targets.
   357                                           
   358                                                   Returns:
   359                                                       a modified copy of ``self`` with the qubit gate targets mapped according
   360                                                       to the provided ``qubit_index_map``.
   361                                                   """
   362    135756  101568000.0    748.2      0.2          circuit = stim.Circuit()
   363    276974        3e+10 117635.0     65.2          for instr in self.instructions:
   364    141218   21062000.0    149.1      0.0              mapped_targets: list[stim.GateTarget] = []
   365    576297  261380000.0    453.6      0.5              for target in instr.targets_copy():
   366                                                           # Non qubit targets are left untouched.
   367    435079  133660000.0    307.2      0.3                  if not target.is_qubit_target:
   368                                                               mapped_targets.append(target)
   369                                                               continue
   370                                                           # Qubit targets are mapped using `qubit_index_map`
   371    435079  201052000.0    462.1      0.4                  target_qubit = cast(int, target.qubit_value)
   372    870158  139957000.0    160.8      0.3                  mapped_targets.append(
   373    435079 7462386000.0  17151.8     14.9                      stim.GateTarget(qubit_index_map[target_qubit])
   374    435079  131042000.0    301.2      0.3                      if not target.is_inverted_result_target
   375                                                               else stim.GateTarget(-qubit_index_map[target_qubit])
   376                                                           )
   377    141218 8588849000.0  60819.8     17.2              circuit.append(instr.name, mapped_targets, instr.gate_args_copy())
   378    271512  179662000.0    661.7      0.4          return Moment(
   379    135756   19198000.0    141.4      0.0              circuit,
   380    570835  134652000.0    235.9      0.3              used_qubits={qubit_index_map[q] for q in self._used_qubits},
   381    135756   13812000.0    101.7      0.0              _avoid_checks=True,
   382                                                   )

 47.71 seconds - /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/schedule/manipulation.py:244 - merge_scheduled_circuits
 49.97 seconds - /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/moment.py:346 - with_mapped_qubit_indices

I also ran the focused benchmark on the Moment class.

Results from the `FakeMomentClass` benchmark
    10 rounds,     1584 instructions done in 1.57ms.
   100 rounds,    13644 instructions done in 11.81ms.
  1000 rounds,   134244 instructions done in 108.70ms.
 10000 rounds,  1340244 instructions done in 1066.46ms.
100000 rounds, 13400244 instructions done in 11669.82ms.

~~Also tried the same benchmark, with cprofile. See Kabir_M3Air.cprofile in our drive. My file is 17 KB, but Yiming's 1.2 MB--not sure how to explain this.~~

~~Here's are the sorted total times:~~
(.tqec-venv) kabirdubey@Kabirs-MacBook-Air tqec % python -m pstats benchmark.cprofile
Welcome to the profile statistics browser.
benchmark.cprofile% sort tottime
benchmark.cprofile% stats 20
Mon Mar 24 19:27:08 2025    benchmark.cprofile

         29780667 function calls (29780654 primitive calls) in 14.799 seconds

   Ordered by: internal time
   List reduced from 143 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 14889965   11.120    0.000   11.120    0.000 main.py:10(instructions)
 14889965    2.957    0.000   14.077    0.000 main.py:21(<genexpr>)
        5    0.700    0.140   14.778    2.956 {built-in method builtins.sum}
        2    0.018    0.009    0.018    0.009 {built-in method _imp.create_dynamic}
        5    0.002    0.000    0.002    0.000 {built-in method stim._stim_polyfill.generated}
        1    0.000    0.000    0.000    0.000 {built-in method posix.listdir}
        1    0.000    0.000    0.000    0.000 {method 'read' of '_io.BufferedReader' objects}
        5    0.000    0.000    0.000    0.000 {built-in method builtins.print}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       16    0.000    0.000    0.000    0.000 {built-in method posix.stat}
        7    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1593(find_spec)
        1    0.000    0.000    0.000    0.000 {built-in method _io.open_code}
       27    0.000    0.000    0.000    0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/stim/__init__.py:30(_pytest_pycharm_pybind_repr_bug_workaround)
        1    0.000    0.000    0.019    0.019 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/stim/__init__.py:1(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method posix.getcwd}
      3/1    0.000    0.000    0.019    0.019 <frozen importlib._bootstrap>:1349(_find_and_load)
       29    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:126(_path_join)
        3    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:1240(_find_spec)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.__build_class__}
       10    0.000    0.000    0.000    0.000 {built-in method time.time_ns} 

~~I am working on a local copy of the latest version of main.~~

~~Contents of my main.py file~~
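import time
from typing import Iterator

import stim


class FakeMoment:
    """Minimal wrapper around a stim.Circuit that exposes its instructions."""

    def __init__(self, circuit: stim.Circuit) -> None:
        self._circuit = circuit

    @property
    def instructions(self) -> Iterator[stim.CircuitInstruction]:
        # Yield each top-level instruction of the wrapped circuit.
        yield from self._circuit


# Time a full pass over the instructions for increasingly long circuits.
for rounds in [10, 100, 1000, 10000, 100000]:
    # flattened() unrolls REPEAT blocks, so the instruction count grows with rounds.
    circuit = stim.Circuit.generated(
        "surface_code:rotated_memory_z", distance=11, rounds=rounds
    ).flattened()
    moment = FakeMoment(circuit)

    # Only the iteration is timed, not the circuit generation.
    start = time.time_ns()
    instructions_count = sum(1 for _ in moment.instructions)
    end = time.time_ns()
    print(
        f"{rounds:>6} rounds, {instructions_count:>8} instructions done "
        f"in {(end - start) / 10**6:.2f}ms."  # nanoseconds to milliseconds
    )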
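
Since a single wall-clock measurement can be noisy on a busy laptop, here is a rough sketch of how the same kind of timing could be repeated and reduced to a best-of-N value. It is not part of the replication test above: `best_of` is a hypothetical helper, and a plain `range` stands in for the stim circuit so the snippet runs without stim installed.

```python
import time
from collections.abc import Callable, Iterable


def best_of(
    make_iterable: Callable[[], Iterable[object]], repeats: int = 5
) -> tuple[int, float]:
    """Return (item count, best elapsed time in ms) over `repeats` runs."""
    best_ns = None
    count = 0
    for _ in range(repeats):
        iterable = make_iterable()  # rebuilt outside the timed region
        start = time.perf_counter_ns()  # monotonic clock meant for intervals
        count = sum(1 for _ in iterable)
        elapsed = time.perf_counter_ns() - start
        best_ns = elapsed if best_ns is None else min(best_ns, elapsed)
    return count, best_ns / 10**6


# Stand-in workload (assumption): a range instead of moment.instructions.
for size in [10, 100, 1000]:
    count, best_ms = best_of(lambda: range(size))
    print(f"{size:>6} items, {count:>8} iterated, best of 5 in {best_ms:.4f}ms.")
```

To apply it to the script above, one could pass `lambda: FakeMoment(circuit).instructions` as `make_iterable`. Taking the minimum rather than the mean is a design choice that tends to filter out interference from other processes.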

EDIT: struck through the cProfile benchmark of the FakeMoment class.

KabirDubey avatar Mar 25 '25 00:03 KabirDubey

Also tried the same benchmark with cProfile. See Kabir_M3Air.cprofile in our drive. My file is 17 KB, but Yiming's is 1.2 MB; not sure how to explain this.

That is because you benchmarked the replication test whereas Yiming benchmarked main.py.

nelimee avatar Mar 25 '25 11:03 nelimee

That is because you benchmarked the replication test whereas Yiming benchmarked main.py.

My bad, I didn't realize main.py was a fixed file. I updated the drive with my cProfile file, and here are my timing stats:

(.tqec-venv) kabirdubey@Kabirs-MacBook-Air tqec % python -m pstats benchmark.cprofile
Welcome to the profile statistics browser.
benchmark.cprofile% sort tottime
benchmark.cprofile% stats 20
Tue Mar 25 16:13:27 2025    benchmark.cprofile

         34965312 function calls (34642524 primitive calls) in 84.391 seconds

   Ordered by: internal time
   List reduced from 7996 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   534414   61.637    0.000   61.637    0.000 /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/moment.py:282(instructions)
21630/1344    5.548    0.000    0.319    0.000 /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/qubit.py:107(count_qubit_accesses)
    41916    4.261    0.000    4.264    0.000 /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/moment.py:123(<genexpr>)
    18681    1.359    0.000    1.361    0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/tqecd/utils.py:201(<genexpr>)
    49214    0.864    0.000    0.879    0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/tqecd/pauli.py:75(to_stim_pauli_string)
    10059    0.680    0.000    0.680    0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/tqecd/utils.py:287(<genexpr>)
  2044067    0.604    0.000    0.865    0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/tqecd/pauli.py:127(anticommutes)
  7378213    0.473    0.000    0.473    0.000 {method 'keys' of 'dict' objects}
    49214    0.397    0.000    0.507    0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/tqecd/pauli.py:61(from_stim_pauli_string)
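
For anyone who prefers a script over the interactive browser, the session above can be approximated with the `pstats` API. This is only a sketch: `busy` is a hypothetical stand-in workload (the thread profiles main.py), `top_by_tottime` is a helper name made up for illustration, and the dump is written to a temporary directory so the snippet is self-contained.

```python
import cProfile
import pstats
import tempfile
from pathlib import Path


def busy(n: int) -> int:
    # Hypothetical stand-in workload; the thread profiles main.py instead.
    return sum(i * i for i in range(n))


def top_by_tottime(dump_path: str, limit: int = 20) -> list[tuple[str, float, float]]:
    """Return (function name, tottime in s, share of total) rows, largest first."""
    stats = pstats.Stats(dump_path)
    # Scripted equivalent of `sort tottime` followed by `stats 20`.
    stats.sort_stats("tottime").print_stats(limit)
    profile = stats.get_stats_profile()  # available since Python 3.9
    total = profile.total_tt
    # Note: func_profiles is keyed by bare function name, so same-named
    # functions from different files collapse into a single entry.
    rows = [
        (name, fp.tottime, fp.tottime / total if total > 0 else 0.0)
        for name, fp in profile.func_profiles.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


with tempfile.TemporaryDirectory() as tmp:
    dump = str(Path(tmp) / "benchmark.cprofile")
    profiler = cProfile.Profile()
    profiler.enable()
    busy(200_000)
    profiler.disable()
    profiler.dump_stats(dump)
    rows = top_by_tottime(dump)

for name, tottime, share in rows[:3]:
    print(f"{name}: {tottime:.3f}s ({share:.1%} of profiled time)")
```

Computing the share matters here because the percall column rounds to 0.000. Using the numbers in the listing above, moment.py:282(instructions) accounts for 61.637 s of the 84.391 s total, roughly 73%, or about 115 µs per call over 534414 calls.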

KabirDubey avatar Mar 25 '25 21:03 KabirDubey