Improve performance for MacOS users
Describe the bug
It seems like MacOS users are experiencing poor performance when building circuits with TQEC. It would be nice to be able to measure that objectively.
Steps to reproduce the behavior
With main.py being
from tqec import Basis, NoiseModel, compile_block_graph
from tqec.gallery.cnot import cnot
block_graph = cnot(Basis.Z)
correlation_surfaces = block_graph.find_correlation_surfaces()
compiled_computation = compile_block_graph(
block_graph, observables=[correlation_surfaces[1]]
)
circuit = compiled_computation.generate_stim_circuit(
k=2,
noise_model=NoiseModel.uniform_depolarizing(0.001),
)
do the following
python -m pip install tqec[bench]
python -m pyinstrument -o benchmark.html -r html main.py
while keeping the background load on your computer as low as possible (if possible, close all other applications, leave the machine idle during the benchmark, ...).
Then, share the following information:
- the benchmark.html file that has been generated (I have vague memories of GitHub not accepting such files as attachments; if that is still the case, I'll open a discussion on the Google group),
- as many details about your computer as you can (OS, processor, amount of RAM, Python version, output of python -m pip freeze, ...).
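To make the reported details comparable between machines, most of them can be collected with the standard library. The following is only a sketch: collect_system_info is a name made up for this example, and the amount of RAM is not included because the standard library has no portable way to query it.

```python
import platform
import sys


def collect_system_info() -> dict[str, str]:
    """Gather basic machine details worth attaching to a benchmark report."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        # platform.processor() may return an empty string on some platforms.
        "processor": platform.processor(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "executable": sys.executable or "",
    }


if __name__ == "__main__":
    for key, value in collect_system_info().items():
        print(f"{key}: {value}")
```

The output of python -m pip freeze still has to be attached separately.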
For laptop users only:
- first, do the benchmark with your regular setup (i.e., without touching anything related to power),
- if you have the time to do so, it would also be interesting to re-do the benchmark with your laptop plugged-in and in charge mode,
- if you have even more time and willingness it would be interesting to try to disable power saving options and re-do the benchmark.
For reference, on my computer:
- python main.py takes ~14.5s,
- python -m pyinstrument -o benchmark.html -r html main.py takes ~22.5s.
I can confirm that this issue seems to affect mostly Mac users.
main.py takes approximately 11-12 s on my ThinkPad.
- OS: Linux Mint 21.3 Cinnamon 6.0.4
- Processor: 13th Gen Intel Core i5-1335U x 10
- RAM: 16 GB
- Python version: 3.13.2
Link to html file: https://drive.google.com/file/d/1cmDfktq1KtC6ZVfJVsPPWRLW7YNMogT0/view?usp=sharing
- OS: Arch Linux
- Processor: Intel i5-14600KF (20) @ 5.30 GHz
- RAM: 32GB
- Python: 3.12.6
python main.py ~10s
python -m pyinstrument -o benchmark.html -r html main.py ~14s
After a discussion and live testing with Ángela:
At first glance, the problem seems to be independent from a particular tqec module: it seems like every function call is slowed down. I will dig more into this later.
As a first "solution" for MacOS users (everyone will benefit from this, but MacOS users will likely see a huge improvement), you can try to use DetectorDatabase.
from pathlib import Path
from tqec import Basis, NoiseModel, compile_block_graph
from tqec.compile.detectors.database import DetectorDatabase
from tqec.gallery.cnot import cnot
block_graph = cnot(Basis.Z)
correlation_surfaces = block_graph.find_correlation_surfaces()
compiled_computation = compile_block_graph(
block_graph, observables=[correlation_surfaces[1]]
)
database_path = Path("./database_cnot.pkl")
if database_path.exists():
database = DetectorDatabase.from_file(database_path)
else:
database = DetectorDatabase()
circuit = compiled_computation.generate_stim_circuit(
k=2,
noise_model=NoiseModel.uniform_depolarizing(0.001),
detector_database=database,
)
database.to_file(database_path)
A few subtleties that should be noted:
- the first run will only see a modest improvement because the database still needs to be populated, although some computations can already be avoided,
- the second run (and any subsequent run with a populated database) will see a huge boost in performance,
- the database should be valid whatever the computation / value of k: you can re-use the same database, over and over again, even when changing the computation or the value of k. Note that in the code above, the database is unconditionally saved, overwriting the existing one,
- there is a plateau phenomenon on k: for small values of k (something like [1, 5], but that depends on the computation), increasing k also increases the time it takes to generate the circuit. As soon as the plateau is reached, increasing k should have a negligible impact on performance. In other words, generating with a populated database for k=20 and for k=30 should take a similar time.
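One way to observe the plateau on k is to time the generation for a few values of k against an already populated database. The helper here is only a sketch: time_per_k and its generate argument are names invented for this example and are not part of tqec.

```python
import time
from collections.abc import Callable, Iterable


def time_per_k(generate: Callable[[int], object], ks: Iterable[int]) -> dict[int, float]:
    """Return the wall-clock time (in seconds) taken by ``generate(k)`` for each k."""
    timings: dict[int, float] = {}
    for k in ks:
        start = time.perf_counter()
        generate(k)
        timings[k] = time.perf_counter() - start
    return timings


# Hypothetical usage, re-using the objects defined in the script above:
# timings = time_per_k(
#     lambda k: compiled_computation.generate_stim_circuit(
#         k=k,
#         noise_model=NoiseModel.uniform_depolarizing(0.001),
#         detector_database=database,
#     ),
#     ks=[1, 2, 3, 5, 10, 20],
# )
# for k, seconds in timings.items():
#     print(f"k={k:>2}: {seconds:.2f}s")
```

If the plateau behaves as described, the timings should stop growing noticeably after the first few values of k.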
For reference, for the CNOT with k=2, on my computer:
- Without database: ~16s.
- First run with the database: ~10s.
- Second and subsequent run with the database fully populated: 2s.
From memory, on Ángela's M1 Mac:
- Without database: ~180s.
- First run with the database: ~80s.
- Second and subsequent run with the database fully populated: 8s.
Note that, for my computer, I made a few benchmarks when introducing the DetectorDatabase. You can find them in this comment.
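To put the timings in perspective, the speed-ups they imply can be computed directly. This is plain arithmetic on the numbers quoted in this comment (the Mac figures being from memory); the speedups helper is a name made up for the example.

```python
# Timings in seconds, copied from the figures quoted in this comment.
timings = {
    "author's computer": {"no_db": 16.0, "first_run": 10.0, "populated": 2.0},
    "M1 Mac (from memory)": {"no_db": 180.0, "first_run": 80.0, "populated": 8.0},
}


def speedups(entry: dict[str, float]) -> tuple[float, float]:
    """Return (first-run speed-up, populated-database speed-up) over no database."""
    return entry["no_db"] / entry["first_run"], entry["no_db"] / entry["populated"]


for machine, entry in timings.items():
    first, populated = speedups(entry)
    print(f"{machine}: x{first:.2f} on the first run, x{populated:.1f} once populated")
```

With these numbers, a populated database gives roughly an 8x speed-up on the first machine and about 22x on the Mac.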
I'm running Windows on quite an old ThinkPad and this made a big difference to me too:
- Without database: 94s.
- 1st run with database: 50s.
- 2nd run with database: 10s.
Computer specs:
- OS: Windows 10.
- Processor: Intel i5-3320M @ 2.60 GHz.
- RAM: 8GB.
- Python: 3.12.6.
Computer specs:
- Asus Laptop. x64.
- OS: Windows 11.
- Processor: Intel i7-1065G7 @ 1.30GHz, 4 cores.
- RAM: 16GB
- Video: Intel Iris Plus Graphics.
- Python. Python 3.12.6 (running from venv).
Times:
- main.py: 29s (plugged in), 32s (unplugged and/or with an external screen attached).
- python -m pyinstrument -o benchmark.html -r html main.py: 40s (plugged in), 44s (unplugged and/or with an external screen attached).
Looking at the benchmarks everyone sent, it seems like there might be an issue with the following line:
https://github.com/tqec/tqec/blob/503c7dca6297a4fc7f79a46480389d7b0dcf299f/src/tqec/circuit/moment.py#L281
From the provided benchmarks, at one place in the code, the above line takes:
- 3.4% of the total execution time on my computer (Linux),
- 7.8% of the total execution time on J's computer (Windows 11),
- 17.1% of the total execution time on Ángela's computer (MacOS with M2 chip),
- 33.6% of the total execution time on Kabir's computer (MacOS with M3 chip).
I tried to replicate the workload with the following code:
import time
from typing import Iterator
import stim
def iterate_flat_circuit(circuit: stim.Circuit) -> Iterator[stim.CircuitInstruction]:
yield from circuit # type: ignore
for rounds in [10, 100, 1000, 10000, 100000]:
circuit = stim.Circuit.generated(
"surface_code:rotated_memory_z", distance=11, rounds=rounds
).flattened()
start = time.time_ns()
instructions_count = sum(1 for _ in iterate_flat_circuit(circuit))
end = time.time_ns()
print(
f"{rounds:>6} rounds, {instructions_count:>8} instructions done "
f"in {(end - start) / 10**6:.2f}ms."
)
You do not have to use pyinstrument anymore, just run the code and copy-paste the output here.
For reference, on my computer, here are the results:
10 rounds, 1584 instructions done in 1.25ms.
100 rounds, 13644 instructions done in 9.59ms.
1000 rounds, 134244 instructions done in 91.45ms.
10000 rounds, 1340244 instructions done in 907.99ms.
100000 rounds, 13400244 instructions done in 9076.81ms.
For MacOS users, you do not have to finish the benchmark. If it takes too much time on your machine, stop the execution and report only what has been benchmarked. Note that the time scales linearly on my machine, which is exactly what is expected, so having the first 3 points should already be sufficient to have a good enough idea of the performance.
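The linear scaling can be checked by converting the reference results into a cost per instruction. This is just arithmetic on the numbers printed above; ns_per_instruction is a helper name made up for the example.

```python
# (instruction count, milliseconds) pairs copied from the reference output above.
reference = [
    (1584, 1.25),
    (13644, 9.59),
    (134244, 91.45),
    (1340244, 907.99),
    (13400244, 9076.81),
]


def ns_per_instruction(instructions: int, milliseconds: float) -> float:
    """Convert a total time in milliseconds into nanoseconds per instruction."""
    return milliseconds * 1e6 / instructions


costs = [ns_per_instruction(n, ms) for n, ms in reference]
for (n, _), cost in zip(reference, costs):
    print(f"{n:>8} instructions: {cost:6.1f} ns per instruction")
```

On these numbers the cost settles a little under 700 ns per instruction for the larger runs, with only the smallest run being noticeably higher, which is consistent with a fixed overhead plus a linear term.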
If MacOS users are experiencing slowdowns, then that may be due to pre-compiled binaries of stim not being as well optimised on MacOS as on Linux. More investigations will have to be performed once we have the benchmark results of everyone.
Note that the last message was a call for action. I do not have any MacOS-based machine on hand, so I cannot benchmark things by myself. If you have a MacOS-based machine and would like TQEC to be more efficient, please answer here with the timings returned by the code above.
I can do a benchmark on my M1 Pro Mac tomorrow.
10 rounds, 1584 instructions done in 2.26ms.
100 rounds, 13644 instructions done in 11.19ms.
1000 rounds, 134244 instructions done in 111.59ms.
10000 rounds, 1340244 instructions done in 1094.66ms.
100000 rounds, 13400244 instructions done in 11764.16ms.
Ran this on a 16 GB Apple M3 macOS 14.6.1. I wrote some more specs in the benchmarks thread on the Google group with subject "Sharing benchmarks" (major differences are that my laptop was charging and I was running more apps). Thanks, Adrien!
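For comparison with the reference results posted earlier in the thread, the slowdown factor can be computed per round count. This is a sketch using only the two sets of printed timings; the variable names are arbitrary.

```python
rounds = [10, 100, 1000, 10000, 100000]
# Milliseconds, copied from the two outputs posted in this thread.
reference_ms = [1.25, 9.59, 91.45, 907.99, 9076.81]  # reference machine
m3_ms = [2.26, 11.19, 111.59, 1094.66, 11764.16]  # Apple M3, macOS 14.6.1

ratios = [mac / ref for mac, ref in zip(m3_ms, reference_ms)]
for r, ratio in zip(rounds, ratios):
    print(f"{r:>6} rounds: M3 is x{ratio:.2f} slower than the reference")
```

Apart from the smallest run, the factor stays between roughly 1.2 and 1.3, far from the order-of-magnitude gap reported for the full main.py script.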
Humm, that's not what I expected. Could you please re-run the original benchmark?
main.py
from tqec import Basis, NoiseModel, compile_block_graph
from tqec.gallery.cnot import cnot
block_graph = cnot(Basis.Z)
correlation_surfaces = block_graph.find_correlation_surfaces()
compiled_computation = compile_block_graph(
block_graph, observables=[correlation_surfaces[1]]
)
circuit = compiled_computation.generate_stim_circuit(
k=2,
noise_model=NoiseModel.uniform_depolarizing(0.001),
)
and
python -m pyinstrument -o benchmark.html -r html main.py
and share the resulting .html file in the shared Drive folder linked here: https://groups.google.com/g/tqec-design-automation/c/fUvzugEbNyY ? Please run that benchmark with your laptop charging if possible, and do not overwrite the benchmark you made on battery, as I would like to compare both.
Done, see file titled kabir_laptop_charging
Ok, let's change the profiling library to get a different granularity (anyone with a Mac is encouraged to do so, the more data we get, the quicker we might be able to spot the performance problem).
First, install the line_profiler package with python -m pip install line_profiler.
In src/tqec/circuit/moment.py add the following lines:
from line_profiler import profile
# Code from Moment class ...
@profile
# def with_mapped_qubit_indices(self, qubit_index_map: dict[int, int]) -> Moment:
In other words, add the profile decorator from the line_profiler package to the with_mapped_qubit_indices method from the Moment class in src/tqec/circuit/moment.py.
In src/tqec/circuit/schedule/manipulation.py do the same:
from line_profiler import profile
# Code for several functions
@profile
# def merge_scheduled_circuits(
# circuits: list[ScheduledCircuit],
# global_qubit_map: QubitMap,
# mergeable_instructions: Iterable[str] = (),
# ) -> ScheduledCircuit:
Then run the original main.py:
from tqec import Basis, NoiseModel, compile_block_graph
from tqec.gallery.cnot import cnot
block_graph = cnot(Basis.Z)
correlation_surfaces = block_graph.find_correlation_surfaces()
compiled_computation = compile_block_graph(
block_graph, observables=[correlation_surfaces[1]]
)
circuit = compiled_computation.generate_stim_circuit(
k=2,
noise_model=NoiseModel.uniform_depolarizing(0.001),
)
by using the following
LINE_PROFILE=1 python main.py
This should output a message like
Timer unit: 1e-09 s
5.81 seconds - /.../tqec/src/tqec/circuit/moment.py:335 - with_mapped_qubit_indices
Wrote profile results to profile_output.txt
Wrote profile results to profile_output_2025-03-18T165611.txt
Wrote profile results to profile_output.lprof
To view details run:
python -m line_profiler -rtmz profile_output.lprof
Share here the profile_output.txt file (you can remove information from the paths if you do not want your name to appear here).
As a reference, here is what I get:
Timer unit: 1e-09 s
Total time: 4.72901 s
File: /workspaces/tqec/src/tqec/circuit/schedule/manipulation.py
Function: merge_scheduled_circuits at line 244
Line # Hits Time Per Hit % Time Line Contents
==============================================================
244 @profile
245 def merge_scheduled_circuits(
246 circuits: list[ScheduledCircuit],
247 global_qubit_map: QubitMap,
248 mergeable_instructions: Iterable[str] = (),
249 ) -> ScheduledCircuit:
250 """Merge several :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit`
251 instances into one instance.
252
253 This function takes several **compatible** scheduled circuits as input and
254 merge them, respecting their schedules, into a unique
255 :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit` instance that will
256 then be returned to the caller.
257
258 The provided circuits should be compatible between each other. Compatible
259 circuits are circuits that can all be described with a unique global qubit
260 map. In other words, if two circuits from the list of compatible circuits
261 use the same qubit index, that should mean that they use the same qubit.
262 You can obtain compatible circuits by using
263 :func:`relabel_circuits_qubit_indices`.
264
265 Args:
266 circuits: **compatible** circuits to merge.
267 qubit_map: global qubit map for all the provided ``circuits``.
268 mergeable_instructions: a list of instruction names that are considered
269 mergeable. Duplicate instructions with a name in this list will be
270 merged into a single instruction.
271
272 Returns:
273 a circuit representing the merged scheduled circuits given as input.
274 """
275 1449 38175632.0 26346.2 0.8 scheduled_circuits = _ScheduledCircuits(circuits, global_qubit_map)
276
277 1449 296846.0 204.9 0.0 all_moments: list[Moment] = []
278 1449 5727508.0 3952.7 0.1 all_schedules = Schedule()
279 54222 50714062.0 935.3 1.1 global_i2q = QubitMap({i: q for q, i in scheduled_circuits.q2i.items()})
280
281 10143 47266490.0 4660.0 1.0 while scheduled_circuits.has_pending_moment():
282 8694 341585799.0 39289.8 7.2 schedule, moments = scheduled_circuits.collect_moments_at_minimum_schedule()
283 # Flatten the moments into a list of operations to perform some modifications
284 17388 888520376.0 51099.6 18.8 instructions: list[stim.CircuitInstruction] = sum(
285 8694 3587129.0 412.6 0.1 (list(moment.instructions) for moment in moments), start=[]
286 )
287 # Avoid duplicated operations. Any operation that have the Plaquette.get_mergeable_tag() tag
288 # is considered mergeable, and can be removed if another operation in the list
289 # is considered equal (and has the mergeable tag).
290 17388 1543090526.0 88744.6 32.6 deduplicated_instructions = remove_duplicate_instructions(
291 8694 1256366.0 144.5 0.0 instructions,
292 8694 4002590.0 460.4 0.1 mergeable_instruction_names=frozenset(mergeable_instructions),
293 )
294 8694 269700961.0 31021.5 5.7 merged_instructions = merge_instructions(deduplicated_instructions)
295 8694 8874976.0 1020.8 0.2 circuit = stim.Circuit()
296 20286 7542743.0 371.8 0.2 for inst in merged_instructions:
297 23184 148588707.0 6409.1 3.1 circuit.append(
298 11592 6126590.0 528.5 0.1 inst.name,
299 11592 319458009.0 27558.5 6.8 sum(_sort_target_groups(inst.target_groups()), start=[]),
300 11592 6857295.0 591.6 0.1 inst.gate_args_copy(),
301 )
302 8694 969022955.0 111458.8 20.5 all_moments.append(Moment(circuit))
303 8694 35410726.0 4073.0 0.7 all_schedules.append(schedule)
304
305 1449 33204071.0 22915.2 0.7 return ScheduledCircuit(all_moments, all_schedules, global_i2q, _avoid_checks=True)
Total time: 5.77779 s
File: /workspaces/tqec/src/tqec/circuit/moment.py
Function: with_mapped_qubit_indices at line 335
Line # Hits Time Per Hit % Time Line Contents
==============================================================
335 @profile
336 def with_mapped_qubit_indices(self, qubit_index_map: dict[int, int]) -> Moment:
337 """Map the qubits **indices** the :class:`Moment` instance is applied
338 on.
339
340 Note:
341 This method has to iterate over all the instructions in ``self`` and
342 change the gate target they are applied on.
343
344 Args:
345 qubit_index_map: the map used to modify the qubit targets.
346
347 Returns:
348 a modified copy of ``self`` with the qubit gate targets mapped according
349 to the provided ``qubit_index_map``.
350 """
351 135756 155407303.0 1144.8 2.7 circuit = stim.Circuit()
352 276974 1095316149.0 3954.6 19.0 for instr in self.instructions:
353 141218 27182953.0 192.5 0.5 mapped_targets: list[stim.GateTarget] = []
354 576297 312885511.0 542.9 5.4 for target in instr.targets_copy():
355 # Non qubit targets are left untouched.
356 435079 198306040.0 455.8 3.4 if not target.is_qubit_target:
357 mapped_targets.append(target)
358 continue
359 # Qubit targets are mapped using `qubit_index_map`
360 435079 341286586.0 784.4 5.9 target_qubit = cast(int, target.qubit_value)
361 870158 157767122.0 181.3 2.7 mapped_targets.append(
362 435079 1419714503.0 3263.1 24.6 stim.GateTarget(qubit_index_map[target_qubit])
363 435079 193712045.0 445.2 3.4 if not target.is_inverted_result_target
364 else stim.GateTarget(-qubit_index_map[target_qubit])
365 )
366 141218 1370112174.0 9702.1 23.7 circuit.append(instr.name, mapped_targets, instr.gate_args_copy())
367 271512 284257189.0 1046.9 4.9 return Moment(
368 135756 20835792.0 153.5 0.4 circuit,
369 570835 183084783.0 320.7 3.2 used_qubits={qubit_index_map[q] for q in self._used_qubits},
370 135756 17916959.0 132.0 0.3 _avoid_checks=True,
371 )
4.73 seconds - /workspaces/tqec/src/tqec/circuit/schedule/manipulation.py:244 - merge_scheduled_circuits
5.78 seconds - /workspaces/tqec/src/tqec/circuit/moment.py:335 - with_mapped_qubit_indices
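To compare reports between machines without reading them line by line, the hottest lines can be extracted from the text output. The parser below is a rough sketch that assumes the column layout shown in the reports of this thread (line number, hits, time, per hit, % time, source); hottest_lines and ROW are names made up for the example, not part of line_profiler.

```python
import re

# One profiled row: line number, hits, total time, time per hit, % time, source.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+([\d.]+(?:e[+-]?\d+)?)\s+([\d.]+(?:e[+-]?\d+)?)"
    r"\s+([\d.]+)\s+(.*)$"
)


def hottest_lines(report: str, top: int = 3) -> list[tuple[float, int, str]]:
    """Return up to ``top`` (percent, line number, source) rows, hottest first."""
    rows = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            lineno, _hits, _time, _per_hit, percent, source = match.groups()
            rows.append((float(percent), int(lineno), source.strip()))
    return sorted(rows, reverse=True)[:top]


# A few rows copied from the merge_scheduled_circuits report above.
sample = """\
   275      1449   38175632.0  26346.2      0.8      scheduled_circuits = _ScheduledCircuits(circuits, global_qubit_map)
   284     17388  888520376.0  51099.6     18.8          instructions: list[stim.CircuitInstruction] = sum(
   290     17388 1543090526.0  88744.6     32.6          deduplicated_instructions = remove_duplicate_instructions(
   302      8694  969022955.0 111458.8     20.5          all_moments.append(Moment(circuit))
"""
for percent, lineno, source in hottest_lines(sample):
    print(f"{percent:5.1f}%  line {lineno}: {source}")
```

Lines that were never hit have no timing columns and are skipped, as are docstring and header lines.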
The command LINE_PROFILE=1 python main.py takes ~10 mins to complete and outputs neither a message nor a profile_output.txt file. Here's my git diff on main.
diff --git a/src/tqec/circuit/moment.py b/src/tqec/circuit/moment.py
index 9f162866..4b5ccbf1 100644
--- a/src/tqec/circuit/moment.py
+++ b/src/tqec/circuit/moment.py
@@ -18,6 +18,7 @@ from tqec.circuit.qubit import count_qubit_accesses, get_used_qubit_indices
from tqec.utils.exceptions import TQECException
from tqec.utils.instructions import is_annotation_instruction
+from line_profiler import profile
class Moment:
"""A collection of instructions that can be executed in parallel.
@@ -330,7 +331,8 @@ class Moment:
used_qubits=self._used_qubits,
_avoid_checks=True,
)
-
+
+ @profile
def with_mapped_qubit_indices(self, qubit_index_map: dict[int, int]) -> Moment:
"""Map the qubits **indices** the :class:`Moment` instance is applied
on.
diff --git a/src/tqec/circuit/schedule/manipulation.py b/src/tqec/circuit/schedule/manipulation.py
index c8795415..1ede3463 100644
--- a/src/tqec/circuit/schedule/manipulation.py
+++ b/src/tqec/circuit/schedule/manipulation.py
@@ -30,6 +30,8 @@ from tqec.circuit.schedule.circuit import ScheduledCircuit
from tqec.circuit.schedule.schedule import Schedule
from tqec.utils.exceptions import TQECException, TQECWarning
+from line_profiler import profile
+
class _ScheduledCircuits:
def __init__(
@@ -239,7 +241,7 @@ def merge_instructions(
for (name, args), targets in instructions_merger.items()
]
-
+@profile
def merge_scheduled_circuits(
circuits: list[ScheduledCircuit],
global_qubit_map: QubitMap,
I can try to debug, but I've never used the line_profiler library, so it may save some time if you take a look first.
Maybe to check:
import os
from tqec import Basis, NoiseModel, compile_block_graph
from tqec.gallery.cnot import cnot
ENV_VAR_NAME = "LINE_PROFILE"
print(f"{ENV_VAR_NAME}:", os.environ.get(ENV_VAR_NAME, "<not set>"))
block_graph = cnot(Basis.Z)
correlation_surfaces = block_graph.find_correlation_surfaces()
compiled_computation = compile_block_graph(
block_graph, observables=[correlation_surfaces[1]]
)
circuit = compiled_computation.generate_stim_circuit(
k=2,
noise_model=NoiseModel.uniform_depolarizing(0.001),
)
If it does not print "LINE_PROFILE: 1" but prints "LINE_PROFILE: <not set>", then the environment variable is not reaching the Python process.
I did not have any problem following Adrien's instructions when benchmarking on my M1 Pro laptop. Here's the output:
Timer unit: 1e-09 s
Total time: 60.1848 s
File: /Users/inm/open-source-project/tqec/src/tqec/circuit/schedule/manipulation.py
Function: merge_scheduled_circuits at line 243
Line # Hits Time Per Hit % Time Line Contents
==============================================================
243 @profile
244 def merge_scheduled_circuits(
245 circuits: list[ScheduledCircuit],
246 global_qubit_map: QubitMap,
247 mergeable_instructions: Iterable[str] = (),
248 ) -> ScheduledCircuit:
249 """Merge several :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit`
250 instances into one instance.
251
252 This function takes several **compatible** scheduled circuits as input and
253 merge them, respecting their schedules, into a unique
254 :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit` instance that will
255 then be returned to the caller.
256
257 The provided circuits should be compatible between each other. Compatible
258 circuits are circuits that can all be described with a unique global qubit
259 map. In other words, if two circuits from the list of compatible circuits
260 use the same qubit index, that should mean that they use the same qubit.
261 You can obtain compatible circuits by using
262 :func:`relabel_circuits_qubit_indices`.
263
264 Args:
265 circuits: **compatible** circuits to merge.
266 qubit_map: global qubit map for all the provided ``circuits``.
267 mergeable_instructions: a list of instruction names that are considered
268 mergeable. Duplicate instructions with a name in this list will be
269 merged into a single instruction.
270
271 Returns:
272 a circuit representing the merged scheduled circuits given as input.
273 """
274 1449 21752000.0 15011.7 0.0 scheduled_circuits = _ScheduledCircuits(circuits, global_qubit_map)
275
276 1449 169000.0 116.6 0.0 all_moments: list[Moment] = []
277 1449 4528000.0 3124.9 0.0 all_schedules = Schedule()
278 54222 33353000.0 615.1 0.1 global_i2q = QubitMap({i: q for q, i in scheduled_circuits.q2i.items()})
279
280 10143 25087000.0 2473.3 0.0 while scheduled_circuits.has_pending_moment():
281 8694 203874000.0 23450.0 0.3 schedule, moments = scheduled_circuits.collect_moments_at_minimum_schedule()
282 # Flatten the moments into a list of operations to perform some modifications
283 17388 4e+10 2e+06 66.3 instructions: list[stim.CircuitInstruction] = sum(
284 8694 1900000.0 218.5 0.0 (list(moment.instructions) for moment in moments), start=[]
285 )
286 # Avoid duplicated operations. Any operation that have the Plaquette.get_mergeable_tag() tag
287 # is considered mergeable, and can be removed if another operation in the list
288 # is considered equal (and has the mergeable tag).
289 17388 1e+10 582507.5 16.8 deduplicated_instructions = remove_duplicate_instructions(
290 8694 892000.0 102.6 0.0 instructions,
291 8694 2752000.0 316.5 0.0 mergeable_instruction_names=frozenset(mergeable_instructions),
292 )
293 8694 162534000.0 18695.0 0.3 merged_instructions = merge_instructions(deduplicated_instructions)
294 8694 4210000.0 484.2 0.0 circuit = stim.Circuit()
295 20286 4753000.0 234.3 0.0 for inst in merged_instructions:
296 23184 796792000.0 34368.2 1.3 circuit.append(
297 11592 3253000.0 280.6 0.0 inst.name,
298 11592 178081000.0 15362.4 0.3 sum(_sort_target_groups(inst.target_groups()), start=[]),
299 11592 3410000.0 294.2 0.0 inst.gate_args_copy(),
300 )
301 8694 8691765000.0 999742.9 14.4 all_moments.append(Moment(circuit))
302 8694 19745000.0 2271.1 0.0 all_schedules.append(schedule)
303
304 1449 22587000.0 15588.0 0.0 return ScheduledCircuit(all_moments, all_schedules, global_i2q, _avoid_checks=True)
Total time: 61.4382 s
File: /Users/inm/open-source-project/tqec/src/tqec/circuit/moment.py
Function: with_mapped_qubit_indices at line 335
Line # Hits Time Per Hit % Time Line Contents
==============================================================
335 @profile
336 def with_mapped_qubit_indices(self, qubit_index_map: dict[int, int]) -> Moment:
337 """Map the qubits **indices** the :class:`Moment` instance is applied
338 on.
339
340 Note:
341 This method has to iterate over all the instructions in ``self`` and
342 change the gate target they are applied on.
343
344 Args:
345 qubit_index_map: the map used to modify the qubit targets.
346
347 Returns:
348 a modified copy of ``self`` with the qubit gate targets mapped according
349 to the provided ``qubit_index_map``.
350 """
351 135756 82302000.0 606.2 0.1 circuit = stim.Circuit()
352 276974 4e+10 155142.2 69.9 for instr in self.instructions:
353 141218 16041000.0 113.6 0.0 mapped_targets: list[stim.GateTarget] = []
354 576297 203630000.0 353.3 0.3 for target in instr.targets_copy():
355 # Non qubit targets are left untouched.
356 435079 103588000.0 238.1 0.2 if not target.is_qubit_target:
357 mapped_targets.append(target)
358 continue
359 # Qubit targets are mapped using `qubit_index_map`
360 435079 183755000.0 422.3 0.3 target_qubit = cast(int, target.qubit_value)
361 870158 107407000.0 123.4 0.2 mapped_targets.append(
362 435079 8025480000.0 18446.0 13.1 stim.GateTarget(qubit_index_map[target_qubit])
363 435079 106704000.0 245.3 0.2 if not target.is_inverted_result_target
364 else stim.GateTarget(-qubit_index_map[target_qubit])
365 )
366 141218 9348318000.0 66197.8 15.2 circuit.append(instr.name, mapped_targets, instr.gate_args_copy())
367 271512 150417000.0 554.0 0.2 return Moment(
368 135756 12844000.0 94.6 0.0 circuit,
369 570835 114218000.0 200.1 0.2 used_qubits={qubit_index_map[q] for q in self._used_qubits},
370 135756 13182000.0 97.1 0.0 _avoid_checks=True,
371 )
60.18 seconds - /Users/inm/open-source-project/tqec/src/tqec/circuit/schedule/manipulation.py:243 - merge_scheduled_circuits
61.44 seconds - /Users/inm/open-source-project/tqec/src/tqec/circuit/moment.py:335 - with_mapped_qubit_indices
From @inmzhang's benchmarks, it seems like the Moment.instructions lines are more costly on MacOS than on GNU/Linux. The code I shared should replicate that behaviour, but fails to do so. Let's get even closer to the actual code with a new benchmark:
import time
from typing import Iterator
import stim
class FakeMoment:
def __init__(self, circuit: stim.Circuit) -> None:
self._circuit = circuit
@property
def instructions(self) -> Iterator[stim.CircuitInstruction]:
yield from self._circuit
for rounds in [10, 100, 1000, 10000, 100000]:
circuit = stim.Circuit.generated(
"surface_code:rotated_memory_z", distance=11, rounds=rounds
).flattened()
moment = FakeMoment(circuit)
start = time.time_ns()
instructions_count = sum(1 for _ in moment.instructions)
end = time.time_ns()
print(
f"{rounds:>6} rounds, {instructions_count:>8} instructions done "
f"in {(end - start) / 10**6:.2f}ms."
)
For reference, on my computer and on the main branch:
10 rounds, 1584 instructions done in 1.24ms.
100 rounds, 13644 instructions done in 9.45ms.
1000 rounds, 134244 instructions done in 90.41ms.
10000 rounds, 1340244 instructions done in 884.91ms.
100000 rounds, 13400244 instructions done in 8936.75ms.
@nelimee, I left my laptop at the office and will update the benchmark tomorrow.
On the M1Pro Mac:
10 rounds, 1584 instructions done in 1.02ms.
100 rounds, 13644 instructions done in 7.75ms.
1000 rounds, 134244 instructions done in 76.34ms.
10000 rounds, 1340244 instructions done in 755.06ms.
100000 rounds, 13400244 instructions done in 7646.37ms.
So to summarise:
- Both line_profiler and pyinstrument show that Moment.instructions takes a large portion of the time on MacOS,
- None of my replication attempts were able to reproduce the issue.
Let's try one more profiler:
python -m cProfile -o benchmark.cprofile main.py
and share the benchmark.cprofile file please.
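Once the file exists, it can also be inspected from Python with the standard pstats module instead of the interactive browser. This is a small sketch; format_top_functions is a name made up for the example.

```python
import io
import pstats


def format_top_functions(profile_path: str, count: int = 20) -> str:
    """Return the ``count`` entries with the largest internal time as text."""
    buffer = io.StringIO()
    stats = pstats.Stats(profile_path, stream=buffer)
    # "tottime" sorts by time spent in the function itself, excluding callees.
    stats.sort_stats("tottime").print_stats(count)
    return buffer.getvalue()


# Usage, assuming benchmark.cprofile was produced by the command above:
# print(format_top_functions("benchmark.cprofile"))
```

Sorting by tottime rather than cumtime is a choice made here to surface functions that are slow in themselves instead of functions that merely call slow code.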
See https://drive.google.com/file/d/1k__NFlI1kDSV-wf9Gcf3ipnP85PUW6me/view?usp=drive_link.
Thanks a lot for the quick answer! Let's summarise.
pyinstrument
Note that the following screenshots do not show all the places where Moment.instructions is used, but show enough to highlight the issue.
From @KabirDubey with a M3 Apple chip on MacOS:
From myself on a Ryzen 9 5950X on Archlinux:
Conclusion: the relative time taken by Moment.instructions is one order of magnitude higher on Kabir's laptop, which hints at an issue. Note that the absolute time is not relevant here, because the benchmark settings are very different. It is expected that an M3 chip is slower (because it is a laptop chip that is optimised for energy consumption), but it should be slower everywhere, not just on one part.
line_profiler
This is less visible but the lines involving Moment.instructions take:
- 60% to 70% of the benchmarked function time on Yiming's laptop (M1 Pro),
- ~20% of the benchmarked function time on my computer.
So even though this is not an order of magnitude, there is still a large discrepancy.
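The discrepancy can be quantified from the "% Time" column of the two line_profiler reports posted earlier. The sketch below only uses the percentages of the two lines that iterate over Moment.instructions (the sum(...) line in merge_scheduled_circuits and the for loop in with_mapped_qubit_indices); the dictionary layout is arbitrary.

```python
# "% Time" values copied from the line_profiler reports posted in this thread.
share = {
    "merge_scheduled_circuits": {"linux": 18.8, "m1_pro": 66.3},
    "with_mapped_qubit_indices": {"linux": 19.0, "m1_pro": 69.9},
}

factors = {
    function: percent["m1_pro"] / percent["linux"]
    for function, percent in share.items()
}
for function, factor in factors.items():
    print(f"{function}: relative share is x{factor:.1f} higher on the M1 Pro")
```

Both functions give a factor of about 3.5, so the relative weight of these lines is several times higher on the M1 Pro than on the Linux machine.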
cProfile
Profiling the exact same main.py with
>>> python -m cProfile -o benchmark.cprofile main.py
and analysing the results with
$ python -m pstats benchmark.cprofile
Welcome to the profile statistics browser.
benchmark.cprofile% sort tottime
benchmark.cprofile% stats 20
outputs the following for Yiming's (M1 Pro) laptop:
Thu Mar 20 12:52:21 2025 Yiming_M1Pro.cprofile
35610013 function calls (35274253 primitive calls) in 112.453 seconds
Ordered by: internal time
List reduced from 8098 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
534414 83.650 0.000 83.650 0.000 tqec/src/tqec/circuit/moment.py:270(instructions)
21630/1344 7.295 0.000 0.429 0.000 tqec/src/tqec/circuit/qubit.py:107(count_qubit_accesses)
41916 5.757 0.000 5.760 0.000 tqec/src/tqec/circuit/moment.py:122(<genexpr>)
18681 1.832 0.000 1.833 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:201(<genexpr>)
49214 0.964 0.000 0.979 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:75(to_stim_pauli_string)
10059 0.916 0.000 0.917 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:287(<genexpr>)
102 0.785 0.008 0.788 0.008 {built-in method _imp.create_dynamic}
2043366 0.704 0.000 1.003 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:127(anticommutes)
7376816 0.543 0.000 0.543 0.000 {method 'keys' of 'dict' objects}
20118 0.472 0.000 0.474 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:115(<genexpr>)
1437 0.460 0.000 0.469 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:165(has_only_reset_or_is_virtual)
7185 0.459 0.000 0.463 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:162(<genexpr>)
7185 0.459 0.000 0.460 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:127(<genexpr>)
1437 0.458 0.000 0.462 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/utils.py:130(has_only_measurement_or_is_virtual)
1341/0 0.384 0.000 0.000 tqec/src/tqec/circuit/moment.py:334(with_mapped_qubit_indices)
49214 0.376 0.000 0.489 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:61(from_stim_pauli_string)
68774 0.324 0.000 1.434 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:135(collapse_by)
894240 0.293 0.000 0.423 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:177(overlaps)
186112 0.229 0.000 0.351 0.000 tqec/.venv/lib/python3.12/site-packages/tqecd/pauli.py:33(__init__)
1372 0.212 0.000 0.212 0.000 tqec/src/tqec/circuit/moment.py:153(<genexpr>)
and the following on my computer:
Thu Mar 20 09:03:13 2025 cprofile.txt
35542180 function calls (35208252 primitive calls) in 16.457 seconds
Ordered by: internal time
List reduced from 7959 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
534414 1.606 0.000 1.606 0.000 /workspaces/tqec/src/tqec/circuit/moment.py:270(instructions)
2043538 1.257 0.000 1.862 0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:127(anticommutes)
7377154 1.099 0.000 1.099 0.000 {method 'keys' of 'dict' objects}
21630/1344 0.777 0.000 0.014 0.000 /workspaces/tqec/src/tqec/circuit/qubit.py:107(count_qubit_accesses)
49214 0.699 0.000 0.915 0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:61(from_stim_pauli_string)
894240 0.531 0.000 0.795 0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:177(overlaps)
68774 0.474 0.000 2.426 0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:135(collapse_by)
1341/0 0.429 0.000 0.000 /workspaces/tqec/src/tqec/circuit/moment.py:334(with_mapped_qubit_indices)
186112 0.408 0.000 0.647 0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:33(__init__)
1293686/1215862 0.266 0.000 0.281 0.000 {built-in method builtins.len}
49214 0.263 0.000 0.293 0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:75(to_stim_pauli_string)
1347765 0.256 0.000 0.459 0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:48(qubits)
1230621 0.225 0.000 0.225 0.000 {method 'append' of 'list' objects}
74439 0.224 0.000 0.695 0.000 /workspaces/tqec/src/tqec/circuit/schedule/circuit.py:29(__init__)
1205691 0.206 0.000 1.294 0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/pauli.py:123(commutes)
1119357 0.203 0.000 0.324 0.000 /workspaces/tqec/src/tqec/circuit/qubit.py:62(__hash__)
216608/216605 0.198 0.000 0.525 0.000 {built-in method builtins.sorted}
1495 0.181 0.000 0.181 0.000 {method 'read' of '_io.BufferedReader' objects}
51832 0.174 0.000 1.262 0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/boundary.py:13(__init__)
889679 0.170 0.000 0.943 0.000 /home/vscode/.local/lib/python3.12/site-packages/tqecd/boundary.py:48(<genexpr>)
This confirms what the other two profiling approaches led us to think: Moment.instructions is one of the major contributors to the performance issues on MacOS.
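To put numbers on that conclusion, the tottime figures for moment.py:270(instructions) in the two tables above can be converted into a share of the total profiled time and a cost per call. The sketch below only re-uses values copied from those tables; the dictionary keys and variable names are illustrative.

```python
# Values copied from the two pstats tables above:
# tottime of moment.py:270(instructions), total profiled time, and ncalls.
profiles = {
    "M1 Pro (MacOS)": {"instructions_s": 83.650, "total_s": 112.453, "ncalls": 534414},
    "my computer (Linux)": {"instructions_s": 1.606, "total_s": 16.457, "ncalls": 534414},
}

shares = {}
per_call_us = {}
for name, p in profiles.items():
    # Fraction of the whole run spent inside Moment.instructions itself.
    shares[name] = p["instructions_s"] / p["total_s"]
    # Average internal time of one call, in microseconds.
    per_call_us[name] = p["instructions_s"] / p["ncalls"] * 1e6
    print(f"{name}: {shares[name]:.1%} of total, {per_call_us[name]:.1f} us per call")

# Same number of calls on both machines, so this is also the per-call ratio.
slowdown = per_call_us["M1 Pro (MacOS)"] / per_call_us["my computer (Linux)"]
print(f"per-call slowdown on the M1 Pro: x{slowdown:.0f}")
```

With these inputs, Moment.instructions accounts for roughly three quarters of the run on the M1 Pro against about a tenth on the Linux machine, for the same 534414 calls.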
Replication
Now, the goal is to replicate that performance issue with a small, analysable piece of code. My last two attempts failed to do so, showing similar results on MacOS and on my computer. Does anyone have an idea how to replicate it? I am open to ideas :)
I can take a closer look and compare between the Mac/Linux machine I have, but I can only do that next week.
@nelimee @inmzhang I am also a Mac user. I can perform the benchmarks too, but I wanted to look at the code to see what might be going on.
As you pointed out previously, casting the stim.Circuit into an iterator (or, in one case, a list) seems to be the bottleneck. Taking a glance at the implementation, it is written in C++ and involves a lot of memory operations (understandably, given how important this class is). It may be worth reaching out to the quantumlib team to ask whether any of their testing/optimization has been done on Macs or other ARM processors. Likewise, they may have a recommendation on how to better extract the instructions. Just an idea I had :-)
Taking a glance at the implementation, their implementation is in C++ and involves a lot of memory operations (understandably, given how important this class is).
For reference, because I did not find the correct information directly:
Python "iterable" is defined as
[...] objects of any classes you define with an __iter__() method or with a __getitem__() method that implements sequence semantics.
in the glossary.
It turns out that stim.Circuit does not implement the __iter__ method (see reference and implementation), so the iteration is done using __getitem__ that is implemented with circuit_get_item.
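The __getitem__ fallback described above can be illustrated in plain Python. This is a toy stand-in and not stim's actual code: the class GetItemOnly and its call counter are hypothetical, and the strings merely play the role of instructions.

```python
class GetItemOnly:
    """Toy sequence exposing __getitem__ but no __iter__ (hypothetical stand-in)."""

    def __init__(self, items):
        self._items = list(items)
        self.getitem_calls = 0

    def __getitem__(self, index):
        self.getitem_calls += 1
        # Indexing past the end raises IndexError, which is what ends the iteration.
        return self._items[index]


toy = GetItemOnly(["H 0", "CX 0 1", "M 0 1"])

# No __iter__ is defined, yet the for loop works: Python falls back to
# calling __getitem__ with 0, 1, 2, ... until IndexError is raised.
collected = [item for item in toy]
print(collected, toy.getitem_calls)
```

Each element yielded by the loop therefore costs one __getitem__ call, which is the code path taken when iterating over an object that only implements sequence semantics.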
Also, the stim CI builds on MacOS.
It may be worth reaching out to the quantumlib team to inquire if any of their testing/optimization had been done on Macs or other arm processors.
Why not, but the fact that we are not able to replicate the issue on a small example hints that the issue is at least not entirely due to stim.
@nelimee Following up on this for the sake of completeness. The source of my delay was a missing -e flag.
Results from the `line_profiler` test
Timer unit: 1e-09 s
Total time: 47.714 s
File: /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/schedule/manipulation.py
Function: merge_scheduled_circuits at line 244
Line # Hits Time Per Hit % Time Line Contents
==============================================================
244 @profile
245 def merge_scheduled_circuits(
246 circuits: list[ScheduledCircuit],
247 global_qubit_map: QubitMap,
248 mergeable_instructions: Iterable[str] = (),
249 ) -> ScheduledCircuit:
250 """Merge several :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit`
251 instances into one instance.
252
253 This function takes several **compatible** scheduled circuits as input and
254 merge them, respecting their schedules, into a unique
255 :class:`~tqec.circuit.schedule.circuit.ScheduledCircuit` instance that will
256 then be returned to the caller.
257
258 The provided circuits should be compatible between each other. Compatible
259 circuits are circuits that can all be described with a unique global qubit
260 map. In other words, if two circuits from the list of compatible circuits
261 use the same qubit index, that should mean that they use the same qubit.
262 You can obtain compatible circuits by using
263 :func:`relabel_circuits_qubit_indices`.
264
265 Args:
266 circuits: **compatible** circuits to merge.
267 qubit_map: global qubit map for all the provided ``circuits``.
268 mergeable_instructions: a list of instruction names that are considered
269 mergeable. Duplicate instructions with a name in this list will be
270 merged into a single instruction.
271
272 Returns:
273 a circuit representing the merged scheduled circuits given as input.
274 """
275 1449 25431000.0 17550.7 0.1 scheduled_circuits = _ScheduledCircuits(circuits, global_qubit_map)
276
277 1449 291000.0 200.8 0.0 all_moments: list[Moment] = []
278 1449 4693000.0 3238.8 0.0 all_schedules = Schedule()
279 54222 36757000.0 677.9 0.1 global_i2q = QubitMap({i: q for q, i in scheduled_circuits.q2i.items()})
280
281 10143 34618000.0 3413.0 0.1 while scheduled_circuits.has_pending_moment():
282 8694 237349000.0 27300.3 0.5 schedule, moments = scheduled_circuits.collect_moments_at_minimum_schedule()
283 # Flatten the moments into a list of operations to perform some modifications
284 17388 3e+10 2e+06 64.0 instructions: list[stim.CircuitInstruction] = sum(
285 8694 2723000.0 313.2 0.0 (list(moment.instructions) for moment in moments), start=[]
286 )
287 # Avoid duplicated operations. Any operation that have the Plaquette.get_mergeable_tag() tag
288 # is considered mergeable, and can be removed if another operation in the list
289 # is considered equal (and has the mergeable tag).
290 17388 8728432000.0 501980.2 18.3 deduplicated_instructions = remove_duplicate_instructions(
291 8694 1382000.0 159.0 0.0 instructions,
292 8694 4887000.0 562.1 0.0 mergeable_instruction_names=frozenset(mergeable_instructions),
293 )
294 8694 222691000.0 25614.3 0.5 merged_instructions = merge_instructions(deduplicated_instructions)
295 8694 5720000.0 657.9 0.0 circuit = stim.Circuit()
296 20286 7168000.0 353.3 0.0 for inst in merged_instructions:
297 23184 759537000.0 32761.3 1.6 circuit.append(
298 11592 3960000.0 341.6 0.0 inst.name,
299 11592 219231000.0 18912.3 0.5 sum(_sort_target_groups(inst.target_groups()), start=[]),
300 11592 4471000.0 385.7 0.0 inst.gate_args_copy(),
301 )
302 8694 6837622000.0 786476.0 14.3 all_moments.append(Moment(circuit))
303 8694 28379000.0 3264.2 0.1 all_schedules.append(schedule)
304
305 1449 25137000.0 17347.8 0.1 return ScheduledCircuit(all_moments, all_schedules, global_i2q, _avoid_checks=True)
Total time: 49.9701 s
File: /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/moment.py
Function: with_mapped_qubit_indices at line 346
Line # Hits Time Per Hit % Time Line Contents
==============================================================
346 @profile
347 def with_mapped_qubit_indices(self, qubit_index_map: dict[int, int]) -> Moment:
348 """Map the qubits **indices** the :class:`Moment` instance is applied
349 on.
350
351 Note:
352 This method has to iterate over all the instructions in ``self`` and
353 change the gate target they are applied on.
354
355 Args:
356 qubit_index_map: the map used to modify the qubit targets.
357
358 Returns:
359 a modified copy of ``self`` with the qubit gate targets mapped according
360 to the provided ``qubit_index_map``.
361 """
362 135756 101568000.0 748.2 0.2 circuit = stim.Circuit()
363 276974 3e+10 117635.0 65.2 for instr in self.instructions:
364 141218 21062000.0 149.1 0.0 mapped_targets: list[stim.GateTarget] = []
365 576297 261380000.0 453.6 0.5 for target in instr.targets_copy():
366 # Non qubit targets are left untouched.
367 435079 133660000.0 307.2 0.3 if not target.is_qubit_target:
368 mapped_targets.append(target)
369 continue
370 # Qubit targets are mapped using `qubit_index_map`
371 435079 201052000.0 462.1 0.4 target_qubit = cast(int, target.qubit_value)
372 870158 139957000.0 160.8 0.3 mapped_targets.append(
373 435079 7462386000.0 17151.8 14.9 stim.GateTarget(qubit_index_map[target_qubit])
374 435079 131042000.0 301.2 0.3 if not target.is_inverted_result_target
375 else stim.GateTarget(-qubit_index_map[target_qubit])
376 )
377 141218 8588849000.0 60819.8 17.2 circuit.append(instr.name, mapped_targets, instr.gate_args_copy())
378 271512 179662000.0 661.7 0.4 return Moment(
379 135756 19198000.0 141.4 0.0 circuit,
380 570835 134652000.0 235.9 0.3 used_qubits={qubit_index_map[q] for q in self._used_qubits},
381 135756 13812000.0 101.7 0.0 _avoid_checks=True,
382 )
47.71 seconds - /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/schedule/manipulation.py:244 - merge_scheduled_circuits
49.97 seconds - /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/moment.py:346 - with_mapped_qubit_indices
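A side note on line 284 of the first listing: the generator expression is consumed by sum, so the time spent iterating over moment.instructions is presumably attributed to that line, and the 64% figure says nothing by itself about sum. Independently of that, sum(..., start=[]) builds a new list at every addition, and itertools.chain.from_iterable is a common alternative for flattening. The sketch below uses plain integers instead of stim.CircuitInstruction objects, and both function names are hypothetical, not part of tqec.

```python
from itertools import chain


def flatten_with_sum(groups):
    # Same pattern as line 284: each "+" creates a new, longer list.
    return sum((list(group) for group in groups), start=[])


def flatten_with_chain(groups):
    # Walks every group once and builds a single list at the end.
    return list(chain.from_iterable(groups))


# Stand-in data: each inner list plays the role of one moment's instructions.
groups = [[1, 2], [3], [], [4, 5, 6]]
flat_sum = flatten_with_sum(groups)
flat_chain = flatten_with_chain(groups)
print(flat_sum == flat_chain)
```

Both functions return the same flat list, so swapping one for the other would not change behaviour. Whether it changes the timings reported above would have to be measured.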
I also ran the focused benchmark on the Moment class.
Results from the `FakeMomentClass` benchmark
10 rounds, 1584 instructions done in 1.57ms.
100 rounds, 13644 instructions done in 11.81ms.
1000 rounds, 134244 instructions done in 108.70ms.
10000 rounds, 1340244 instructions done in 1066.46ms.
100000 rounds, 13400244 instructions done in 11669.82ms.
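Dividing each elapsed time above by the number of instructions gives a per-instruction cost, which makes it easier to see whether the iteration degrades as the circuit grows. The sketch below only re-uses the five rows printed above; the variable names are illustrative.

```python
# (rounds, instructions, elapsed milliseconds), copied from the five rows above.
results = [
    (10, 1584, 1.57),
    (100, 13644, 11.81),
    (1000, 134244, 108.70),
    (10000, 1340244, 1066.46),
    (100000, 13400244, 11669.82),
]

per_instruction_us = {}
for rounds, instructions, elapsed_ms in results:
    # Convert milliseconds to microseconds, then average over the instructions.
    per_instruction_us[rounds] = elapsed_ms * 1000 / instructions
    print(f"{rounds:>6} rounds: {per_instruction_us[rounds]:.3f} us per instruction")
```

Every row lands between roughly 0.8 and 1.0 microseconds per instruction, so on this machine the cost of iterating over the FakeMoment instructions grows linearly with the circuit size.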
~~Also tried the same benchmark, with cprofile. See Kabir_M3Air.cprofile in our drive. My file is 17 KB, but Yiming's is 1.2 MB--not sure how to explain this.~~
~~Here are the sorted total times:~~
(.tqec-venv) kabirdubey@Kabirs-MacBook-Air tqec % python -m pstats benchmark.cprofile
Welcome to the profile statistics browser.
benchmark.cprofile% sort tottime
benchmark.cprofile% stats 20
Mon Mar 24 19:27:08 2025 benchmark.cprofile
29780667 function calls (29780654 primitive calls) in 14.799 seconds
Ordered by: internal time
List reduced from 143 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
14889965 11.120 0.000 11.120 0.000 main.py:10(instructions)
14889965 2.957 0.000 14.077 0.000 main.py:21(<genexpr>)
5 0.700 0.140 14.778 2.956 {built-in method builtins.sum}
2 0.018 0.009 0.018 0.009 {built-in method _imp.create_dynamic}
5 0.002 0.000 0.002 0.000 {built-in method stim._stim_polyfill.generated}
1 0.000 0.000 0.000 0.000 {built-in method posix.listdir}
1 0.000 0.000 0.000 0.000 {method 'read' of '_io.BufferedReader' objects}
5 0.000 0.000 0.000 0.000 {built-in method builtins.print}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
16 0.000 0.000 0.000 0.000 {built-in method posix.stat}
7 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1593(find_spec)
1 0.000 0.000 0.000 0.000 {built-in method _io.open_code}
27 0.000 0.000 0.000 0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/stim/__init__.py:30(_pytest_pycharm_pybind_repr_bug_workaround)
1 0.000 0.000 0.019 0.019 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/stim/__init__.py:1(<module>)
1 0.000 0.000 0.000 0.000 {built-in method posix.getcwd}
3/1 0.000 0.000 0.019 0.019 <frozen importlib._bootstrap>:1349(_find_and_load)
29 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:126(_path_join)
3 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap>:1240(_find_spec)
1 0.000 0.000 0.000 0.000 {built-in method builtins.__build_class__}
10 0.000 0.000 0.000 0.000 {built-in method time.time_ns}
~~I am working on a local copy of the latest version of main.~~
~~Contents of my main.py file~~
import time
from typing import Iterator

import stim

class FakeMoment:
    def __init__(self, circuit: stim.Circuit) -> None:
        self._circuit = circuit

    @property
    def instructions(self) -> Iterator[stim.CircuitInstruction]:
        yield from self._circuit


for rounds in [10, 100, 1000, 10000, 100000]:
    circuit = stim.Circuit.generated(
        "surface_code:rotated_memory_z", distance=11, rounds=rounds
    ).flattened()
    moment = FakeMoment(circuit)
    start = time.time_ns()
    instructions_count = sum(1 for _ in moment.instructions)
    end = time.time_ns()
    print(
        f"{rounds:>6} rounds, {instructions_count:>8} instructions done "
        f"in {(end - start) / 10**6:.2f}ms."
    )
EDIT: struck through the cProfile benchmark of the FakeMoment class
Also tried the same benchmark, with cprofile. See Kabir_M3Air.cprofile in our drive. My file is 17 KB, but Yiming's 1.2 MB--not sure how to explain this.
That is because you benchmarked the replication test whereas Yiming benchmarked main.py.
My bad, didn't realize main.py was a fixed file. Updated drive with my cprofile file and here are my time stats:
(.tqec-venv) kabirdubey@Kabirs-MacBook-Air tqec % python -m pstats benchmark.cprofile
Welcome to the profile statistics browser.
benchmark.cprofile% sort tottime
benchmark.cprofile% stats 20
Tue Mar 25 16:13:27 2025 benchmark.cprofile
34965312 function calls (34642524 primitive calls) in 84.391 seconds
Ordered by: internal time
List reduced from 7996 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
534414 61.637 0.000 61.637 0.000 /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/moment.py:282(instructions)
21630/1344 5.548 0.000 0.319 0.000 /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/qubit.py:107(count_qubit_accesses)
41916 4.261 0.000 4.264 0.000 /Users/kabirdubey/Projects/dev/tqec/src/tqec/circuit/moment.py:123(<genexpr>)
18681 1.359 0.000 1.361 0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/tqecd/utils.py:201(<genexpr>)
49214 0.864 0.000 0.879 0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/tqecd/pauli.py:75(to_stim_pauli_string)
10059 0.680 0.000 0.680 0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/tqecd/utils.py:287(<genexpr>)
2044067 0.604 0.000 0.865 0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/tqecd/pauli.py:127(anticommutes)
7378213 0.473 0.000 0.473 0.000 {method 'keys' of 'dict' objects}
49214 0.397 0.000 0.507 0.000 /Users/kabirdubey/.pyenv/versions/3.12.5/envs/.tqec-venv/lib/python3.12/site-packages/tqecd/pauli.py:61(from_stim_pauli_string)