qiskit
qiskit copied to clipboard
Optimize BarrierBeforeFinalMeasurement pass
Summary
The barrier before final measurement pass previously was working by iterating over the DAGCircuit to find all the barriers and measurements and then evaluating if those operations were at the end of the circuit, or adjacent to only barriers prior to the end of the circuit. However, this was fairly inefficient as it means the performance of the pass scales as a function of the number of gates in the circuit. This commit optimizes the performance as a function by looking at the predecessors of each qubit's output nodes to find final measurements instead of iterating over the entire circuit.
Details and comments
One or more of the the following people are requested to review this:
@Qiskit/terra-core
Pull Request Test Coverage Report for Build 7819304021
- 0 of 19 (100.0%) changed or added relevant lines in 1 file are covered.
- 9 unchanged lines in 3 files lost coverage.
- Overall coverage increased (+0.009%) to 89.196%
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| qiskit/dagcircuit/dagcircuit.py | 1 | 90.77% |
| crates/qasm2/src/lex.rs | 2 | 91.44% |
| crates/qasm2/src/parse.rs | 6 | 97.62% |
| <!-- | Total: | 9 |
| Totals | |
|---|---|
| Change from base Build 7818682912: | 0.009% |
| Covered Lines: | 58820 |
| Relevant Lines: | 65945 |
💛 - Coveralls
Running this under asv and doing some manual testing show that this is measurably slower than the existing pass. So some more thought will be needed on the approach here. I'm marking this as on hold until I come up with a better solution.
Hi @mtreinish, what is a good set of examples to experiment with?
I was experimenting earlier today with the following circuit (Bernstein-Vazirani):
nq = 5000
qr = QuantumRegister(nq, 'q')
anc = QuantumRegister(1, 'ancilla')
cr = ClassicalRegister(nq, 'c')
qc = QuantumCircuit(qr, anc, cr)
qc.x(anc[0])
qc.h(anc[0])
qc.h(qr[0:nq])
qc.cx(qr[0:nq], anc[0])
qc.h(qr[0:nq])
qc.barrier()
qc.measure(qr, cr)
qct = BarrierBeforeFinalMeasurements()(qc)
Some quick observations (on my laptop):
-
By far the biggest chunk of the time is spent on
remove_op_node, which I believe would be fixed by #11677. -
The proposed change from
ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in set(final_ops)]to
final_ops_set = set(final_ops) ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in final_ops_set]is very helpful.
-
The collection part below
for candidate_node in dag.named_nodes(*final_op_types): is_final_op = True for _, child_successors in dag.bfs_successors(candidate_node): if any( isinstance(suc, DAGOpNode) and suc.name not in final_op_types for suc in child_successors ): is_final_op = False break if is_final_op: final_ops.append(candidate_node)was actually very fast (compared to the above), with the following being just a tiny bit better
bc = BlockCollector(dag) bc._collect_from_back = True bc._setup_in_degrees() final_ops = bc.collect_matching_block(filter_fn=lambda n: n.op.name in final_op_types) # UPDATE: more concise final_ops = final_ops[::-1] # UPDATE: reverse the ops to match the original orderUnfortunately, the interface for collecting a single block is not really there (my fault), this is why I had to manually set the options and call the setup function.
Sasha: if you want to try out what it would look like with a fixed remove_op_node, try installing Rustworkx from this PR: https://github.com/Qiskit/rustworkx/pull/1083 and then modifying DAGCircuit.remove_op_node to call self._multi_graph.remove_node_retain_edges_by_key(node) (with no key function) and it should just work.
Hi @mtreinish, what is a good set of examples to experiment with?
I was testing super deep circuits like a depth of ~60k with only final measurements and only 50 qubits or so. I had seen a specific case where this pass was scaling as a function of number of gates very poorly and was trying to fix that. TBH, I didn't look at a profile first (and I should have), I just intuitively (or lack of intuition) jumped to switching form iterating over dag.nodes() to just the output nodes as being the root cause of the slowdown.
Some quick observations (on my laptop):
1. By far the biggest chunk of the time is spent on `remove_op_node`, which I believe would be fixed by [Fix overhead of `DAGCircuit.remove_op_node` #11677](https://github.com/Qiskit/qiskit/issues/11677).
Oh interesting. I guess because we have to remove all the nodes in final_ops_set we end up deleting a lot of individual nodes from the dag. The specific thing I was looking at in #11677 was how the scaling of remove_op_node is poor because it's quadratic as a function to the number of arguments on the instruction. But I guess even with a lot of 2 argument ops in the dag that cost adds up there too.
2. The proposed change from ``` ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in set(final_ops)] ``` to ``` final_ops_set = set(final_ops) ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in final_ops_set] ``` is very helpful.
Nice, yeah I only fixed that out of habit when I saw
3. The collection part below ``` for candidate_node in dag.named_nodes(*final_op_types): is_final_op = True for _, child_successors in dag.bfs_successors(candidate_node): if any( isinstance(suc, DAGOpNode) and suc.name not in final_op_types for suc in child_successors ): is_final_op = False break if is_final_op: final_ops.append(candidate_node) ``` was actually very fast (compared to the above),
Yeah this was really surprising to me, I saw that loop over the named_nodes() and thought it would be very poorly scaling because that just does:
for node in dag._multi_graph.nodes():
....
and checks for either barrier or measure in insertion order and then does the bfs_successors() call from each match. So I just assumed that going about it from the output nodes would be more efficient. But the way I wrote this ends up doing the bfs_predecessors() calls than the old search code which ends up being much slower. I was trying to make it more efficient but I'll just defer to your suggestion below.
with the following being just a tiny bit better ``` bc = BlockCollector(dag) bc._collect_from_back = True bc._setup_in_degrees()
def filter_fn(node): return node.op.name in final_op_types final_ops = bc.collect_matching_block(filter_fn) ``` Unfortunately, the interface for collecting a single block is not really there (my fault), this is why I had to manually set the options and call the setup function.
Cool, I can update the PR to do this instead for the search.
The circuit that I've used does have a 5000-qubit barrier. Without Jake's PR the time to remove it is about 20 seconds (on my laptop), with Jake's PR it's 0.1 seconds (old code) or 0.5 seconds (with block-collection code).
Jake, I have modified DAGCircuit.remove_op_node to call self._multi_graph.remove_node_retain_edges_by_key(node._node_id), please confirm that this is correct. (Otherwise it complains that DAGOpNode cannot be interpreted as an integer).
The discrepancy of 0.1 vs. 0.5 seconds seems to be related to the order in which nodes are removed (I will try to look a bit deeper into this).
UPDATE: oh, of course, the block collector collects nodes backwards from the end of the block, so the nodes appear in the opposite order; after this is fixed, node removal is also 0.1 seconds.
Jake, I have modified
DAGCircuit.remove_op_nodeto callself._multi_graph.remove_node_retain_edges_by_key(node._node_id), please confirm that this is correct. (Otherwise it complains thatDAGOpNodecannot be interpreted as an integer).
Yeah that's right, it does have to be the index.