qiskit Optimize BarrierBeforeFinalMeasurement pass

Summary

The barrier before final measurement pass previously was working by iterating over the DAGCircuit to find all the barriers and measurements and then evaluating if those operations were at the end of the circuit, or adjacent to only barriers prior to the end of the circuit. However, this was fairly inefficient as it means the performance of the pass scales as a function of the number of gates in the circuit. This commit optimizes the performance as a function by looking at the predecessors of each qubit's output nodes to find final measurements instead of iterating over the entire circuit.

Details and comments

Feb 07 '24 18:02 mtreinish

One or more of the the following people are requested to review this:

@Qiskit/terra-core

Feb 07 '24 18:02 qiskit-bot

Pull Request Test Coverage Report for Build 7819304021

0 of 19 (100.0%) changed or added relevant lines in 1 file are covered.
9 unchanged lines in 3 files lost coverage.
Overall coverage increased (+0.009%) to 89.196%

Files with Coverage Reduction	New Missed Lines	%
qiskit/dagcircuit/dagcircuit.py	1	90.77%
crates/qasm2/src/lex.rs	2	91.44%
crates/qasm2/src/parse.rs	6	97.62%
<!--	Total:	9

Totals
Change from base Build 7818682912:	0.009%
Covered Lines:	58820
Relevant Lines:	65945

💛 - Coveralls

Feb 07 '24 18:02 coveralls

Running this under asv and doing some manual testing show that this is measurably slower than the existing pass. So some more thought will be needed on the approach here. I'm marking this as on hold until I come up with a better solution.

Feb 07 '24 19:02 mtreinish

Hi @mtreinish, what is a good set of examples to experiment with?

I was experimenting earlier today with the following circuit (Bernstein-Vazirani):

nq = 5000
qr = QuantumRegister(nq, 'q')
anc = QuantumRegister(1, 'ancilla')
cr = ClassicalRegister(nq, 'c')
qc = QuantumCircuit(qr, anc, cr)
qc.x(anc[0])
qc.h(anc[0])
qc.h(qr[0:nq])
qc.cx(qr[0:nq], anc[0])
qc.h(qr[0:nq])
qc.barrier()
qc.measure(qr, cr)

qct = BarrierBeforeFinalMeasurements()(qc)

Some quick observations (on my laptop):

By far the biggest chunk of the time is spent on remove_op_node, which I believe would be fixed by #11677.

The proposed change from

ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in set(final_ops)]

to

final_ops_set = set(final_ops)
ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in final_ops_set]

is very helpful.

The collection part below

for candidate_node in dag.named_nodes(*final_op_types):
    is_final_op = True

    for _, child_successors in dag.bfs_successors(candidate_node):
        if any(
            isinstance(suc, DAGOpNode) and suc.name not in final_op_types
            for suc in child_successors
        ):
            is_final_op = False
            break

    if is_final_op:
        final_ops.append(candidate_node)

was actually very fast (compared to the above), with the following being just a tiny bit better

bc = BlockCollector(dag)
bc._collect_from_back = True
bc._setup_in_degrees()
final_ops = bc.collect_matching_block(filter_fn=lambda n: n.op.name in final_op_types)  # UPDATE: more concise
final_ops = final_ops[::-1]  # UPDATE: reverse the ops to match the original order

Unfortunately, the interface for collecting a single block is not really there (my fault), this is why I had to manually set the options and call the setup function.

Feb 08 '24 16:02 alexanderivrii

Sasha: if you want to try out what it would look like with a fixed remove_op_node, try installing Rustworkx from this PR: https://github.com/Qiskit/rustworkx/pull/1083 and then modifying DAGCircuit.remove_op_node to call self._multi_graph.remove_node_retain_edges_by_key(node) (with no key function) and it should just work.

Feb 08 '24 16:02 jakelishman

Hi @mtreinish, what is a good set of examples to experiment with?

I was testing super deep circuits like a depth of ~60k with only final measurements and only 50 qubits or so. I had seen a specific case where this pass was scaling as a function of number of gates very poorly and was trying to fix that. TBH, I didn't look at a profile first (and I should have), I just intuitively (or lack of intuition) jumped to switching form iterating over dag.nodes() to just the output nodes as being the root cause of the slowdown.

Some quick observations (on my laptop):

1. By far the biggest chunk of the time is spent on `remove_op_node`, which I believe would be fixed by [Fix overhead of `DAGCircuit.remove_op_node` #11677](https://github.com/Qiskit/qiskit/issues/11677).

Oh interesting. I guess because we have to remove all the nodes in final_ops_set we end up deleting a lot of individual nodes from the dag. The specific thing I was looking at in #11677 was how the scaling of remove_op_node is poor because it's quadratic as a function to the number of arguments on the instruction. But I guess even with a lot of 2 argument ops in the dag that cost adds up there too.

2. The proposed change from
   ```
   ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in set(final_ops)]
   ```
   to
   ```
   final_ops_set = set(final_ops)
   ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in final_ops_set]
   ```        
   is very helpful.

Nice, yeah I only fixed that out of habit when I saw

3. The collection part below
   ```
   for candidate_node in dag.named_nodes(*final_op_types):
       is_final_op = True
   
       for _, child_successors in dag.bfs_successors(candidate_node):
           if any(
               isinstance(suc, DAGOpNode) and suc.name not in final_op_types
               for suc in child_successors
           ):
               is_final_op = False
               break
   
       if is_final_op:
           final_ops.append(candidate_node)
   ```
     
   was actually very fast (compared to the above),

Yeah this was really surprising to me, I saw that loop over the named_nodes() and thought it would be very poorly scaling because that just does:

for node in dag._multi_graph.nodes():
....

and checks for either barrier or measure in insertion order and then does the bfs_successors() call from each match. So I just assumed that going about it from the output nodes would be more efficient. But the way I wrote this ends up doing the bfs_predecessors() calls than the old search code which ends up being much slower. I was trying to make it more efficient but I'll just defer to your suggestion below.

with the following being just a tiny bit better ``` bc = BlockCollector(dag) bc._collect_from_back = True bc._setup_in_degrees()

   def filter_fn(node):
        return node.op.name in final_op_types
   
   final_ops = bc.collect_matching_block(filter_fn)
   ```          
     
   Unfortunately, the interface for collecting a single block is not really there (my fault), this is why I had to manually set the options and call the setup function.

Cool, I can update the PR to do this instead for the search.

Feb 08 '24 17:02 mtreinish

The circuit that I've used does have a 5000-qubit barrier. Without Jake's PR the time to remove it is about 20 seconds (on my laptop), with Jake's PR it's 0.1 seconds (old code) or 0.5 seconds (with block-collection code).

Jake, I have modified DAGCircuit.remove_op_node to call self._multi_graph.remove_node_retain_edges_by_key(node._node_id), please confirm that this is correct. (Otherwise it complains that DAGOpNode cannot be interpreted as an integer).

The discrepancy of 0.1 vs. 0.5 seconds seems to be related to the order in which nodes are removed (I will try to look a bit deeper into this).

UPDATE: oh, of course, the block collector collects nodes backwards from the end of the block, so the nodes appear in the opposite order; after this is fixed, node removal is also 0.1 seconds.

Feb 09 '24 08:02 alexanderivrii

Jake, I have modified DAGCircuit.remove_op_node to call self._multi_graph.remove_node_retain_edges_by_key(node._node_id), please confirm that this is correct. (Otherwise it complains that DAGOpNode cannot be interpreted as an integer).

Yeah that's right, it does have to be the index.

Feb 09 '24 09:02 jakelishman

qiskit qiskit copied to clipboard

Optimize BarrierBeforeFinalMeasurement pass

Summary

Details and comments

Pull Request Test Coverage Report for Build 7819304021

💛 - Coveralls

qiskit
qiskit copied to clipboard