
Switch `convert_2q_block_matrix.rs` to use `matmul` from `faer` and an explicit version of `kron`

Open arnaucasau opened this issue 10 months ago • 11 comments

Part of #12120

This PR switches the `convert_2q_block_matrix.rs` script over to `matmul` from `faer` instead of `dot` from `ndarray` for matrix multiplications. It turns out that for matrices of size 4x4 or smaller, `faer` gives better performance than `ndarray`, as we can see in this comment.

Summary of the changes in this PR:

  • An explicit/unrolled version of the 2x2 `kron` that improves the performance of the `kron` calls
  • `matmul()` from `faer` in place of `dot()`, improving the performance of those calls
  • A new `common.rs` module containing functions shared between different scripts

In a follow-up, the same change will be made to the `two_qubit_decompose.rs` script, which in the meantime still uses `ndarray` and the old `change_basis` function.
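For illustration, an explicit/unrolled 2x2 Kronecker product could look like the sketch below. It uses plain `f64` entries and a hypothetical signature for brevity; the actual code operates on complex matrices through `ndarray`/`faer` types.

```rust
// Hypothetical sketch of an unrolled 2x2 Kronecker product (real f64 for
// brevity; the PR's version works on complex matrices). Writing out all 16
// entries avoids the generic loop and allocation overhead of a library kron.
fn kron_2x2(a: &[[f64; 2]; 2], b: &[[f64; 2]; 2]) -> [[f64; 4]; 4] {
    [
        [a[0][0] * b[0][0], a[0][0] * b[0][1], a[0][1] * b[0][0], a[0][1] * b[0][1]],
        [a[0][0] * b[1][0], a[0][0] * b[1][1], a[0][1] * b[1][0], a[0][1] * b[1][1]],
        [a[1][0] * b[0][0], a[1][0] * b[0][1], a[1][1] * b[0][0], a[1][1] * b[0][1]],
        [a[1][0] * b[1][0], a[1][0] * b[1][1], a[1][1] * b[1][0], a[1][1] * b[1][1]],
    ]
}

fn main() {
    let a = [[1.0, 2.0], [3.0, 4.0]];
    let b = [[0.0, 5.0], [6.0, 7.0]];
    let k = kron_2x2(&a, &b);
    println!("{:?}", k[0]); // first row of a ⊗ b: [0.0, 5.0, 0.0, 10.0]
}
```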

arnaucasau avatar Apr 16 '24 17:04 arnaucasau

Thank you for opening a new pull request.

Before your PR can be merged it will first need to pass continuous integration tests and be reviewed. Sometimes the review process can be slow, so please be patient.

While you're waiting, please feel free to review other open PRs. While only a subset of people are authorized to approve pull requests for merging, everyone is encouraged to review open pull requests. Doing reviews helps reduce the burden on the core team and helps make the project's code better for everyone.

One or more of the following people are requested to review this:

  • @Eric-Arellano
  • @Qiskit/terra-core
  • @kevinhartman
  • @mtreinish

qiskit-bot avatar Apr 16 '24 17:04 qiskit-bot

Pull Request Test Coverage Report for Build 8986784365

Details

  • 84 of 86 (97.67%) changed or added relevant lines in 3 files are covered.
  • 7 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.02%) to 89.648%

| File with Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| crates/accelerate/src/convert_2q_block_matrix.rs | 31 | 33 | 93.94% |
| Total: | 84 | 86 | |

| File with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| crates/accelerate/src/convert_2q_block_matrix.rs | 1 | 94.0% |
| crates/qasm2/src/lex.rs | 6 | 92.11% |
| Total: | 7 | |

Totals Coverage Status
Change from base Build 8983739322: 0.02%
Covered Lines: 62320
Relevant Lines: 69516

💛 - Coveralls

coveralls avatar Apr 16 '24 18:04 coveralls

Thanks for the review Matthew!

I've run the transpiler tests, but I don't see any notable improvement from modifying only this script. Similar results can be obtained by running the passes.py tests that call ConsolidateBlocks. In that case, some tests show slightly better performance than before, but the improvement isn't consistent across executions.

Based on the benchmarks in the issue, I suspect this PR does improve performance, but perhaps this part of the code only accounts for a small portion of the execution time in the asv benchmarks, making the impact hard to observe consistently?

Results obtained:

Benchmarks that have stayed the same:

| Change   | Before [de4b5a21] <main>   | After [a34927f3] <AC/2q_blocks_faer>   | Ratio   | Benchmark (Parameter)                                                                                         |
|----------|----------------------------|----------------------------------------|---------|---------------------------------------------------------------------------------------------------------------|
|          | 474±5ms                    | 473±4ms                                | 1.00    | transpiler_benchmarks.TranspilerBenchSuite.time_compile_from_large_qasm                                       |
|          | 3.43±0.03ms                | 3.45±0.1ms                             | 1.01    | transpiler_benchmarks.TranspilerBenchSuite.time_cx_compile                                                    |
|          | 4.52±0.1ms                 | 4.44±0.04ms                            | 0.98    | transpiler_benchmarks.TranspilerBenchSuite.time_single_gate_compile                                           |
|          | 1.25±0s                    | 1.24±0s                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(0)                          |
|          | 520±3ms                    | 521±5ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(1)                          |
|          | 1.97±0.02s                 | 2.02±0.01s                             | 1.03    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(2)                          |
|          | 2.53±0.01s                 | 2.56±0.02s                             | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(3)                          |
|          | 157±2ms                    | 161±7ms                                | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(0)                                       |
|          | 136±2ms                    | 135±2ms                                | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_schedule_qv_14_x_14(1)                                       |
|          | 64.8±1ms                   | 64.6±0.4ms                             | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(0)                                 |
|          | 137±1ms                    | 135±3ms                                | 0.98    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(1)                                 |
|          | 276±3ms                    | 275±3ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(2)                                 |
|          | 154±1ms                    | 158±0.7ms                              | 1.03    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(3)                                 |
|          | 68.3±0.8ms                 | 68.0±0.8ms                             | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(0)               |
|          | 141±2ms                    | 142±1ms                                | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(1)               |
|          | 273±2ms                    | 278±2ms                                | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(2)               |
|          | 156±1ms                    | 156±2ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(3)               |
|          | 129±1ms                    | 128±0.8ms                              | 0.99    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(0)                                      |
|          | 109±1ms                    | 111±7ms                                | 1.01    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(1)                                      |
|          | 340±5ms                    | 346±1ms                                | 1.02    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(2)                                      |
|          | 388±7ms                    | 389±7ms                                | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(3)                                      |
|          | 2565                       | 2565                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(0)                   |
|          | 1403                       | 1403                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(1)                   |
|          | 1403                       | 1403                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(2)                   |
|          | 1296                       | 1296                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(3)                   |
|          | 2705                       | 2705                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(0)                          |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(1)                          |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(2)                          |
|          | 7                          | 7                                      | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(3)                          |
|          | 2705                       | 2705                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(0)        |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(1)        |
|          | 2005                       | 2005                                   | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(2)        |
|          | 7                          | 7                                      | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(3)        |
|          | 323                        | 323                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(0)                               |
|          | 336                        | 336                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(1)                               |
|          | 336                        | 336                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(2)                               |
|          | 272                        | 272                                    | 1.00    | transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(3)                               |
|          | 38.6±1ms                   | 38.8±2ms                               | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'dense')         |
|          | 37.1±0.3ms                 | 37.2±0.2ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'sabre')         |
|          | 85.2±0.3ms                 | 86.1±1ms                               | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'stochastic', 'dense')    |
|          | 82.8±0.2ms                 | 83.0±0.3ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'stochastic', 'sabre')    |
|          | 43.8±1ms                   | 44.0±0.7ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'dense')         |
|          | 42.9±0.6ms                 | 42.2±0.2ms                             | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'sabre')         |
|          | 92.9±3ms                   | 90.0±0.8ms                             | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'stochastic', 'dense')    |
|          | 94.5±10ms                  | 88.6±0.7ms                             | 0.94    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'stochastic', 'sabre')    |
|          | 61.1±1ms                   | 60.1±0.5ms                             | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'dense')         |
|          | 58.9±0.3ms                 | 58.7±0.7ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'sabre')         |
|          | 116±9ms                    | 112±0.5ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'stochastic', 'dense')    |
|          | 114±6ms                    | 109±0.5ms                              | 0.95    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'stochastic', 'sabre')    |
|          | 100.0±2ms                  | 99.4±0.7ms                             | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'dense')         |
|          | 108±3ms                    | 105±0.5ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'sabre')         |
|          | 184±8ms                    | 175±1ms                                | 0.95    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'dense')    |
|          | 177±3ms                    | 176±0.8ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'sabre')    |
|          | 55.1±3ms                   | 52.5±0.2ms                             | 0.95    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'dense')         |
|          | 51.8±6ms                   | 50.9±0.5ms                             | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'sabre')         |
|          | 213±4ms                    | 204±2ms                                | 0.96    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'stochastic', 'dense')    |
|          | 209±1ms                    | 208±2ms                                | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'stochastic', 'sabre')    |
|          | 64.5±0.6ms                 | 64.2±0.4ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'dense')         |
|          | 65.3±0.3ms                 | 64.8±0.5ms                             | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'sabre')         |
|          | 224±6ms                    | 217±2ms                                | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'stochastic', 'dense')    |
|          | 228±3ms                    | 219±0.8ms                              | 0.96    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'stochastic', 'sabre')    |
|          | 115±8ms                    | 105±1ms                                | ~0.91   | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'dense')         |
|          | 109±1ms                    | 108±0.9ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'sabre')         |
|          | 298±30ms                   | 272±1ms                                | 0.91    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'dense')    |
|          | 278±9ms                    | 271±2ms                                | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'sabre')    |
|          | 213±3ms                    | 206±0.8ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'dense')         |
|          | 227±10ms                   | 220±0.5ms                              | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'sabre')         |
|          | 571±40ms                   | 554±4ms                                | 0.97    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'dense')    |
|          | 548±10ms                   | 535±2ms                                | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'sabre')    |
|          | 51.1±0.8ms                 | 51.4±1ms                               | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'dense')             |
|          | 47.9±0.4ms                 | 47.5±0.2ms                             | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'sabre')             |
|          | 156±5ms                    | 152±0.7ms                              | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'stochastic', 'dense')        |
|          | 191±8ms                    | 173±0.9ms                              | ~0.91   | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'stochastic', 'sabre')        |
|          | 74.1±10ms                  | 61.2±0.2ms                             | ~0.83   | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'dense')             |
|          | 60.4±0.7ms                 | 60.2±0.3ms                             | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'sabre')             |
|          | 167±2ms                    | 165±0.6ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'stochastic', 'dense')        |
|          | 173±2ms                    | 174±1ms                                | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'stochastic', 'sabre')        |
|          | 135±3ms                    | 134±1ms                                | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'dense')             |
|          | 135±1ms                    | 137±2ms                                | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'sabre')             |
|          | 134±2ms                    | 135±2ms                                | 1.01    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'stochastic', 'dense')        |
|          | 134±0.5ms                  | 134±1ms                                | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'stochastic', 'sabre')        |
|          | 136±0.7ms                  | 135±0.7ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'dense')             |
|          | 137±0.7ms                  | 136±0.4ms                              | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'sabre')             |
|          | 140±2ms                    | 137±2ms                                | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'stochastic', 'dense')        |
|          | 136±1ms                    | 136±0.3ms                              | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'stochastic', 'sabre')        |
|          | 257                        | 257                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'dense')      |
|          | 241                        | 241                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'sabre')      |
|          | 245                        | 245                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'stochastic', 'dense') |
|          | 237                        | 237                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'stochastic', 'sabre') |
|          | 218                        | 218                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'dense')      |
|          | 212                        | 212                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'sabre')      |
|          | 221                        | 221                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'stochastic', 'dense') |
|          | 221                        | 221                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'stochastic', 'sabre') |
|          | 212                        | 212                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'dense')      |
|          | 208                        | 208                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'sabre')      |
|          | 207                        | 207                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'stochastic', 'dense') |
|          | 213                        | 213                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'stochastic', 'sabre') |
|          | 288                        | 283                                    | 0.98    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'dense')      |
|          | 259                        | 263                                    | 1.02    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'sabre')      |
|          | 332                        | 329                                    | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'stochastic', 'dense') |
|          | 304                        | 312                                    | 1.03    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'stochastic', 'sabre') |
|          | 55                         | 55                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'dense')      |
|          | 58                         | 58                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'sabre')      |
|          | 58                         | 58                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'stochastic', 'dense') |
|          | 53                         | 53                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'stochastic', 'sabre') |
|          | 51                         | 51                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'dense')      |
|          | 48                         | 48                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'sabre')      |
|          | 48                         | 48                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'stochastic', 'dense') |
|          | 53                         | 53                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'stochastic', 'sabre') |
|          | 49                         | 49                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'dense')      |
|          | 48                         | 48                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'sabre')      |
|          | 44                         | 44                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'stochastic', 'dense') |
|          | 53                         | 53                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'stochastic', 'sabre') |
|          | 73                         | 73                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'dense')      |
|          | 55                         | 55                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'sabre')      |
|          | 85                         | 84                                     | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'stochastic', 'dense') |
|          | 77                         | 77                                     | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'stochastic', 'sabre') |
|          | 610                        | 610                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'dense')      |
|          | 558                        | 558                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'sabre')      |
|          | 573                        | 573                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'stochastic', 'dense') |
|          | 545                        | 545                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'stochastic', 'sabre') |
|          | 507                        | 507                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'dense')      |
|          | 494                        | 494                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'sabre')      |
|          | 505                        | 505                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'stochastic', 'dense') |
|          | 498                        | 498                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'stochastic', 'sabre') |
|          | 488                        | 488                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'dense')      |
|          | 478                        | 478                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'sabre')      |
|          | 503                        | 503                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'stochastic', 'dense') |
|          | 457                        | 457                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'stochastic', 'sabre') |
|          | 647                        | 647                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'dense')      |
|          | 628                        | 624                                    | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'sabre')      |
|          | 769                        | 772                                    | 1.00    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'stochastic', 'dense') |
|          | 711                        | 705                                    | 0.99    | transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'stochastic', 'sabre') |

arnaucasau avatar Apr 22 '24 16:04 arnaucasau

> Based on the benchmarks in https://github.com/Qiskit/qiskit/issues/12120#issuecomment-2050030794, I suspect this PR does improve performance, but perhaps this part of the code only accounts for a small portion of the execution time in the asv benchmarks, making the impact hard to observe consistently?

Yeah, in general the ConsolidateBlocks pass hasn't been a bottleneck since we ported the linear algebra to Rust (you can see the old tracking issue for this here: https://github.com/Qiskit/qiskit/issues/8779). So making it 50% faster is unlikely to be noticeable in a full transpile at optimization level 3 (and at level 2 after #12149) until other things are faster. This change is more future-facing: once everything else in the transpiler speeds up, this pass will become more of a bottleneck. The impact for two_qubit_basis_decomposer.rs will hopefully be more noticeable, since although the Rust implementation of the linear algebra there was significantly faster, that pass is still more of a bottleneck (though no longer the top one). I was mostly curious to confirm this, though; thanks for running the benchmarks.

mtreinish avatar Apr 23 '24 15:04 mtreinish

Re-running the tests I wrote in the issue, I realized that using std::time::Instant to benchmark each operation might not be the best approach. It turns out that there's a large variability in the elapsed time when running multiple tests at once or one by one. Moreover, the documentation says:

...instants are not guaranteed to be steady. In other words, each tick of the underlying clock might not be the same length (e.g. some seconds may be longer than others)

To double-check the results, I wrote similar tests using the cargo bench command (only available on the nightly toolchain). There, using test::black_box to avoid undesirable optimizations, I obtained the following results:

1 - In this case, the tests compare the kron() function from ndarray, the kron() method on a matrix object from faer, and the kron() function from faer that needs an allocation. The last one seems to be the best.

In this PR, because we are using Option to wrap the results, we would need an auxiliary function, but I'm not sure it is worth changing again to use the kron with allocation, given that there's almost no impact on performance in my tests with ASV (the kron appears in 2 out of 5 cases of the match). It could be interesting for two_qubit_decompose.rs, but let me know what you think.

test tests::benchmark01_ndarray_kron              ... bench:   12,480 ns/iter (+/- 3,439)
test tests::benchmark02_faer_overload_kron        ... bench:   25,293 ns/iter (+/- 3,565)
test tests::benchmark03_faer_kron                 ... bench:    4,399 ns/iter (+/- 671)

2 - These 3 tests compare dot(), the * operator from faer, and matmul(), simulating the case we have in the script modified by this PR, where we need to store the result in one of the matrices used in the product. In this case, using matmul() with the swap technique implemented in df0edac seems to perform much better than the alternatives.

test tests::benchmark04_ndarray_dot               ... bench:   26,438 ns/iter (+/- 5,070)
test tests::benchmark05_faer_overload_matmul      ... bench:   26,455 ns/iter (+/- 4,192)
test tests::benchmark06_faer_matmul               ... bench:    7,231 ns/iter (+/- 2,002)

3 - Finally, we have tests that combine both methods. The tests multiply a 4x4 matrix by another one obtained as the result of a Kronecker product, in a for loop, storing the result in the first 4x4 matrix. Here, the four options are kron() and dot() from ndarray; kron() and matmul() from faer; kron() from ndarray and matmul() from faer; and kron() on a matrix object and * from faer. The best alternative is always to use faer with the extra allocation (uninitialized matrices, not using clone() or zeros()).

test tests::benchmark07_ndarray_kron_dot          ... bench:   70,301 ns/iter (+/- 5,617)
test tests::benchmark08_faer_kron_matmul          ... bench:   15,627 ns/iter (+/- 3,920)
test tests::benchmark09_ndarray_kron_matmul       ... bench:   22,966 ns/iter (+/- 1,890)
test tests::benchmark10_faer_overload_kron_matmul ... bench:   73,290 ns/iter (+/- 2,908)

All the tests were run in a loop of 100 iterations. Running the tests with few iterations, or even without a for loop, gave very inconsistent results on each run, with a large standard deviation compared to the mean. I suspect this could be normal given how small the times of each individual operation are.

It could be worth checking the generated assembly to make sure the compiler is not optimizing the benchmarked code away, but I wanted to give you all an update on the investigation, as I was getting very confusing results with the other method last week.

arnaucasau avatar Apr 29 '24 20:04 arnaucasau

This version should be faster than using a library for kron.

These functions change two things with respect to the library functions:

  • They use an explicit/unrolled version of kron for 2x2 matrices
  • They use the explicit expression that results when one of the matrices is the identity.

A couple of features:

  • There is no arithmetic at all. Just copying values to a matrix.
  • It uses only ndarrays. I think this package provides the standard implementation of heap-allocated multi-dimensional arrays for Rust, so we can't get more minimal than this if we want this data type.

It uses, for example,

// Compute `kron(identity, mat)` for 2x2 matrix inputs
fn kron_id2_oneq(oneq_mat: ArrayView2<Complex64>) -> Array2<Complex64> {
    let _zero = Complex64::new(0., 0.);
    array![
        [oneq_mat[[0,0]], oneq_mat[[0,1]], _zero, _zero],
        [oneq_mat[[1,0]], oneq_mat[[1,1]], _zero, _zero],
        [_zero, _zero, oneq_mat[[0,0]], oneq_mat[[0,1]]],
        [_zero, _zero, oneq_mat[[1,0]], oneq_mat[[1,1]]],
    ]
}

~~This could further be optimized by passing an array to the new kron_... functions. In fact I am usually in favor of~~ ~~writing a non allocating version and then implementing an allocating version that calls this~~.

I added two more functions that do the same, but that take pre-allocated arrays. I kind of doubt this will make a big performance difference.

EDIT: Notice that there was a match arm that appeared to do essentially matrix = identity * matrix. I replaced this with a no-op, (). Tests pass. But we should make triple-sure this is correct.

ANOTHER EDIT:

  • As mentioned above, AFAICT, ndarray is the de facto standard for heap-allocated, multi-dimensional arrays. It makes sense to me to write routines for these types and convert types from other packages at the call sites, provided the conversions are free and ergonomic. Just pointing this out. The implementation I have here already does this.
  • ~~I don't recall whether ndarray has a free, convenient, way to construct its Arrays from Rust's standard fixed-size arrays. If so it would be better to return a standard (fixed-size) Rust array from kron_id2_oneq. As it is, I return an ndarray type.~~

I'll try to benchmark or characterize this soon. If someone else already has code they can drop this into, feel free of course.

jlapeyre avatar Apr 30 '24 22:04 jlapeyre

Thanks for the comment, @jlapeyre! That's a really good idea, but I have one question. I see that you still use dot() in your implementation; are you getting better results with it compared to matmul()? I normally see a speedup when using faer.

I have been testing with your code, and this could be an example of a faer equivalent that I think should be faster as well. However, in this case, as you said, I also think it won't make a big performance difference. What do you think?

arnaucasau avatar May 02 '24 13:05 arnaucasau

Using mat_mul_to_dst

Using dot was just an oversight. I forgot to pay attention to matmul. A routine written specifically for 4x4 matrices would probably be better. But that is for later.

It looks like ndarray has no documented way to do C <- A * B where C is preallocated, right? This is a big deficit. Is there a way to more or less cast the ndarray type to a faer type just for the matrix multiplication, rather than carrying the faer type around? I'm trying to think of a way to make the code more future-proof. ndarray is a Rust standard for a linear algebra data type (one that doesn't implement non-allocating matrix multiplication, but whatever). Or, another way to look at it: we will almost certainly depend on ndarray for the foreseeable future because of Python, but faer, cgmath, and nalgebra may come and go.

To answer your question, since ndarray has to allocate, I agree using faer for now is a good idea. It might be faster for other reasons as well.

Other things

  • Did you test the special-case kron implementations isolated from other factors? I really should do this before committing to this approach. It seems clear that it should be much faster, but I'd like to be sure. If you did not, I can do it over the next few days.

  • My implementation got rid of this [] => Array2::eye(4). I'm pretty sure multiplying by this identity matrix can be elided.

  • swap(&mut aux, &mut matrix): I think this swaps the data. It should be possible to swap some kind of pointer to these without copying the data.

  • I used a verbose identifier in my implementation of kron..., which reduced readability. I prefer the lhs[(0, 0)] that you have. But what does lhs mean here? Is this like: $\text{lhs} \otimes \text{id}$ and then $\text{id} \otimes \text{rhs}$ ? But you have lhs everywhere.

  • There are some other uses in this crate of kron with 2x2 matrices, one of which is the identity. I think it's a good idea to change all of these in this PR.

jlapeyre avatar May 02 '24 18:05 jlapeyre

It looks like ndarray has no documented way to do C <- A * B where C is preallocated, right?

I thought that too, but looking at the documentation, I found the function general_mat_mul. I tested it against matmul, but I got worse results anyway. I was using MaybeUninit to avoid initializing a new matrix, but maybe this is not the best way to do it.

Is there a way to more-or-less cast the ndarray type to a faer type just for the matrix multiplication, rather than carrying the faer type around?

The only way that I know of is to use the faer_ext crate, which has different methods to convert an ndarray matrix into a faer one and vice versa. The downside is that it returns a view or a MatRef, which is fine for performing the matrix multiplication, but it implies an extra copy of the matrix if we need ownership afterward.

Did you test the special-case kron implementations isolated from other factors?

I only tested it together with a matrix multiplication simulating the scenario we have in this PR, but I'll make a test between all kron functions and the special cases (returning ndarray arrays and faer) :+1:

swap(&mut aux, &mut matrix): I think this swaps the data.

Oh, interesting. My intention was to work with pointers here to avoid having to copy any matrix. I'll investigate about it :thinking:

is this like: lhs ⊗ id and then id ⊗ rhs? But you have lhs everywhere

Exactly! Sorry, I left lhs in the second function by mistake when doing the copy/paste. It is fixed now in the code.

There are some other uses in this crate of kron with 2x2 matrices, one of which is the identity. I think it's a good idea to change all of these in this PR

If I'm not wrong, the other use of kron is in two_qubit_decompose.rs. It's a good idea to change it in this PR as well, but I think it might depend on the library we choose for the explicit/unrolled versions of kron. If we end up using faer, we'll need an extra copy of the matrices after the conversion, and maybe it's better (performance-wise) to just switch it in a follow-up that will also change the dot calls there to use faer.

arnaucasau avatar May 03 '24 15:05 arnaucasau

I found the function general_mat_mul

I didn't mention that one because I didn't see a way to skip the scalar and vector operations.

but it will imply needing an extra copy of the matrix if we need the ownership afterward.

Huh. So faer does not provide a way to transfer ownership of the underlying data as well? If so, in the medium term, we could bring it up with the author or make a PR.

it might depend on the library we choose to implement the explicit/unrolled versions of kron

As long as we have more than one library and data type representing matrices, this kind of problem will come up repeatedly. We could use a macro to handle more than one kind of data. Or use traits. Or actually, since the code is very small, just implement these methods for both kinds of matrices.

We need to move these functions from the modules where they are applied to a common module. My best thought at the moment is something called common.rs. When the number of functions in that module starts to get large, we could split the functions out by category.

jlapeyre avatar May 03 '24 17:05 jlapeyre

We need to move these functions from the modules where they are applied to a common module. My best thought at the moment is something called common.rs.

I like the idea! In the last commit, I moved the new kron functions there, along with the change_basis we were also using in other modules :+1:

After testing the new kron functions, I see that using ndarray is always faster than faer, so I changed the code to use ndarray except when we need a Mat with ownership. In this case, we could use the faer version to avoid the to_owned().

For this script, it doesn't make a big impact to use the versions that need preallocated arrays, so I used the allocating versions to make the code more idiomatic in the match, and I changed as well the other uses of kron in this crate (two_qubit_decompose.rs).

I've tested that the swap is swapping the pointers using the following assertion when running the tests, and it seems like that is the case:

let old_ptr1 = aux.as_ptr();
let old_ptr2 = matrix.as_ptr();
swap(&mut aux, &mut matrix);
assert_eq!(old_ptr1, matrix.as_ptr());
assert_eq!(old_ptr2, aux.as_ptr());

Summary of the changes in this PR:

  • The use of an explicit/unrolled version of kron 2x2 that improves the performances of the kron calls
  • The use of matmul() from faer improving the performance of the dot() calls
  • Introduction of common.rs as a module with different common functions between different scripts

In a follow-up, I'll change the two_qubit_decompose.rs script to use matmul() and maybe add more optimization like the new kron functions in the common.rs module.

arnaucasau avatar May 07 '24 12:05 arnaucasau

@mtreinish

After looking closely at two_qubit_decompose.rs, I think we should keep ndarray for matrix multiplication in the accelerate crate. Even though faer improves the performance of the multiplications for the 2x2 and 4x4 cases, we end up needing to perform extra copies of the matrices when we want to convert them into PyArray using into_pyarray_bound().

This is caused by the fact that we can transform a faer Mat into an ndarray array view, but we need to use to_owned() to get ownership.

An alternative way to get a speedup is to follow @jlapeyre's suggestion and implement the matrix multiplication and kron by hand in size-fixed functions for the two cases. In this PR you can find an example of this using a new module named common.rs.

In the new module we can find the following functions:

  • kron_matrix_x_identity(): A kron function that computes the Kronecker product of a given matrix and the identity
  • kron_identity_x_matrix(): A kron function that computes the Kronecker product of the identity and a given matrix
  • matrix_multiply_2x2(): A 2x2 matmul, completely unrolled
  • matrix_multiply_4x4(): A 4x4 matmul using three for loops, which I think the compiler can easily vectorize/unroll
  • determinant_4x4(): A function to calculate the determinant of a 4x4 matrix
  • change_basis(): The same function that previously lived in convert_2q_block_matrix.rs, moved here to be reused in more scripts

As you can see, the common.rs module holds more functions than just the matrix multiplications. All of them help increase the performance of the crate, as they are more efficient. The determinant_4x4 function also allows us to stop using faer for that purpose and avoid an extra copy of a matrix in one part of the code.

In the following tests, we can see the performance improvement of a single call to those functions compared with the ndarray and faer equivalents:

Kronecker product 2x2 (identity ⊗ matrix)

test tests::benchmark11_faer_kron_2x2              ... bench:         143 ns/iter (+/- 9)
test tests::benchmark12_ndarray_kron_2x2           ... bench:         144 ns/iter (+/- 3)
test tests::benchmark13_common_kron_2x2            ... bench:          42 ns/iter (+/- 4)

Matrix multiplication 2x2

test tests::benchmark_faer_matmul_2x2              ... bench:         150 ns/iter (+/- 2)
test tests::benchmark_ndarray_dot_2x2              ... bench:         230 ns/iter (+/- 10)
test tests::benchmark_common_matrix_multiply_2x2   ... bench:          63 ns/iter (+/- 0)

Matrix multiplication 4x4

test tests::benchmark_faer_matmul_4x4              ... bench:         285 ns/iter (+/- 6)
test tests::benchmark_ndarray_dot_4x4              ... bench:         313 ns/iter (+/- 15)
test tests::benchmark_common_matrix_multiply_4x4   ... bench:         151 ns/iter (+/- 4)

Determinant

test tests::benchmark_faer_determinant_4x4         ... bench:         521 ns/iter (+/- 19)
test tests::benchmark_common_determinant_4x4       ... bench:         124 ns/iter (+/- 9)

To test the performance, I have also run the transpiler tests using ASV. The result was that performance improved; these are the tests that got faster:

| Change   | Before [9d03b4bf] <main>   | After [a650f588] <AC/custom-linalg>   |   Ratio | Benchmark (Parameter)                                                                                      |
|----------|----------------------------|---------------------------------------|---------|------------------------------------------------------------------------------------------------------------|
| -        | 564±2ms                    | 512±0.8ms                             |    0.91 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'sabre') |
| -        | 302±0.2ms                  | 264±0.5ms                             |    0.87 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'sabre')      |
| -        | 126±0.8ms                  | 109±0.7ms                             |    0.86 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'dense')      |
| -        | 249±0.4ms                  | 212±0.9ms                             |    0.85 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'sabre') |
| -        | 158±0.4ms                  | 133±0.3ms                             |    0.84 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'sabre')      |
| -        | 264±0.7ms                  | 222±3ms                               |    0.84 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'dense')      |
| -        | 719±0.9ms                  | 599±0.2ms                             |    0.83 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'sabre') |
| -        | 161±1ms                    | 125±0.4ms                             |    0.78 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'dense')      |
| -        | 391±0.5ms                  | 306±0.5ms                             |    0.78 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'sabre')      |
| -        | 347±0.7ms                  | 259±0.4ms                             |    0.75 | transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'dense')      |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

I was going to add tests to common.rs, but I wanted to show this to you all first to get your opinion on the change of approach.

arnaucasau avatar Jun 03 '24 11:06 arnaucasau