
Expected multi-threaded and GPU performance of the different simulators should be documented.

Open dietmarwo opened this issue 3 years ago • 4 comments

It would be helpful to have some guidance on which simulator option to configure in which context, since the resulting performance is sometimes quite surprising.

Of course this depends somewhat on the CPU / GPU, but I think the results don't differ much across typical modern many-core CPUs.

I ran some tests myself for an 8-36 qubit inverse Fourier transform; see Table 1: Simulation benchmark. The table was produced using the code at https://gist.github.com/dietmarwo/23d30a89018d62c02294525092093671 on Linux Mint 20.3 / 16 core AMD 5950x CPU / NVIDIA 1660TI GPU. Versions used: {'qiskit-terra': '0.21.1', 'qiskit-aer': '0.10.4', 'qiskit-ignis': '0.7.1', 'qiskit-ibmq-provider': '0.19.2', 'qiskit': '0.37.1', 'qiskit-nature': None, 'qiskit-finance': None, 'qiskit-optimization': None, 'qiskit-machine-learning': None}

| simulator | options | time 8 qubits | time 12 qubits | time 18 qubits | time 24 qubits | time 30 qubits | time 36 qubits |
|---|---|---|---|---|---|---|---|
| aer_simulator | none | 0.90 | 2.11 | 3.43 | 28.25 | 1427.3 | 14.14 |
| aer_simulator | max_parallel_threads=1 | 0.91 | 1.82 | 4.28 | 111.28 | 9035.0 | 12.46 |
| aer_simulator | device='GPU' | 0.87 | 1.56 | 3.45 | 19.7 | cuda error | 13.89 |
| qasm_simulator | none | 0.89 | 1.60 | 2.93 | 29.12 | 1434.6 | 14.38 |
| qasm_simulator | max_parallel_threads=1 | 0.90 | 1.60 | 4.09 | 110.66 | 9028.2 | 13.02 |
| qasm_simulator | device='GPU' | 0.87 | 1.56 | 3.14 | 19.83 | cuda error | 14.49 |
| aer_simulator_statevector | none | 0.91 | 1.58 | 3.61 | 28.85 | 1430.8 | - |
| aer_simulator_statevector | max_parallel_threads=1 | 0.89 | 1.6 | 3.88 | 110.4 | 9022.1 | - |
| aer_simulator_statevector | device='GPU' | 0.87 | 1.56 | 2.96 | 19.31 | cuda error | - |
| aer_simulator_density_matrix | none | 0.91 | 10.06 | - | - | - | - |
| aer_simulator_density_matrix | max_parallel_threads=1 | 0.89 | 34.15 | - | - | - | - |
| aer_simulator_density_matrix | device='GPU' | 0.87 | 4.01 | - | - | - | - |
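For readers who want to reproduce a row of the table, here is a minimal sketch of how the three option sets could be passed to a simulator. It is not the code from the gist. The helper name `make_simulator` and the `OPTION_SETS` dictionary are hypothetical, and the import path `qiskit_aer` is an assumption that holds for newer qiskit-aer releases (version 0.10.4 used above exposed the same class under `qiskit.providers.aer`).

```python
# Option sets mirroring the "options" column of the benchmark table.
OPTION_SETS = {
    "none": {},
    "max_parallel_threads=1": {"max_parallel_threads": 1},
    "device='GPU'": {"device": "GPU"},
}


def make_simulator(method="automatic", option_set="none"):
    """Return a configured AerSimulator, or None if qiskit-aer is missing.

    `method` can be e.g. "automatic", "statevector" or "density_matrix".
    """
    try:
        # Assumption: newer qiskit-aer releases; older ones used
        # `from qiskit.providers.aer import AerSimulator`.
        from qiskit_aer import AerSimulator
    except ImportError:
        return None
    return AerSimulator(method=method, **OPTION_SETS[option_set])


if __name__ == "__main__":
    # Single-threaded statevector simulator, as in the second option row.
    print(make_simulator("statevector", "max_parallel_threads=1"))
```

Requesting `device='GPU'` only works with a GPU-enabled build of qiskit-aer, so that option set is best tried separately.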

What I don't understand:

  • Why is it faster for 36 qubits than for 24 qubits?
  • Why is there no GPU speedup for <= 18 qubits, except for aer_simulator_density_matrix?
  • Why does the time grow so fast for 24 and 30 qubits?

I needed this information to configure a parallel optimization algorithm that uses a Qiskit simulator inside the fitness function. If the simulation scales badly across threads, it is better to run the simulations single-threaded and parallelize the optimization instead. But maybe the simulators can be configured to scale better and I missed something?
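The trade-off can be made concrete with the 24-qubit aer_simulator timings from the table. The sketch below is a rough estimate, not a measurement: it assumes the default configuration uses all 16 cores, and that 16 independent single-threaded simulations would run side by side without slowing each other down, which is optimistic because they share memory bandwidth.

```python
# Timings for aer_simulator at 24 qubits, taken from the benchmark table.
T_DEFAULT = 28.25   # options: none (multi-threaded)
T_SINGLE = 111.28   # options: max_parallel_threads=1
CORES = 16          # AMD 5950x


def thread_speedup(t_single, t_multi):
    """Speedup of one simulation from the simulator's own threading."""
    return t_single / t_multi


def throughput_inner(t_multi):
    """Simulations per time unit when one multi-threaded run uses all cores."""
    return 1.0 / t_multi


def throughput_outer(t_single, workers):
    """Simulations per time unit with `workers` single-threaded runs in parallel.

    Assumption: the workers do not slow each other down.
    """
    return workers / t_single


speedup = thread_speedup(T_SINGLE, T_DEFAULT)
efficiency = speedup / CORES
inner = throughput_inner(T_DEFAULT)
outer = throughput_outer(T_SINGLE, CORES)

print(f"thread speedup:      {speedup:.2f}x on {CORES} cores")
print(f"parallel efficiency: {efficiency:.0%}")
print(f"throughput ratio (outer / inner): {outer / inner:.2f}")
```

Under these assumptions the simulator's own threading yields a speedup of a bit under 4x on 16 cores, so parallelizing at the optimizer level would give roughly four times the throughput.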

dietmarwo avatar Aug 23 '22 11:08 dietmarwo

Hi! I know this is a bit off topic, but have you been able to run these benchmarks in parallel on multiple GPUs using MPI?

dotslaser avatar Aug 26 '22 07:08 dotslaser

Unfortunately not, I am currently using only one GPU. What I did was specific to my environment. It would be nice if the enhanced documentation were more generic and complete, if possible. It is difficult to predict what a parameter change does, partly because Qiskit's multithreading is done inside the shared library controller_wrappers.cpython-39-x86_64-linux-gnu.so, where users have limited insight. For qasm and aer simulation I don't expect much gain from multiple GPUs anyway.

dietmarwo avatar Aug 30 '22 12:08 dietmarwo

Performance depends on the system configuration. In statevector simulation, the simulation time roughly doubles with each additional qubit. A 30-qubit simulation will therefore generally take about 64x (2^6) longer than a 24-qubit one.
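As a quick check of this rule against the benchmark table, the following sketch compares the predicted factor with the measured aer_simulator timings. It assumes pure doubling per qubit and ignores that the circuit itself also gets longer as qubits are added.

```python
def predicted_ratio(n_small, n_large):
    """Expected slowdown if the time doubles with every added qubit."""
    return 2 ** (n_large - n_small)


# aer_simulator timings from the benchmark table (24 and 30 qubits).
observed_default = 1427.3 / 28.25   # options: none
observed_single = 9035.0 / 111.28   # options: max_parallel_threads=1

print("predicted:", predicted_ratio(24, 30))
print(f"observed (default):       {observed_default:.1f}")
print(f"observed (single thread): {observed_single:.1f}")
```

The measured factors (about 50x multi-threaded, about 81x single-threaded) are of the same order as the predicted 64x, so the growth between 24 and 30 qubits is roughly what exponential scaling implies.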

A GPU has initialization overhead, so it is not effective for small qubit counts. The computational cost of a 12-qubit density matrix simulation is the same as that of a 24-qubit statevector simulation, so a GPU can work well for a 12-qubit density matrix.
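The equivalence follows from the sizes of the two representations: an n-qubit statevector holds 2^n amplitudes, while an n-qubit density matrix holds 4^n = 2^(2n) entries. A minimal sketch, assuming double-precision complex numbers at 16 bytes each (an assumption, the thread does not state the precision used):

```python
BYTES_PER_AMPLITUDE = 16  # assumption: complex128 (two 64-bit floats)


def statevector_elements(n_qubits):
    """Number of complex amplitudes in an n-qubit statevector."""
    return 2 ** n_qubits


def density_matrix_elements(n_qubits):
    """Number of complex entries in an n-qubit density matrix (2^n x 2^n)."""
    return 4 ** n_qubits


def to_gib(n_elements):
    """Memory in GiB for the given number of complex entries."""
    return n_elements * BYTES_PER_AMPLITUDE / 2 ** 30


# A 12-qubit density matrix is as large as a 24-qubit statevector.
print(density_matrix_elements(12) == statevector_elements(24))

for n in (24, 30, 36):
    print(f"{n}-qubit statevector: {to_gib(statevector_elements(n)):g} GiB")
```

Under this assumption a 30-qubit statevector needs 16 GiB, which may be why the GPU runs in the table report a cuda error at 30 qubits, although the thread does not confirm this. A 36-qubit statevector would need 1 TiB.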

I guess the 36-qubit simulations did not work properly.

Finally, QFT is a typical workload, but it is better to benchmark with more applications. We will show some documentation for performance in the near future.

hhorii avatar Aug 30 '22 13:08 hhorii

Thanks for the information.

> We will show some documentation for performance in the near future.

Looking forward to that. At https://github.com/dietmarwo/fast-cma-es/blob/master/tutorials/Quant.adoc#vqe-variational-quantum-eigensolver I wrote something about configuring the parallelization of VQE optimization. Good scaling cannot be achieved using Qiskit's own optimizers, but there are alternatives available.

dietmarwo avatar Sep 05 '22 19:09 dietmarwo