qiskit-aer
qiskit-aer copied to clipboard
add sampling options of iterative and binary search
Summary
Optimization of Sampling in QubitVector
Details and comments
In original implementation, a loop for sampling is iterated based on sample count. In this optimized implementation, a loop for sampling is iterated based on indices. In OpenMP, threads execute the same iterations if loop conditions are the same in most case. By using the same conditions with index construction and sampling, memory access is optimized if sampled values are randomly allocated in a qubitvector.
Here's a toy demo based on the conditional binomial method as implemented in numpy, etc. The single-threaded version is similar to the implementation in numpy. Performance depends on the binomial sampler used (on my laptop GSL seems faster than C++ std library), but is comparible to numpy.
// single-threaded
#include <iostream>
#include <random>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <chrono>
using namespace std::chrono;
int main()
{
const int nq = 25, nshot = 800000;
// const int nq=8, nshot=1000;
std::default_random_engine generator;
std::uniform_real_distribution<double> uniform(0.0, 1.0);
double *probs = new double[1 << nq], totalprob = 0.0;
int samples[nshot];
// Generate (unnormalized) nq-qubit probability distr. Don't include this in timing
for (int i = 0; i < 1 << nq; i++)
totalprob += (probs[i] = uniform(generator));
std::binomial_distribution<int> binom;
gsl_rng *gslgen = gsl_rng_alloc(gsl_rng_taus);
int s, offset = 0, r = nshot;
auto start = high_resolution_clock::now();
// Take nshot of samples from the above distribution, by conditional-binomial method:
for (int j = 0; j < (1 << nq) - 1; j++)
{
// s = binom(generator, std::binomial_distribution<int>::param_type(r, probs[j]/totalprob));
s = gsl_ran_binomial(gslgen, probs[j] / totalprob, r);
r -= s;
for (int k = 0; k < s; k++)
samples[offset++] = j;
if (!r)
break;
totalprob -= probs[j];
}
for (int k = 0; k < r; k++)
samples[offset++] = (1 << nq) - 1;
auto stop = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(stop - start);
std::cout << duration.count() << std::endl;
return 0;
}
For a parallel version you can divide the wavefn into roughly equal bins and then do a single-threaded multinomial sample to distribute samples between the bins before using OMP to sample within the bins. On my laptop the timing seems to scale well with the number of threads and is limited by the random number generation:
// multi-threaded
#include <iostream>
#include <random>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <omp.h>
#include <chrono>
using namespace std::chrono;
int main()
{
const int nq = 25, nshot = 800000, nthread = 8;
//const int nq = 8, nshot = 10000, nthread = 6;
std::default_random_engine generator;
std::uniform_real_distribution<double> uniform(0.0, 1.0);
double *probs = new double[1 << nq], totalprob = 0.0, partialtotal[nthread];
int samples[nshot];
// Generate (unnormalized) nq-qubit probability distr, summing partial accumulated probability into nthread ~equal bins
// Don't include this in timing (I assume that the accumulating partial probability was performed during last sweep over the wavefn before the measurement)
int js[nthread + 1];
js[0] = 0;
for (int i = 0, t = 0; t < nthread; t++)
{
partialtotal[t] = 0.0;
js[t + 1] = (t + 1) * (1 << nq) / nthread;
for (; i < js[t + 1]; i++)
partialtotal[t] += (probs[i] = uniform(generator));
totalprob += partialtotal[t];
}
gsl_rng *gslgen = gsl_rng_alloc(gsl_rng_taus);
// std::binomial_distribution<int> binom;
auto start = high_resolution_clock::now();
// Do a single-threaded multinomial to divide the shots among the bins
int r = nshot, offsets[nthread + 1];
offsets[0] = 0;
for (int t = 0; t < nthread - 1; t++)
{
int nsucc = gsl_ran_binomial(gslgen, partialtotal[t] / totalprob, r);
offsets[t + 1] = offsets[t] + nsucc;
r -= nsucc;
totalprob -= partialtotal[t];
}
offsets[nthread] = nshot;
// Do a parallel set of multinomials to sample from each of the bins in parallel. This should be the rate-limiting step
#pragma omp parallel num_threads(nthread)
{
int t = omp_get_thread_num();
int off = offsets[t], r = offsets[t + 1] - offsets[t], jn = js[t + 1], s;
gsl_rng *gslgen1 = gsl_rng_alloc(gsl_rng_taus);
double partial = partialtotal[t];
for (int j = js[t]; j < jn - 1; j++)
{
// s = binom(generator, std::binomial_distribution<int>::param_type(r, probs[j]/totalprob));
s = gsl_ran_binomial(gslgen1, probs[j] / partial, r);
r -= s;
for (int k = 0; k < s; k++)
samples[off++] = j;
if (!r)
break;
partial -= probs[j];
}
for (int k = 0; k < r; k++)
samples[off++] = jn - 1;
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(stop - start);
std::cout << duration.count() << std::endl;
/* for(int i=0; i<nshot; i++){
std::cout << samples[i] << std::endl;
} */
return 0;
}