Run "mba" in OpenCL
Currently in mba_benchmark.cpp, we see that the Spline creation module is run on the CPU while the interpolation is done on the GPU. This makes sense. However, would it be possible to make this portion (i.e., the Spline creation module) work on the GPU?
I don't remember all the implementation details right away, but it seems it would be possible to do the setup phase in OpenCL. It could even make sense because all underlying structures are regular.
It will, however, take some time, because I am a bit busy at the moment.
Denis, the setup was done using a lot of C++11 code :) That's why I was having a hard time understanding it, but I have a suspicion that it can be done. Let me know if I can help in any way.
I think it should be enough to implement the control lattice structure with OpenCL. See the referenced paper for the details of the algorithm.
The biggest problem is that it needs to be done in a generic way with respect to the number of dimensions, so one would need to do some OpenCL code generation.
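Something along these lines (just a sketch of the idea, with made-up names; not actual VexCL code):

~~~{.cpp}
// Sketch only: generate OpenCL source for an ndim-dimensional setup kernel
// at run time instead of handcrafting separate 1D/2D/3D kernels.
#include <sstream>
#include <string>

std::string lattice_setup_source(unsigned ndim) {
    std::ostringstream src;

    src << "kernel void setup_lattice(\n";
    for (unsigned d = 0; d < ndim; ++d)
        src << "    global const double *coord" << d << ",\n";
    src << "    global const double *value,\n"
           "    global double *delta,\n"
           "    global double *omega,\n"
           "    ulong n\n"
           "    )\n"
           "{\n"
           "    for (ulong idx = get_global_id(0); idx < n; idx += get_global_size(0)) {\n"
           "        /* accumulate the 4^" << ndim << " weighted contributions of point idx */\n"
           "    }\n"
           "}\n";

    return src.str();
}
~~~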
Why not take the FFT route? Handcraft those kernels for 1, 2, and 3D and make them work on the device for these three dimensions. For all other dimensions, it can continue to use the current host option?
I don't like the idea of keeping separate (but very similar) kernels when they all may be generated from a single source.
If you are referring to the basic BA algorithm in the paper, then there are loops that can easily be "OpenMP-fied" even in the current implementation.
Yes. Feel free to provide a pull request :).
So I looked at the mba implementation a bit closer. Now I know why I decided to stay on the CPU for the initialization.
First, and least important, VexCL supports parallel work with multiple compute devices. Since MBA may take a random set of coordinates to get interpolated values at, each device has to hold a complete copy of the control lattice phi. It did not seem right to either duplicate the initialization work for each device, or do initialization on the first device and then transfer the lattice to the rest of the devices.
Second, in the initialization loop over the data points here, the temporary arrays delta and omega are updated at some random positions. This could lead to data races with both the OpenMP and OpenCL parallelizations and would require the use of atomic operations, which would negatively affect performance.
For example, take a look at c217b227409a70844b24eeabd8559f3da857aa32. Here are the results of mba_benchmark with and without OpenMP parallelization:
No OpenMP
1. Capeverde (AMD Accelerated Parallel Processing)
surf(0.5, 0.5) = -4.48714e-05
surf(0.5, 0.5) = -4.48714e-05
[Profile: 8.889 sec.] (100.00%)
[ generate data: 0.047 sec.] ( 0.53%)
[ GPU: 3.711 sec.] ( 41.75%)
[ setup: 2.320 sec.] ( 26.10%)
[ interpolate: 1.385 sec.] ( 15.58%)
[ CPU: 5.131 sec.] ( 57.72%)
[ setup: 2.324 sec.] ( 26.14%)
[ interpolate: 2.806 sec.] ( 31.57%)
OpenMP
1. Capeverde (AMD Accelerated Parallel Processing)
surf(0.5, 0.5) = -4.48714e-05
surf(0.5, 0.5) = -4.48714e-05
[Profile: 17.936 sec.] (100.00%)
[ generate data: 0.047 sec.] ( 0.26%)
[ GPU: 12.662 sec.] ( 70.60%)
[ setup: 11.137 sec.] ( 62.09%)
[ interpolate: 1.519 sec.] ( 8.47%)
[ CPU: 5.226 sec.] ( 29.14%)
[ setup: 2.322 sec.] ( 12.95%)
[ interpolate: 2.903 sec.] ( 16.19%)
Note that setup now takes 11 seconds instead of 2.3.
So unless I did something wrong here, it seems it's better to leave the current MBA setup as it is.
Ok, I did something wrong here. After replacing the critical section with an atomic operation in ecad92c, the mba_benchmark output looks like this:
1. Capeverde (AMD Accelerated Parallel Processing)
surf(0.5, 0.5) = -4.48714e-05
surf(0.5, 0.5) = -4.48714e-05
[Profile: 8.834 sec.] (100.00%)
[ generate data: 0.046 sec.] ( 0.52%)
[ GPU: 3.642 sec.] ( 41.23%)
[ setup: 2.251 sec.] ( 25.48%)
[ interpolate: 1.386 sec.] ( 15.69%)
[ CPU: 5.145 sec.] ( 58.25%)
[ setup: 2.327 sec.] ( 26.34%)
[ interpolate: 2.818 sec.] ( 31.90%)
This is better than the serial version, but only slightly. This also introduces the requirement that the iterators to coordinates and values of data points point at contiguous chunks of memory. I am not sure the negligible speedup is worth it. What do you think?
If I look at the BA algorithm in the paper, there are three main loops. If I understand correctly, the innermost loop can be parallelized using Boost.Compute (or even Bolt), as it is potentially data-parallelizable. The third loop, where we eventually compute \phi, is parallelizable anyway. If we assume data parallelism, then barring the boundary points, \delta_{i+k, j+l} can be computed locally. There won't be any need for inter-data communication except at the boundary locations. \delta and \omega at these locations can be computed serially once the inner locations have been computed. Your other concern regarding multi-device operation is valid. Hence, a mechanism should be available to use this parallel version (either as an ifdef or some user-specified input). That being said, anything that utilizes any form of parallelization (CPU or GPU) is always welcome. Do you think I am correct in my assumptions?
I have another question regarding mba_benchmark.cpp:
~~~{.cpp}
vex::multivector<double, 2> C(ctx, n);
vex::vector<double> Z(ctx, n);

vex::copy(x, C(0));
vex::copy(y, C(1));

prof.tic_cl("interpolate");
for(size_t i = 0; i < m; ++i)
    Z = surf(C(0), C(1));
~~~
Now, is this the only way to copy x and y to the device? Would it be possible to fill C(0) and C(1) directly, without having to first allocate x and y on the host?
There are no host-allocated structures in the snippet you provided. Both vex::multivector and vex::vector are device structures. And those are allocated directly.
If what you meant to ask is whether it's possible to initialize a newly created device vector with host vector data, then the answer is yes, it is possible. See the list of vex::vector constructors.
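For example, given a std::vector<double> x on the host (and ctx and n as in the benchmark), a device copy can be created directly:

~~~{.cpp}
std::vector<double> x(n);        // host data
// ... fill x ...
vex::vector<double> X(ctx, x);   // device vector initialized with the host data
~~~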
The mba_benchmark does look a bit ugly in this regard. 56e236a fixes that.
Regarding the MBA algorithm: if you have a closer look at the BA algorithm on page 4 of the paper, you'll notice that there is an outer loop over scattered data points (for each point (x_c, y_c, z_c) in P do), which updates the accumulator arrays delta and omega on each iteration at random locations. This is the problem I was talking about earlier: it would lead to a data race and would require the use of atomic operations, which could possibly kill the performance gains of parallelization.
Yes, indeed that is the case for (for each point (x_c, y_c, z_c) in P do), and that is what I meant by "data parallelism". If I were to use a CPU as an example, I would divide the data into k parts (where k is the number of cores) and run BA on the control points within each part, except for the boundary points. This is a classic case of parallelizing region-merging algorithms using the "union-find" method. If you use atomics, it will indeed kill the performance. The advantage of spline interpolation is that it only requires points from the neighbors (1st, 2nd, 3rd, etc.) and not all the regions. I don't know how to go about doing data parallelism on a GPU :)
Now, for your previous comment:
Yes, the mba_benchmark looks much cleaner now. However, my question still remains unanswered. Let me explain my situation. I have a structured image grid having dimensions inSize[0] * inSize[1] * K. I would like to fill up x, y and z with the respective indices.
If I were to do this using OpenMP, I would do it the following way:
~~~{.cpp}
#pragma omp parallel for
for (size_t i = 0; i < n; ++i)
{
    x[i] = (float)(i % inSize[0]);
    y[i] = (float)(((i - inSize[0]) / inSize[0]) % inSize[1]);
    z[i] = (float)mSortedBands[((i - inSize[0]) / inSize[0] - inSize[1]) / inSize[1]];
}
~~~
where mSortedBands contains index values. Ideally, I should be able to do the whole thing in a kernel. In that case, I would not have to initialize the x, y, or z values on the host and could go directly to the device.
Any pointers on how one can do this in VexCL? Maybe this is the right time to write my first OpenCL (or should I say, VexCL) kernel :)
Also, why are we not freeing the x, y vectors in mba_benchmark? How do the device vectors get deallocated? Are they smart pointers?
Ok, the question about data allocation is a lot clearer now. You could do this:
~~~{.cpp}
vex::vector<float> x(ctx, n), y(ctx, n), z(ctx, n);

auto i = vex::element_index();
x = i % inSize[0];
y = ((i - inSize[0]) / inSize[0]) % inSize[1];
~~~
Not sure about mSortedBands, but if it were a vex::vector<int>, you could do
~~~{.cpp}
z = vex::permutation(((i - inSize[0]) / inSize[0] - inSize[1]) / inSize[1])(mSortedBands);
~~~
Regarding a K-way split of the input data, how would you do it on a CPU? Would each core skip points that do not belong to its subdomain? Or would you do a sort-by-key first, where the key is the subdomain each point belongs to? It seems that on a GPU only the second of these options would make sense, but then it has worse algorithmic complexity than the original operation.
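To make the second option concrete, a host-side sketch of what I mean (illustration only, made-up names):

~~~{.cpp}
// Bin points by the subdomain they fall into, then process bins one by one.
#include <algorithm>
#include <numeric>
#include <vector>

std::vector<size_t> order_by_subdomain(
        const std::vector<double> &x, const std::vector<double> &y,
        double xmin, double xmax, double ymin, double ymax,
        int nsub // subdomains per dimension
        )
{
    const size_t n = x.size();

    // Key = linear index of the subdomain containing the point.
    std::vector<int> key(n);
    for (size_t p = 0; p < n; ++p) {
        int kx = std::min<int>(nsub - 1, nsub * (x[p] - xmin) / (xmax - xmin));
        int ky = std::min<int>(nsub - 1, nsub * (y[p] - ymin) / (ymax - ymin));
        key[p] = ky * nsub + kx;
    }

    // Sort point indices by key (the "sort-by-key" step).
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
            [&](size_t a, size_t b) { return key[a] < key[b]; });

    return order;
}
~~~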
Perfect, then the only thing that would be of interest in this implementation would be the aspect of data parallelization. To understand this concept, take a look at Fig 3 in this paper: http://crd.lbl.gov/assets/pubs_presos/Harrison-EGPGV2011.pdf
Regarding the deallocation: vex::vectors behave in the same way std::vectors do. They deallocate themselves when going out of scope. That's an example of the RAII idiom.
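For example (with ctx and n as before):

~~~{.cpp}
{
    vex::vector<double> tmp(ctx, n); // device memory is allocated here
    // ... use tmp ...
}   // tmp goes out of scope here and the device memory is released
~~~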
So in the mba_benchmark, do I explicitly have to call p.clear(), v.clear(), and so on?
No, you just let them go out of scope. No memory will leak.
The technique described in the paper by Harrison et al. (and domain decomposition in general) is suitable for fat cluster nodes or CPU cores. This is an example of coarse-grained parallelism. GPUs, on the other hand, have fine-grained parallelism, where each thread is assigned to a single data point (e.g., a matrix or vector element). So I don't think this approach could be used here.
Thanks for the terms :) I always wondered why we cannot use a GPU for coarse-grained parallelism. However, the parallelism I am hinting at can (??) be achieved by interleaving. As indicated in Sec. 3.2, "...Since each data point in P influences a set of 4 x 4 neighboring control points...", I need to ensure that when I am running the outer loop, I am only updating those control points which are not influenced by neighbors. The only way this would be possible is to use a multi-grid mechanism. Let us assume I have points x \in [0, M-1], y \in [0, N-1]. Then:
Grid 1: x_1 \in [0 : 4 : M-1], y_1 \in [0 : 4 : N-1], ...and so on for each dimension
Grid 2: x_2 \in [1 : 4 : M-1], y_2 \in [1 : 4 : N-1], ...and so on for each dimension
Grid 3: x_3 \in [2 : 4 : M-1], y_3 \in [2 : 4 : N-1], ...and so on for each dimension
Grid 4: x_4 \in [3 : 4 : M-1], y_4 \in [3 : 4 : N-1], ...and so on for each dimension
All the other operations within the main loop are localized (i.e., no communication between processes). As per your notation, I can (??) run a fine-grained parallelization for each Grid_i and then sum them all up (because that's what is actually happening in the innermost loop). What do you think? Have I interpreted this correctly?
Could you provide a working OpenMP prototype of your idea? Just for the main loop over data points on page 4, with some random input.
Denis, I had an even closer look at the paper (after all this), as I was convinced of what I mentioned. Take a look at the last paragraph of page 3 in the paper (right-hand column): "...In general, we resolve multiple assignments...". Every control point \phi is dependent on the data points in its 4 x 4 neighborhood. Let's take the analogy of median filtering (5 x 5), which, as we know, is embarrassingly parallel. Now, to check this, I browsed the web and came up with this codebase on GitHub: https://github.com/SINTEF-Geometry/MBA
Check the functions:
void MBA::BAalg()
inline void ijst(int m, int n, double uc, double vc, int& i, int& j, double& s, double& t)
and
inline void WKLandSum2(double s, double t, double w_kl[4][4], double& sum_w_ab2)
I have extracted all the relevant code pertaining to the BA algorithm only. We are not assuming UNIFORM_CUBIC_C1_SPLINES:
~~~{.cpp}
for (int ip = 0; ip < noPoints; ip++)
{
// Map to the half open domain Omega = [0,m) x [0,n)
// The mapped uc and vc must be (strictly) less than m and n respectively
double uc = (data_.U(ip) - data_.umin()) * interval_normalization_factor_u;
double vc = (data_.V(ip) - data_.vmin()) * interval_normalization_factor_v;
int i, j;
double s, t;
UCBspl::ijst(m_, n_, uc, vc, i, j, s, t);
// compute w_kl's and SumSum w_ab^2 here:
double w_kl[4][4];
int k,l;
double sum_w_ab2_inv = 0.0;
UCBspl::WKLandSum2(s, t, w_kl, sum_w_ab2_inv);
sum_w_ab2_inv = double(1) / sum_w_ab2_inv;
double zc = data_.Z()[ip];
// check p. 231: k=(i+1) - flor(xc) and l = ...
for (k = 0; k <= 3; k++)
{
for (l = 0; l <=3; l++)
{
// compute phi_kl with equation (3)
double tmp = w_kl[k][l];
// 1. Originally
double phi_kl = tmp * zc * sum_w_ab2_inv;
// 2. Alternatively, to let it tapper of more smoothly (but more efficient if permantly)
//double t = 0.8; double phi_kl = (1.0-t)*tmp*zc/sum_w_ab2 + t*zc;
// 3. Alternatively, with linear term
// double alpha = 0.01; double phi_kl = (tmp*zc + alpha*(tmp - sum_w_ab2)) / sum_w_ab2;
// And alternatively for equation (5):
// from |w_kl|^2 to |w_kl| to get a weighted average
// just skip the next statement
tmp *= tmp;
delta_(i+k,j+l) += tmp*phi_kl;
omega_(i+k,j+l) += tmp;
}
}
}
// s,t \in [0,1) (but special on gridlines m and n)
// i,j \in [-1, ???
inline void ijst(int m, int n, double uc, double vc, int& i, int& j, double& s, double& t)
{
//int i = std::min((int)uc - 1, m-2);
//int j = std::min((int)vc - 1, n-2);
#ifdef UNIFORM_CUBIC_C1_SPLINES
i = 2*((int)uc) - 1;
j = 2*((int)vc) - 1;
#else
i = (int)uc - 1;
j = (int)vc - 1;
#endif
s = uc - floor(uc);
t = vc - floor(vc);
// adjust for x or y on gridlines m and n (since impl. has 0 <= x <= m and 0 <= y <= n
#ifdef UNIFORM_CUBIC_C1_SPLINES
if (i == 2*m-1) {
i-=2;
s = 1;
}
if (j == 2*n-1) {
j-=2;
t = 1;
}
#else
if (i == m-1) {
i--;
s = 1;
}
if (j == n-1) {
j--;
t = 1;
}
#endif
}
inline void WKLandSum2(double s, double t, double w_kl[4][4], double& sum_w_ab2)
{
sum_w_ab2 = 0.0;
double Bs0 = B_0(s); double Bt0 = B_0(t);
double Bs1 = B_1(s); double Bt1 = B_1(t);
double Bs2 = B_2(s); double Bt2 = B_2(t);
double Bs3 = B_3(s); double Bt3 = B_3(t);
double tmp;
// unrolled by Odd Andersen 15. dec. 2003, for optimization
tmp = Bs0 * Bt0; w_kl[0][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs0 * Bt1; w_kl[0][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs0 * Bt2; w_kl[0][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs0 * Bt3; w_kl[0][3] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt0; w_kl[1][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt1; w_kl[1][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt2; w_kl[1][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs1 * Bt3; w_kl[1][3] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt0; w_kl[2][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt1; w_kl[2][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt2; w_kl[2][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs2 * Bt3; w_kl[2][3] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt0; w_kl[3][0] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt1; w_kl[3][1] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt2; w_kl[3][2] = tmp; sum_w_ab2 += tmp * tmp;
tmp = Bs3 * Bt3; w_kl[3][3] = tmp; sum_w_ab2 += tmp * tmp;
// int k,l;
// sum_w_ab2 = 0.0;
// for (k = 0; k <= 3; k++) {
// for (l = 0; l <=3; l++) {
// double tmp = w(k, l, s, t);
// w_kl[k][l] = tmp;
// sum_w_ab2 += (tmp*tmp);
// }
// }
}
~~~
I was not saying that you are wrong; I merely asked for a working prototype of your parallelization idea, to see what effect it would have on performance and memory footprint. I don't see it in the code snippet you provided.
Cheers, Denis
I will take a crack at it. I would not, however, be able to translate it to VexCL, but I think I can do it using OpenMP.
An OpenMP implementation should be enough. Also, you don't need to implement the full BA algorithm. A parallelization of the loop on p. 4 is enough.
Since you do not like "debug" environments (I read somewhere else that you like to plan your code incrementally!), let us discuss this here :) I can foresee that parallelization may lead to increased memory load, but if it is worth it, then so be it.
The first part of the loop in my previous comment is perfectly parallelizable, as it does not depend on anything else (i.e., it is only dependent on u, v / i, j).
The problem arises with the innermost loop. So let us address that first.
Suppose I have a grid size of (m, n). Here is how I would go about it:
a.) Create empty \delta and \omega arrays.
b.) As each control point depends only on its 4 x 4 neighborhood, I would divide the outer loop into two for loops (I don't know how VexCL would do it, though):
for (int i_idx = 0 + offset1; i_idx < m; i_idx += 4) // offset1 = 0, 1, 2, 3
for (int j_idx = 0 + offset2; j_idx < n; j_idx += 4) // offset2 = 0, 1, 2, 3
{
run the entire pipeline on p. 4 (including the inner loop) and update all the corresponding locations of the \delta and \omega matrices for all pixels in parallel. Here, we are only updating 1/16 of the original grid size. As we have allocated memory for \delta and \omega, we can safely update them concurrently without any race conditions, because the grid points being updated at this index location (i, j) are completely outside the influence zone of the other corresponding grid locations.
}
c.) Now, if this can be ported to a kernel, we need an outer loop that addresses all 16 offsets that arise. This would then be very similar to how you write your interpolation loop by calling the kernel 100 times in mba_benchmark.cpp.
So, pseudocode for this would be (assuming integer grid locations for now):
~~~{.cpp}
std::map<int, std::pair<int, int> > offsets; // i.e., 0 : (0,0), 1 : (0,1), ..., 15 : (3,3)

// explicitly isolate the indices on the host, or do it implicitly on the device;
// store the indices in a std::map<> myIndices

// allocate memory for \delta and \omega

for (int i = 0; i < offsets.size(); ++i)
{
    // OpenCL kernel operation
    BAAlg(myIndices[i], \delta, \omega, \grid_data); // if done explicitly
    // or
    BAAlg(i, \delta, \omega, \grid_data);            // if done implicitly

    // I have written a C++ function. I presume \grid_data will be copied to the
    // device prior to calling this function, and so will \delta and \omega.
}
~~~
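To make the idea a bit more concrete, here is a rough OpenMP sketch of the strided passes (names are made up and I have not tested this):

~~~{.cpp}
// 2D case, integer grid locations. Cells visited within one (offset1, offset2)
// pass are at least 4 apart in each direction, so their 4 x 4 update regions
// in delta and omega do not overlap and no atomics are needed.
for (int offset1 = 0; offset1 < 4; ++offset1)
for (int offset2 = 0; offset2 < 4; ++offset2)
{
    #pragma omp parallel for collapse(2)
    for (int i_idx = offset1; i_idx < m; i_idx += 4)
    for (int j_idx = offset2; j_idx < n; j_idx += 4)
    {
        // run the BA update of p. 4 for all data points that fall into grid
        // cell (i_idx, j_idx), accumulating into delta(i_idx + k, j_idx + l)
        // and omega(i_idx + k, j_idx + l) for k, l = 0..3
        update_cell(i_idx, j_idx, delta, omega, grid_data); // made-up helper
    }
}
~~~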
Does this make sense?
So you will need to run the strided loop in (b) 4^ndim times, where ndim is the number of dimensions (2 in your case). I suspect that each run would take as much time as a non-strided one due to caching issues. So the whole thing would possibly run 4^ndim times slower than the serial one.
On the other hand, you should now be able to run the strided loop in parallel on each core of your CPU. This should not give you a linear speedup though, because the operation seems to be memory bound, and memory bandwidth does not scale as well as arithmetic. For example, from my experience with SpMV, OpenMP runs twice as fast as serial code regardless of the number of cores used.
So in total I expect it to be slower than serial code, but one can only be sure after some experimenting. That's why I would like to see some compilable, runnable, and measurable code.
Also, could you please enclose the code snippets with code markers as in ~~~{.cpp} int foo = 1; ~~~
This would increase readability and look like this:
~~~{.cpp}
int foo = 1;
~~~