Gpufit Can GPU fit handle functions that returns vectors instead of a single value?

Hi,

I have a function that returns more than value that I have to fit, like a f : R^3 -> R^3. I was wondering if GPUFit can handle such cases as well, but it doesn't seem to do so.

Regards

Feb 27 '18 14:02 lukkio88

Can you be more specific? In general, a multidimensional fit can be reformulated as a normal, scalar fit, and solved accordingly.

Feb 27 '18 14:02 superchromix

I meant something link:

f(x_1,x_2,x_3 ; alpha_1, alpha_2) = {alpha_1x_1 + alpha_2x_2, alpha_1x_2 + alpha_2x_3, alpha_1alpha_2x_1 + e^alpha_2 x_3};

So my samples are point in 3D space and I have to fit those with that vector function. My actual case is much more complicated and doesn't have a simple closed form expression.

You can also imagine the function has to be fit using a LSE minimizer.

Feb 27 '18 14:02 lukkio88

Is it possible to fit such a function? Or is there a workaround for that?

Feb 27 '18 15:02 lukkio88

Yes, it can but you have to use a customized model function and fit estimator. Writing your own model function and fit estimator you can easily fit functions with vectors as output in Gpufit. See: http://gpufit.readthedocs.io/en/latest/customization.html

I guess using the built-in LSE estimator would have problems though because it assumes that the number of model values is equal to the number of independent variable values.

Feb 28 '18 10:02 jkfindeisen

According to the documentation I should...

Define an additional model ID in file constants.h. (Easy...
Implement a CUDA device function within a newly created .cuh file in folder Gpufit/Gpufit/models according to the following template.

__device__ void ... (               // ... = function name
    float const * parameters,
    int const n_fits,
    int const n_points,
    float * value,
    float * derivative,
    int const point_index,
    int const fit_index,
    int const chunk_index,
    char * user_info,
    std::size_t const user_info_size)
{
    ///////////////////////////// value //////////////////////////////

    value[point_index] = ... ;                      // formula calculating fit model values

    /////////////////////////// derivative ///////////////////////////
    float * current_derivative = derivative + point_index;

    current_derivative[0 * n_points] = ... ;  // formula calculating derivative values with respect to parameters[0]
    current_derivative[1 * n_points] = ... ;  // formula calculating derivative values with respect to parameters[1]
    .
    .
    .
}

How do I specify in this case, "the derivative is a matrix" and "the function should output three values" instead of one? I assume for the derivative I'll just address the array accordingly, it should be a problem.

Include the newly created .cuh file in models.cuh (Or I just add my new model in an existing .cuh I guess)
Add a switch case in the CUDA device function calculate_model() in file models.cuh to allow calling the newly added model function. (Easy to do I guess, is there anything I should keep in mind that I might possibly not considering?).

For the estimator I guess I'll be asking something later, but should I essentially implement a specific LSE estimater for my vector function?

Thx

Feb 28 '18 11:02 lukkio88

Hi, could anyone give me more details based on what I asked above? I can't figure out how to proper index the derivatives and function values in my case. The documentation doesn't help much...

Feb 28 '18 18:02 lukkio88

Hi,

I think what you're asking for is already possible using the existing Gpufit framework. Your function returns a vector of values, let's say 3 values, and you are fitting data that has three data values per coordinate.

In this situation you can treat the problem as a simultaneous fit to multiple datasets, with shared parameters. For example, let's say your data values are a 3 x N array (a vector of 3 values for N coordinates) with a value [Y1,Y2,Y3] at each coordinate N. You could concatenate these values into a long 1D array [Y1(x1), Y1(x2), Y1(x3)...... Y2(x1), Y2(x2), Y2(x3).... Y3(x1), Y3(x2), Y3(x3), .... ] for sending to Gpufit. Next, write your model function to return the appropriate value, depending on which element it is working on - i.e. if your model function is working on a point in the Y1 section of the array, then it should return the first element of the vector, if it is in the Y2 section, then the second element, and so on. The same is true for the derivative values. The function is always using the same parameter set, however.

The existing LSE or MLE estimators should already work for this case, without modification.

Does this solution address your question?

Mar 01 '18 09:03 superchromix

I fully agree with superchromix. The existing LSE and MLE estimators will work with vector-valued functions as long as the number of data points per fit is the number of x coordinates times number of values per fit.

Just for completeness: an alternative ordering would be [Y1(x1) Y2(x1) Y3(x1) Y1(x2) Y2(x2) Y3(x3) ...].

The task reduces to implement a custom model function. Me might not yet have an example for that case. Might be good to have one.

Mar 01 '18 09:03 jkfindeisen

Hi superchromix, jkfindeisen.

Because there's no example that would expound the details of what you're proposing I'd need you to guide me... So please bear with me.

In this situation you can treat the problem as a simultaneous fit to multiple datasets, with shared parameters. For example, let's say your data values are a 3 x N array (a vector of 3 values for N coordinates) with a value [Y1,Y2,Y3] at each coordinate N.

So far I'm with you...

For example, let's say your data values are a 3 x N array (a vector of 3 values for N coordinates) with a value [Y1,Y2,Y3] at each coordinate N. You could concatenate these values into a long 1D array [Y1(x1), Y1(x2), Y1(x3)...... Y2(x1), Y2(x2), Y2(x3).... Y3(x1), Y3(x2), Y3(x3), .... ] for sending to Gpufit.

For the examples provided with GPUFit the generation of the samples is implemented in a separate function. In the case of the void gauss_fit_2d_example() {...} this is where the samples generation is performed. I assume I'll need to create a function like this as unit test for my model, which convey the generation of samples according to the layout you mentioned, is this correct?

Next, write your model function to return the appropriate value, depending on which element it is working on - i.e. if your model function is working on a point in the Y1 section of the array, then it should return the first element of the vector, if it is in the Y2 section, then the second element, and so on.

According to the templates in the customization page, what parameter/parameters of the model function would tell me on what section of the array I'd be working on?

The same is true for the derivative values. The function is always using the same parameter set, however.

About the derivatives/hessians, I'm understanding that in this case the derivative would be a a matrix, while the hessian would be a 3D array in theory (though I'd re-index a 1D array to reproduce the same structure) is this correct?

Mar 01 '18 11:03 lukkio88

I assume I'll need to create a function like this as unit test for my model, which convey the generation of samples according to the layout you mentioned, is this correct?

It's up to you if you want to create such a unit test

According to the templates in the customization page, what parameter/parameters of the model function would tell me on what section of the array I'd be working on?

The "point index" tells you which element you are working on. Your model function would contain something like a switch statement, returning a different value depending on the point index.

About the derivatives/hessians, I'm understanding that in this case the derivative would be a a matrix, while the hessian would be a 3D array in theory (though I'd re-index a 1D array to reproduce the same structure) is this correct?

The derivatives are also returned as a 1D array, of length n_parameters * n_points. In the example I gave above for each parameter and data point you have three derivatives, and you need to return them with the same ordering as how you choose to organize the input data.

Mar 01 '18 11:03 superchromix

Continuing with the example, the final size of your derivatives array would be n_parameters * n_points * length_of_function_vector (three in this case).

Mar 01 '18 12:03 superchromix

To see how to organize the derivatives, you can look into e.g. the Gaussian model function code.

e.g. http://github.com/gpufit/Gpufit/blob/master/Gpufit/models/gauss_1d.cuh

Mar 01 '18 12:03 superchromix

No. The derivatives are also returned as a 1D array, of length n_parameters * n_points. In the example I gave above for each parameter and data point you have three derivatives, and you need to return them with the same ordering as how you choose to organize the input data.

But in my case that 1D array will represent the Jacobian of my model right? How about the hessian instead?

In the example of the 1D gaussian function (which dependes on 4 parameters) your 1D array of derivatives represents the gradient of the model with respect to the parameters to be fitted. This means if I had two components that 1D array should represent a matrix. Is that right?

(It will still be a 1D array, but it's the interpretation that changes).

Unless, again, I'm missing something.

Mar 01 '18 12:03 lukkio88

You don't need to calculate the Hessian - you only want the derivative of the function with respect to each of the parameters. You are essentially treating your function, which returns a vector (of e.g. length three) as three separate functions.

Mar 01 '18 12:03 superchromix

I won't need the hessian because the estimator won't change, right? I'll re-use the ones already available.

Mar 01 '18 12:03 lukkio88

Yes, you can use the standard estimator.

Mar 01 '18 12:03 superchromix

Everything boils down to just create a new model with the features you mentioned then, if this is true I'll follow the guidelines you just gave me. I'll eventually post the example I have in mind, if it is of any interest.

Mar 01 '18 12:03 lukkio88

Great. Yes that would be helpful - we could add it to the docs. Post back here if you need any more guidance. You might want to create a simple toy model function to get a feeling for this first, before implementing your specific function.

Mar 01 '18 12:03 superchromix

I'll try. Thank you.

Mar 01 '18 12:03 lukkio88

Hi, I've this toy example.

/* Description of the test function
* =================================
* Computes function value and derivatives of the function
* 
* f(x[0],x[1],x[2]) = {p[0]*x[0] + p[1]*x[1],p[0]*x[1] + p[2]*x[2]}
* The derivative, namely the Jacobian with respect the the parameters is given by
* Jacobian(f) = {
*	x[0], x[1], 0.0f,
*   x[1], 0.0f, x[2]
*	}
*
*/

__device__ void calculate_vector_function(
    float const * parameters,
    int const n_fits,
    int const n_points,
    float * value,
    float * derivative,
    int const point_index,
    int const fit_index,
    int const chunk_index,
    char * user_info,
    std::size_t const user_info_size) {
	// indices

    float * user_info_float = (float*)user_info;
    float x = 0.0f;
    if (!user_info_float)
    {
        x = point_index;
    }
    else if (user_info_size / sizeof(float) == n_points)
    {
        x = user_info_float[point_index];
    }
    else if (user_info_size / sizeof(float) > n_points)
    {
        int const chunk_begin = chunk_index * n_fits * n_points;
        int const fit_begin = fit_index * n_points;
        x = user_info_float[chunk_begin + fit_begin + point_index];
    }

    // parameters

    float const * p = parameters;
    
    // values and jacobian
	if(!(point_index & 1)) {

		value[point_index] = p[0]*x[0] + p[1]*x[1]; //first component
		
		current_derivative[0 * n_points]  = x[0]; //gradient first component
		current_derivative[1 * n_points]  = x[1];
		current_derivative[2 * n_points]  = 0.0f;

	} else {

		value[point_index] = p[0]*x[1] + p[2]*x[2]; //second component
		
		current_derivative[3 * n_points]  = x[1]; //gradient second component
		current_derivative[4 * n_points]  = 0.0f;
		current_derivative[5 * n_points]  = x[2];	

	}


	}

Now, did you mean something like this? Also if you could clarify how do I take into account that some parameters are shared that would be helpful.

Mar 01 '18 15:03 lukkio88

I think it should be something more like this.

This really depends on how you are passing the data into Gpufit. Below, I assume that the data is interleaved - i.e. [Y1(0), Y2(0), Y1(1), Y2(1), ...] . I also removed the parts involving user_info, since it's not strictly necessary to use custom x values.


__device__ void calculate_vector_function(
    float const * parameters,
    int const n_fits,
    int const n_points,
    float * value,
    float * derivative,
    int const point_index,
    int const fit_index,
    int const chunk_index,
    char * user_info,
    std::size_t const user_info_size) 

{
    // indices

    float * user_info_float = (float*)user_info;
    float x = 0.0f;


    x = point_index / 2;    // INTEGER DIVISION


    // parameters

    float const * p = parameters;
    
    // derivative

    float * current_derivative = derivative + point_index;

    // values and jacobian
    if(point_index % 2 == 0) {

        // even point indices - first element of the vector

        value[point_index] = p[0]*x + p[1]*x; //first component
        current_derivative[0 * n_points]  = x; //gradient first component
        current_derivative[1 * n_points]  = x;
        current_derivative[2 * n_points]  = 0.0f;

    } else {

        // odd point indices - second element of the vector

        value[point_index] = p[0]*x[1] + p[2]*x[2]; //second component
		
        current_derivative[0 * n_points]  = x; //gradient second component
        current_derivative[1 * n_points]  = 0.0f;
        current_derivative[2 * n_points]  = x;	

    }

}

`

Mar 01 '18 16:03 superchromix

How about the shared parameters part? (You forgot to remove the array indices in the second component by the way). Still confused how is that supposed to be done. Moreover it seems to me you used the same coordinate x to compute the function value, but the function depends from three scalar values and returns two components. Basically as I said in the comment I'm trying to implement:

f(x[0],x[1],x[2]) = {p[0]*x[0] + p[1]*x[1],p[0]*x[1] + p[2]*x[2]}

Mar 01 '18 16:03 lukkio88

So, if your function is

f(x,y,z) = (P0 * x + P1 * y, P0 * y + P2 * z)

Then you just have to choose how you want to organize the X, Y, and Z coordinates, and the two values of the function, in the input array. The same thing is done in the 2D Gaussian example.

one possible organization:

[Y0(x0,y0,z0),Y1(x0,y0,z0),Y0(x1,y0,z0),Y1(x1,y0,z0),........]

it's up to you to decide this.

Mar 01 '18 16:03 superchromix

If you have unequal numbers of X, Y, and Z values, then you would also have to specify the number of X, Y, and Z points to the model function, via user_info for example, or the function has no way of knowing what the x,y,z coordinates are.

Mar 01 '18 16:03 superchromix

So, if your function is f(x,y,z) = (P0 * x + P1 * y, P0 * x + P2 * y)

No, the function is

f(x,y,z) = (P0x + P1y,P0y + P2z);

For the coordinates organization I'm assuming this

x[0],y[0],z[0],x[1],y[1],z[1],...,x[i],y[i],z[i],....

If N are my samples, as you said I need an array 'v' of size 3*N, therefore I can retrive the coordinates x,y,z of the i-th sample as

(x,y,z) = (v[3i],v[3i+1],v[3*i+2]).

Mar 01 '18 16:03 lukkio88

If your function lives in a three dimensional space, then the total data size must be N_x * N_y * N_z, where N_x is the number of X coordinates, etc.

Unless, that is, your data is unevenly sampled, in which case you need to use the user_info parameter to pass in the coordinates.

Mar 01 '18 16:03 superchromix

If your function lives in a three dimensional space, then the total data size must be N_x * N_y * N_z, where N_x is the number of X coordinates, etc.

I'm assuming N_x = N_y = N_y, so I'll have a total of N_x^3 samples.

Also I've modified my function model as follows

__device__ void calculate_vector_function(
    float const * parameters,
    int const n_fits,
    int const n_points,
    float * value,
    float * derivative,
    int const point_index,
    int const fit_index,
    int const chunk_index,
    char * user_info,
    std::size_t const user_info_size) {
	
	// indices
	int const n_points_x = cbrt((float)n_points);
	int const point_index_z = point_index / (n_points_x*n_points_x);
	int const point_index_y = (point_index - point_index_z*n_points_x*n_points_x)/n_points_x;
    int const point_index_x = point_index - point_index_z*n_points_x*n_points_x - point_index_y * n_points_x;

	float const argx = point_index_x;
	float const argy = point_index_y;
	float const argz = point_index_z;

    // parameters

    float const * p = parameters;
	float * current_derivative = derivative + point_index;
    
    // values and jacobian
	if(point_index % 2 == 0) {

		value[point_index] = p[0]*argx + p[1]*argy; //first component
		
		current_derivative[0 * n_points]  = argx; //gradient first component
		current_derivative[1 * n_points]  = argy;
		current_derivative[2 * n_points]  = 0.0f;

	} else {

		value[point_index] = p[0]*argy + p[2]*argz; //second component
		
		current_derivative[0 * n_points]  = argy; //gradient second component
		current_derivative[1 * n_points]  = 0.0f;
		current_derivative[2 * n_points]  = argz;	

	}


	}

The reason for this bit:

// indices
	int const n_points_x = cbrt((float)n_points);
	int const point_index_z = point_index / (n_points_x*n_points_x);
	int const point_index_y = (point_index - point_index_z*n_points_x*n_points_x)/n_points_x;
    int const point_index_x = point_index - point_index_z*n_points_x*n_points_x - point_index_y * n_points_x;

It's because given that I have N^3 samples then the the the indexing in the linear array follows the formula

i*N^2 + j*N + k; //i index for x, j index for y, k index for z

The reason for this bit:

	float const argx = point_index_x;
	float const argy = point_index_y;
	float const argz = point_index_z;

It's because the gauss 2D function does something similar, not sure why though.

Mar 01 '18 16:03 lukkio88

The multi-dimensionality makes this "toy" model more complicated, but I think this is what you want.


__device__ void calculate_vector_function(
    float const * parameters,
    int const n_fits,
    int const n_points,
    float * value,
    float * derivative,
    int const point_index,
    int const fit_index,
    int const chunk_index,
    char * user_info,
    std::size_t const user_info_size) 

{
    // indices

    int_const n_function_values = 2;

    int const n_points_x = cbrt((float) (n_points/n_function_values));
    int const n_points_y = n_points_x;
    int const n_points_z = n_points_x;

    int const point_index_z = point_index / (n_points_x*n_points_y*n_function_values);

    int const point_index_y = (point_index 
				- (point_index_z*n_points_x*n_points_y*n_function_values) ) 
				/ (n_points_x*n_function_values);

    int const point_index_x = ( (point_index 
				- (point_index_z*n_points_x*n_points_y*n_function_values)) 
				- (point_index_y*n_points_x*n_function_values) ) 
				/ (n_function_values);

    float const x = (float) point_index_x;
    float const y = (float) point_index_y;
    float const z = (float) point_index_z;

    // parameters

    float const * p = parameters;
    
    // derivative

    float * current_derivative = derivative + point_index;

    // values and jacobian
    if(point_index % 2 == 0) {

        // even point indices - first element of the vector

        value[point_index] = p[0]*x + p[1]*y; //first component
        current_derivative[0 * n_points]  = x; //gradient first component
        current_derivative[1 * n_points]  = y;
        current_derivative[2 * n_points]  = 0.0f;

    } else {

        // odd point indices - second element of the vector

        value[point_index] = p[0]*y + p[2]*z; //second component
		
        current_derivative[0 * n_points]  = x; //gradient second component
        current_derivative[1 * n_points]  = 0.0f;
        current_derivative[2 * n_points]  = z;	

    }

}

Mar 01 '18 17:03 superchromix

And how do I use this for the fitting now? Should I fit just one function? or two? The code below is essentially an attempt to modify the gauss2D example in order to fit my vector field. Can you tell me if my set up (before the call to the gpufit function) is correct?

void vector_function_example() {
	/*
	Generating sample test for fitting a vector function
	*/


	// number of fits, fit points and parameters
	size_t const n_components = 2;
	size_t const n_fits = n_components * 1;;
	size_t const size_x = 50;
	size_t const n_points_per_fit = size_x * size_x* size_x;
	size_t const n_model_parameters = 3;

	// true parameters (amplitude, center x position, center y position, width, offset)
	std::vector< float > true_parameters{ 1.f, 1.5f, 2.5f};

	std::cout << "generate example data" << std::endl;

	// initialize random number generator
	std::mt19937 rng;
	rng.seed(0);
	std::uniform_real_distribution< float> uniform_dist(0, 1);

	// initial parameters (randomized)
	std::vector< float > initial_parameters(n_model_parameters);
	//for (size_t i = 0; i < n_fits; i++)
	//{
		for (size_t j = 0; j < n_model_parameters; j++)
		{
			initial_parameters[j] = true_parameters[j] + uniform_dist(rng); //This is probably wrong...
		}
	//}

	// generate x, y and z values
	std::vector< float > x(n_points_per_fit);
	std::vector< float > y(n_points_per_fit);
	std::vector< float > z(n_points_per_fit);
	for (size_t i = 0; i < size_x; i++)
	{
		for (size_t j = 0; j < size_x; j++) 
		{
			for(size_t k = 0; k < size_x; k++) {
				x[i * size_x*size_x + j*size_x + k] = static_cast<float>(k);
				y[i * size_x*size_x + j*size_x + k] = static_cast<float>(j);
				z[i * size_x*size_x + j*size_x + k] = static_cast<float>(i);
			}
		}
	}

	// generate test data with Poisson noise
	std::vector< float > temp(2*n_points_per_fit); //two components per fit
	generate_vector_function(x, y, z, true_parameters, temp);

	std::vector< float > data(n_fits * n_points_per_fit);
	for (size_t i = 0; i < n_fits; i++)
	{
		for (size_t j = 0; j < n_points_per_fit; j++)
		{
			std::poisson_distribution< int > poisson_dist(temp[j]+0.1);
			data[i * n_points_per_fit + j] = static_cast<float>(poisson_dist(rng));
		}
	}

	// tolerance
	float const tolerance = 0.001f;

	// maximum number of iterations
	int const max_number_iterations = 20;

	// estimator ID
	int const estimator_id = LSE;

	// model ID
	int const model_id = VECTOR_FUNCTION;

	// parameters to fit (all of them)
	std::vector< int > parameters_to_fit(n_model_parameters, 1);

	// output parameters
	std::vector< float > output_parameters(n_model_parameters);
	std::vector< int > output_states(n_fits);
	std::vector< float > output_chi_square(n_fits);
	std::vector< int > output_number_iterations(n_fits);

	// call to gpufit (C interface)
	std::chrono::high_resolution_clock::time_point time_0 = std::chrono::high_resolution_clock::now();
	int const status = gpufit
	(
		n_fits,
		n_points_per_fit,
		data.data(),
		0,
		model_id,
		initial_parameters.data(),
		tolerance,
		max_number_iterations,
		parameters_to_fit.data(),
		estimator_id,
		0,
		0,
		output_parameters.data(),
		output_states.data(),
		output_chi_square.data(),
		output_number_iterations.data()
	);

	std::chrono::high_resolution_clock::time_point time_1 = std::chrono::high_resolution_clock::now();

	// check status
	if (status != ReturnState::OK)
	{
		throw std::runtime_error(gpufit_get_last_error());
	}

	// print execution time
	std::cout << "execution time "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(time_1 - time_0).count() << " ms" << std::endl;

	// get fit states
	std::vector< int > output_states_histogram(5, 0);
	for (std::vector< int >::iterator it = output_states.begin(); it != output_states.end(); ++it)
	{
		output_states_histogram[*it]++;
	}

	std::cout << "ratio converged              " << (float)output_states_histogram[0] / n_fits << "\n";
	std::cout << "ratio max iteration exceeded " << (float)output_states_histogram[1] / n_fits << "\n";
	std::cout << "ratio singular hessian       " << (float)output_states_histogram[2] / n_fits << "\n";
	std::cout << "ratio neg curvature MLE      " << (float)output_states_histogram[3] / n_fits << "\n";
	std::cout << "ratio gpu not read           " << (float)output_states_histogram[4] / n_fits << "\n";

	// compute mean of fitted parameters for converged fits
	std::vector< float > output_parameters_mean(n_model_parameters, 0);
	for (size_t i = 0; i != n_fits; i++)
	{
		if (output_states[i] == FitState::CONVERGED)
		{
			for (size_t j = 0; j < n_model_parameters; j++)
			{
				output_parameters_mean[j] += output_parameters[i * n_model_parameters + j];
			}
		}
	}
	// normalize
	for (size_t j = 0; j < n_model_parameters; j++)
	{
		output_parameters_mean[j] /= output_states_histogram[0];
	}

	// compute std of fitted parameters for converged fits
	std::vector< float > output_parameters_std(n_model_parameters, 0);
	for (size_t i = 0; i != n_fits; i++)
	{
		if (output_states[i] == FitState::CONVERGED)
		{
			for (size_t j = 0; j < n_model_parameters; j++)
			{
				output_parameters_std[j]
					+= (output_parameters[i * n_model_parameters + j] - output_parameters_mean[j])
					*  (output_parameters[i * n_model_parameters + j] - output_parameters_mean[j]);
			}
		}
	}
	// normalize and take square root
	for (size_t j = 0; j < n_model_parameters; j++)
	{
		output_parameters_std[j] = sqrt(output_parameters_std[j] / output_states_histogram[0]);
	}

	// print true value, fitted mean and std for every parameter
	for (size_t j = 0; j < n_model_parameters; j++)
	{
		std::cout
			<< "parameter " << j
			<< " true " << true_parameters[j]
			<< " fitted mean " << output_parameters_mean[j]
			<< " std " << output_parameters_std[j] << std::endl;
	}

	// compute mean chi-square for those converged
	float  output_chi_square_mean = 0;
	for (size_t i = 0; i != n_fits; i++)
	{
		if (output_states[i] == FitState::CONVERGED)
		{
			output_chi_square_mean += output_chi_square[i];
		}
	}
	output_chi_square_mean /= static_cast<float>(output_states_histogram[0]);
	std::cout << "mean chi square " << output_chi_square_mean << std::endl;

	// compute mean number of iterations for those converged
	float  output_number_iterations_mean = 0;
	for (size_t i = 0; i != n_fits; i++)
	{
		if (output_states[i] == FitState::CONVERGED)
		{
			output_number_iterations_mean += static_cast<float>(output_number_iterations[i]);
		}
	}
	// normalize
	output_number_iterations_mean /= static_cast<float>(output_states_histogram[0]);
	std::cout << "mean number of iterations " << output_number_iterations_mean << std::endl;
}

Mar 01 '18 17:03 lukkio88

Note that my modified example may not be the fastest solution - it is meant for illustrating the point. Also, the modulo operator is supposedly slow in CUDA.

Mar 01 '18 18:03 superchromix