
[WIP] Mixed-precision radiation module

Open Anton-Le opened this issue 4 years ago • 13 comments

Intent

Current problem

Quantities such as the position r and the observer direction n are up-cast by default to 64-bit precision. The detector direction n is computed in 32-bit precision in radiationObserver.param but converted to a picongpu::float_64 vector at the end.

Proposal

The intent of the changes is to introduce mixed (single/double) precision computations into the radiation module of PIConGPU.

Ideally, using only the precision sufficient for each separate computation will reduce register and memory consumption and allow more thread blocks to run in parallel.

Formal analysis suggests that most quantities can be kept in 32-bit precision (float_X); at best, only the accumulators of inner products and the results have to remain in 64-bit precision.
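As a sketch of that split (a hypothetical helper, not code from this PR): the operands stay in 32-bit precision and only the inner-product accumulator is 64-bit.

```cpp
#include <cstddef>

// Hypothetical illustration of the precision split: float operands,
// double accumulator. Only the running sum is kept in 64-bit precision.
double dotMixed(const float* a, const float* b, std::size_t n)
{
    double acc = 0.0; // the accumulator is the only 64-bit quantity
    for(std::size_t i = 0; i < n; ++i)
        acc += static_cast<double>(a[i]) * static_cast<double>(b[i]);
    return acc;
}
```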

Implementation

The basic data type of the Amplitude class is templated such that it can be switched as desired. Arguments to the functions that compute the separate parts of the amplitude are adapted to take positions and observation direction of 32 bit precision.

The only quantity that has to be stored in double precision is the retarded time t_ret_s, even if it is computed from input quantities of lower precision.

Changing r to a vector_32 type in Retarded_time_1 will introduce large errors.

Using different precisions for the accumulated amplitude values, frequencies, and observation directions requires different radSplashType definitions in writeHDF5file; otherwise the values are computed properly but stored with a type mismatch, which leads to confusing output.

Tests

Correctness

A first check of correctness was performed using the single-particle configuration of the Bunch example of PIConGPU. It was run for 7000 time steps, with radiation enabled from step 2800 to 3000.

The quality of the computed amplitudes was tested by computing (for the arbitrarily chosen steps 2900 and 3000) the maximum relative errors of the complex amplitudes in the 2- and inf-norms,

as well as the maximum deviation with respect to the maximum achieved norm of the reference amplitude.

The latter test does not emphasize deviations from the reference solution in the regions of the spectrum where the intensity is negligible anyway.
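These metrics can be sketched as follows (a hedged reconstruction of the analysis; the actual evaluation was done on the plugin's HDF5 output, and the struct/function names are made up):

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <vector>

// Hypothetical names; the actual analysis was performed on the HDF5 output.
struct ErrorNorms
{
    double rel2;      // |A_tgt - A_ref|_2 / |A_ref|_2
    double maxRel;    // max over bins of the per-bin relative deviation
    double scaledMax; // max deviation / max norm of the reference amplitude
};

ErrorNorms amplitudeErrors(const std::vector<std::complex<double>>& ref,
                           const std::vector<std::complex<double>>& tgt)
{
    double diff2 = 0.0, ref2 = 0.0, maxRel = 0.0, maxDiff = 0.0, maxRef = 0.0;
    for(std::size_t i = 0; i < ref.size(); ++i)
    {
        const double d = std::abs(tgt[i] - ref[i]);
        const double r = std::abs(ref[i]);
        diff2 += d * d;
        ref2 += r * r;
        if(r > 0.0)
            maxRel = std::max(maxRel, d / r);
        maxDiff = std::max(maxDiff, d);
        maxRef = std::max(maxRef, r);
    }
    // scaledMax de-emphasizes bins where the reference intensity is negligible
    return { std::sqrt(diff2) / std::sqrt(ref2), maxRel, maxDiff / maxRef };
}
```

In a bin where the reference amplitude is tiny, maxRel can be large while scaledMax stays small, which is exactly why the second normalization is reported as well.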

The following errors were observed for the commit 50c0f50a at time step 3000:

    |A_tgt - A_ref|_2  3.8007873728778394e-22
    |A_tgt - A_ref|_inf  3.799745971153532e-22
    |A_ref|_2  2.507966173029181e-17
    |A_ref|_inf  2.507966173029181e-17
    |A_tgt|_2  2.5079660813987757e-17
    |A_tgt|_inf  2.5079660813987754e-17
    |A_tgt - A_ref|_2 / |A_ref|_2 0.006734829014505873
    |A_tgt - A_ref|_inf / |A_ref|_inf 0.006737196156335815
    max(|A_tgt - A_ref|_2) / max(|A_ref|_2) 1.5154858999900937e-05
    max(|A_tgt - A_ref|_inf) / max(|A_ref|_inf) 1.5150706624420332e-05

Hence the proper relative error is ~6.7e-3, whereas the error in units of the maximum amplitude is ~1.5e-5.

Performance

Performance of the implementation was tested using a thermal plasma (E = 0.511 keV) of density 2e23 on a grid of 1024 x 384 x 1024 cells, distributed to the 4 GPUs of one V100 node (2 x 1 x 2 domain decomposition) with 2 particles per cell. The simulation runs for 27 time steps with radiation enabled from the start. Only step 0 is written to disk.

The reference code required 1h 41min 48sec 233msec = 6108 sec to complete the calculation. The proposed changes reduce this to 1h 21min 10sec 663msec = 4870 sec calculation time.

This amounts to a run-time reduction of ~20%, i.e. a speed-up of ~1.25x.

This comes with two observed caveats:

  1. Increasing the number of jobs for the reduction in the radiation kernel (--e_radiation.numJobs 4) results in the mixed-precision version taking longer than the reference version for the same calculation:

         REF (4 jobs):   55 % = 15 | time elapsed: 40min 38sec 168msec | avg time per step: 2min 52sec 536msec
         MPRAD (4 jobs): 55 % = 15 | time elapsed: 42min 17sec 728msec | avg time per step: 2min 59sec 703msec

  2. Using a 32-bit amplitude_add in Radiation.kernel, while functional, slows the computation down significantly:

         REF (2 jobs):                     55 % = 15 | time elapsed: 52min 31sec 526msec | avg time per step: 3min 47sec 368msec
         MPRAD (2 jobs, 64-bit increment): 55 % = 15 | time elapsed: 42min 16sec 466msec | avg time per step: 2min 59sec 564msec
         MPRAD (2 jobs, 32-bit increment): 55 % = 15 | time elapsed: 1h 5min 57sec 342msec | avg time per step: 4min 49sec 640msec

     This is likely due to the need to cast an Amplitude object from 32-bit to 64-bit immediately after it is created, using precisionCast. The latter produces a temporary Amplitude object with 64-bit components and constitutes an overhead.
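The casting overhead in point 2 can be illustrated with a stripped-down mock (Amplitude and precisionCast here are heavily simplified stand-ins for the PMacc/PIConGPU originals, kept only to show the temporary object):

```cpp
// Stripped-down stand-in for the templated Amplitude class; the real class
// holds complex vector components, a single value suffices here.
template<typename T>
struct Amplitude
{
    T value;
};

// Simplified stand-in for precisionCast: it constructs a NEW Amplitude
// with components converted to the target precision.
template<typename TDst, typename TSrc>
Amplitude<TDst> precisionCast(const Amplitude<TSrc>& src)
{
    return Amplitude<TDst>{ static_cast<TDst>(src.value) };
}

// With a 32-bit amplitude_add, every per-macroparticle increment is computed
// in float and must be cast up before it can be added to the 64-bit
// accumulator -- one temporary Amplitude<double> per increment.
void accumulate(Amplitude<double>& acc, const Amplitude<float>& increment)
{
    acc.value += precisionCast<double>(increment).value;
}
```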

Anton-Le avatar Feb 11 '21 12:02 Anton-Le

@PrometheusPi I would be grateful for input regarding further testing. @psychocoderHPC Is it possible to parametrize the data type (here float_64) of the typename PICToSplash<float_64>::type radSplashType; in Radiation.hpp to always be determined by the type of the data stored in detectorPositions and detectorFrequencies? Ideally without templating the entire class. Currently I just introduced 3 different typename statements, which does not benefit maintainability.

Anton-Le avatar Feb 11 '21 12:02 Anton-Le

Further test for correctness

The version with 32 bit amplitude_add was also tested on an example of a tilted, underdense (n_b = 0.9 n_0) Gaussian bunch impacting a homogeneous plasma of density 5.2e24.

Both simulations started from the same checkpoint at time step 12461. Radiation computation starts at time step 17000, while the bunch is still in vacuo. It impacts the background plasma around time step 19961.

In the following images the reference spectrum is plotted on the left, the new spectrum is shown on the right, and their difference in between. The top color bar pertains to the intensity spectra and the bottom to their difference (symm-log). The black arrow in the rightmost figure signifies the detector/observer direction and the magenta arrow the direction of propagation of the bunch. CMP_CumulativeIntensity_detectorNr_0

CMP_CumulativeIntensity_detectorNr_3

As can be seen, the difference appears significant for the detector that is collinear with the bunch propagation. The differences decay as soon as observer and bunch propagation are not collinear.

If one considers the intensity integrated over the frequencies for each time step, the following is obtained for the observer directions shown above: CMP_IntegratedIntensityVsTimeStep_0 CMP_IntegratedIntensityVsTimeStep_3

Note: In this case the vertical axis is linear.

Anton-Le avatar Feb 11 '21 12:02 Anton-Le

@Anton-Le I beautified your nice PR description; you can render LaTeX in GitHub: https://gist.github.com/a-rodin/fef3f543412d6e1ec5b6cf55bf197d7b

psychocoderHPC avatar Feb 11 '21 13:02 psychocoderHPC

typename PICToSplash<float_64>::type radSplashType

Please link the code you mentioned typename PICToSplash<float_64>::type radSplashType

psychocoderHPC avatar Feb 11 '21 15:02 psychocoderHPC

@psychocoderHPC Is it possible to parametrize the data type (here float_64) of the typename PICToSplash<float_64>::type radSplashType; in Radiation.hpp to always be determined by the type of the data stored in detectorPositions and detectorFrequencies? Ideally without templating the entire class. Currently I just introduced 3 different typename statements, which does not benefit maintainability.

I do not understand the problem; typename PICToSplash<MyRadType>::type radSplashType; will be possible. You need to have MyRadType somewhere in a header if you need to include it within many files.
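That suggestion could look like the following standalone mock (SplashFloat32/SplashFloat64 and the trait specializations are stand-ins for the real libSplash mapping; radDetectorType is a hypothetical alias name):

```cpp
#include <type_traits>

// Minimal mock of the PICToSplash trait (the real one maps PIConGPU types
// to libSplash column types); only the mapping pattern matters here.
struct SplashFloat32 {};
struct SplashFloat64 {};

template<typename T>
struct PICToSplash;
template<> struct PICToSplash<float>  { using type = SplashFloat32; };
template<> struct PICToSplash<double> { using type = SplashFloat64; };

// One alias in a shared header determines the detector precision ...
using radDetectorType = float; // hypothetical name; switch to double as needed

// ... so every radSplashType-style typedef stays consistent automatically.
using radSplashType = typename PICToSplash<radDetectorType>::type;

static_assert(std::is_same<radSplashType, SplashFloat32>::value,
              "the Splash type follows the detector precision");
```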

psychocoderHPC avatar Feb 11 '21 15:02 psychocoderHPC

I think the proposed changes in this PR seem very reasonable and do not require changing that many places in the code, so nothing against it. The following is a bit theoretical, and I point out its limitations below, but just to share and float an idea. We may have a similar pattern in SAXS as well.

I think it is possible to use only single precision for summation, if the accumulator variables are updated with compensated summation like 2Sum. I now found my own very old materials on that in Russian. Basically, the idea is very simple; to copy the relevant code fragments from that document:

// 2Sum(a, b): a + b = s + t
// https://en.wikipedia.org/wiki/2Sum
void twoSum(float a, float b, float& s, float& t){
    s = a + b;
    float a_ = s - b;
    float b_ = s - a;
    float da = a - a_;
    float db = b - b_;
    t = da + db;
}

float sumFloatCompensated(const float* x, int n){
    float sum = x[0];
    float c = 0;
    for(int i = 1; i < n; ++i) {
        float y = x[i] + c;
        twoSum(sum, y, sum, c);
    }
    return sum;
}
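A quick standalone check of the scheme (my addition; the two functions are repeated so the snippet compiles on its own): after a single large term, ten thousand tiny terms are rounded away completely by a naive float loop, while the compensated loop recovers them.

```cpp
// twoSum and sumFloatCompensated as above, repeated for self-containment.
void twoSum(float a, float b, float& s, float& t)
{
    s = a + b;
    float a_ = s - b;
    float b_ = s - a;
    t = (a - a_) + (b - b_);
}

float sumFloatCompensated(const float* x, int n)
{
    float sum = x[0];
    float c = 0.0f;
    for(int i = 1; i < n; ++i)
    {
        float y = x[i] + c;
        twoSum(sum, y, sum, c);
    }
    return sum;
}

float sumFloatNaive(const float* x, int n)
{
    float sum = 0.0f;
    for(int i = 0; i < n; ++i)
        sum += x[i]; // 1.0f + 1e-8f rounds back to 1.0f every time
    return sum;
}
```

With x = {1.0f, then 10000 copies of 1e-8f}, the naive sum stays at exactly 1.0f, while the compensated sum lands within a float ulp of the true value of about 1.0001.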

Accuracy-wise it should be mostly the same as a "stupid" summation with double, as long as each separate value is well represented as float. For an absurdly large number of terms it will actually be better, as there is no running accumulated error in this scheme (it is compensated each time), but I don't think we are nearly close to that. So practically I think it would be the same as adding everything up in double. Performance-wise it is worse than just using double: also worse register-wise, and a couple more flops (the latter very negligible in the context of this plugin). But it may save GPU device memory for the results of summations, as these are only float. There are two obstacles to using it right now in the radiation plugin:

  • The structure of computations there is such that each thread iterates first through all supercells, and then through its own frequencies. So with this loop order there needs to be an additional accumulator for each frequency handled by a thread, and that is not a fixed amount. If the loops were inverted (first frequencies, then supercells), that would not be a problem, as then only a single float register would suffice.
  • It needs to be checked how it works on GPUs and with optimization flags.

sbastrakov avatar Feb 17 '21 16:02 sbastrakov

I think it is possible to use only single precision for summation, if the accumulator variables are updated with compensated summation like 2Sum. I now found my own very old materials on that in Russian.

Then there are two of us already. I originally thought we could use Kahan summation or similar, although I was thinking more of using it in the core functions (i.e., the vector/inner products), where the overhead due to the need for additional variables (hence registers) should be quite noticeable.
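For reference, a sketch of the Kahan variant mentioned here (my addition, not code from the PR; note that aggressive fast-math flags may optimize the compensation away):

```cpp
// Classic Kahan summation: cheaper than 2Sum (fewer operations, one
// correction variable), but assumes the running sum dominates each term.
float sumFloatKahan(const float* x, int n)
{
    float sum = 0.0f;
    float c = 0.0f; // running compensation for lost low-order bits
    for(int i = 0; i < n; ++i)
    {
        float y = x[i] - c;
        float t = sum + y;
        c = (t - sum) - y; // recovers what was lost in (sum + y)
        sum = t;
    }
    return sum;
}
```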

Accuracy-wise it should be mostly the same as a "stupid" summation with double, as long as each separate value is well represented as float. For an absurdly large number of terms it will actually be better, as there is no running accumulated error in this scheme (it is compensated each time), but I don't think we are nearly close to that. So practically I think it would be the same as adding everything up in double. Performance-wise it is worse than just using double: also worse register-wise, and a couple more flops (the latter very negligible in the context of this plugin). But it may save GPU device memory for the results of summations, as these are only float. There are two obstacles to using it right now in the radiation plugin:

* The structure of computations there is such that each thread iterates first through all supercells, and then through its own frequencies. So with this loop order there needs to be an additional accumulator for each frequency handled by a thread, and that is not a fixed amount. If the loops were inverted (first frequencies, then supercells), that would not be a problem, as then only a single register `float` would suffice.

The ideal point of application for this would be the accumulation of the amplitude increments, since each increment is contributed by a macro-particle and there are a lot of them. Reordering the loops, though, would be extremely detrimental to performance, since instead of fetching just 1 scalar (the frequency) in the innermost loop we would have to fetch vectors (momenta) and compute their changes (for the acceleration terms).

* It needs to be checked how it works on GPUs and with optimization flags.

I tried this approach once on the CPU for a functional renormalization group code from my alma mater. At least on the CPU with -O3 optimization but without --fast-math, the results were more precise.

Anton-Le avatar Feb 18 '21 08:02 Anton-Le

@Anton-Le Could you please rebase this PR against the dev branch.

psychocoderHPC avatar Mar 01 '21 07:03 psychocoderHPC

I think the proposed changes in this PR seem very reasonable and do not require changing that many places in the code, so nothing against it. The following is a bit theoretical, and I point out its limitations below, but just to share and float an idea. We may have a similar pattern in SAXS as well.

I think it is possible to use only single precision for summation, if the accumulator variables are updated with compensated summation like 2Sum. I now found my own very old materials on that in Russian. Basically, the idea is very simple; to copy the relevant code fragments from that document:

// 2Sum(a, b): a + b = s + t
// https://en.wikipedia.org/wiki/2Sum
void twoSum(float a, float b, float& s, float& t){
    s = a + b;
    float a_ = s - b;
    float b_ = s - a;
    float da = a - a_;
    float db = b - b_;
    t = da + db;
}

float sumFloatCompensated(const float* x, int n){
    float sum = x[0];
    float c = 0;
    for(int i = 1; i < n; ++i) {
        float y = x[i] + c;
        twoSum(sum, y, sum, c);
    }
    return sum;
}

Accuracy-wise it should be mostly the same as a "stupid" summation with double, as long as each separate value is well represented as float. For an absurdly large number of terms it will actually be better, as there is no running accumulated error in this scheme (it is compensated each time), but I don't think we are nearly close to that. So practically I think it would be the same as adding everything up in double. Performance-wise it is worse than just using double: also worse register-wise, and a couple more flops (the latter very negligible in the context of this plugin). But it may save GPU device memory for the results of summations, as these are only float. There are two obstacles to using it right now in the radiation plugin:

  • The structure of computations there is such that each thread iterates first through all supercells, and then through its own frequencies. So with this loop order there needs to be an additional accumulator for each frequency handled by a thread, and that is not a fixed amount. If the loops were inverted (first frequencies, then supercells), that would not be a problem, as then only a single float register would suffice.
  • It needs to be checked how it works on GPUs and with optimization flags.

Let me link here also some older discussions: https://github.com/ComputationalRadiationPhysics/picongpu/issues/523#issuecomment-70630415

We should add a misc function/type somewhere that helps with the creation of stable sums.

psychocoderHPC avatar Mar 01 '21 07:03 psychocoderHPC

@Anton-Le Could you please rebase this PR against the dev branch.

I planned to do so once I am more sure that the changes don't break things beyond repair.

Anton-Le avatar Mar 02 '21 22:03 Anton-Le

@Anton-Le What is the state of this PR?

psychocoderHPC avatar Apr 16 '21 10:04 psychocoderHPC

@Anton-Le What is the status of this pull request (draft)? Were you using this code in your Juwels Booster runs?

PrometheusPi avatar Apr 29 '21 20:04 PrometheusPi

@Anton-Le What is the state of this PR? Should we close it or could we work on bringing it into the mainline. The next release is coming soon!

psychocoderHPC avatar Aug 31 '21 16:08 psychocoderHPC

I will close this PR, after 2 years without any response I assume there is no need for the changes.

psychocoderHPC avatar Mar 02 '23 16:03 psychocoderHPC