[WIP] Mixed-precision radiation module
## Intent

### Current problem

Quantities such as position `r` and observer direction `n` are up-cast by default to 64 bit precision. The detector direction `n` is computed using 32 bit precision in `radiationObserver.param` but converted to a `picongpu::float_64` vector at the end.
### Proposal

The intent of the changes is to introduce mixed (single/double) precision computations into the radiation module of PIConGPU. Ideally, using only sufficient precision for each separate computation will reduce register and memory consumption and allow more thread blocks to run in parallel. Formal analysis suggests that most of the quantities can be kept in 32 bit precision (`float_X`) and at best only the accumulators of inner products and the results have to remain in 64 bit precision.
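As a minimal sketch of that pattern (illustrative only, not the actual plugin code): per-term arithmetic stays in 32 bit, while only the accumulator of an inner product is widened to 64 bit.

```cpp
#include <cstddef>

// Minimal sketch of the intended mixed-precision pattern (illustrative,
// not the actual plugin code): terms are computed in 32 bit, only the
// accumulator of the inner product remains in 64 bit.
double innerProductMixed(float const* a, float const* b, std::size_t n)
{
    double acc = 0.0; // 64 bit accumulator
    for(std::size_t i = 0u; i < n; ++i)
    {
        float const term = a[i] * b[i];   // 32 bit multiply
        acc += static_cast<double>(term); // widen only when accumulating
    }
    return acc;
}
```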
## Implementation

The basic data type of the `Amplitude` class is templated such that it can be switched as desired. Arguments to the functions that compute the separate parts of the amplitude are adapted to take positions and observation directions in 32 bit precision.

The only quantity that has to be stored in double precision is the retarded time `t_ret_s`, even if it is computed from input quantities of lower precision.
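A rough sketch of these two points (hypothetical names, heavily simplified compared to the real `Amplitude` class and PMacc vector types; the retarded-time formula is the usual $t - \vec n \cdot \vec r / c$ assumed here for illustration):

```cpp
// Heavily simplified sketch (hypothetical names, not the real classes):
// the component type of the amplitude is a template parameter ...
template<typename T_Float>
struct AmplitudeSketch
{
    T_Float real[3]; // real part of the complex amplitude vector
    T_Float imag[3]; // imaginary part
};

// ... while the retarded time t_ret = t - (n . r) / c stays in 64 bit,
// even when computed from 32 bit inputs.
double retardedTimeSketch(float t, float nDotR)
{
    constexpr double c = 299792458.0; // speed of light [m/s]
    return static_cast<double>(t) - static_cast<double>(nDotR) / c;
}
```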
Changing `r` to a `vector_32` type in `Retarded_time_1` will introduce large errors.

Usage of different precisions for accumulated amplitude values, frequencies and observation directions requires different `radSplashType` definitions in `writeHDF5file`; otherwise the values are computed properly but stored with a type mismatch, which leads to confusing outputs.
## Tests

### Correctness

A first check of correctness was performed using the single-particle configuration of the `Bunch` example of PIConGPU. It was run for 7000 time steps, with radiation enabled from step 2800 to 3000.

The quality of the computed amplitudes was tested by computing (for the arbitrarily chosen steps 2900 and 3000) the maximum relative errors of the complex amplitudes in the 2- and inf-norms, as well as the maximum deviation w.r.t. the maximum achieved norm of the reference amplitude. The latter test does not emphasize deviations from the reference solution in regions of the spectrum where the intensity is negligible anyway.
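In symbols, my reading of these two metrics (with $\vec A(\omega)$ the complex amplitude per frequency bin and $p \in \{2, \infty\}$):

```latex
\epsilon_p = \max_\omega \frac{\lVert \vec A_\mathrm{tgt}(\omega) - \vec A_\mathrm{ref}(\omega) \rVert_p}{\lVert \vec A_\mathrm{ref}(\omega) \rVert_p},
\qquad
\tilde{\epsilon}_p = \frac{\max_\omega \lVert \vec A_\mathrm{tgt}(\omega) - \vec A_\mathrm{ref}(\omega) \rVert_p}{\max_\omega \lVert \vec A_\mathrm{ref}(\omega) \rVert_p}
```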
The following errors were observed for commit 50c0f50a at time step 3000:

```
|A_tgt - A_ref|_2                            3.8007873728778394e-22
|A_tgt - A_ref|_inf                          3.799745971153532e-22
|A_ref|_2                                    2.507966173029181e-17
|A_ref|_inf                                  2.507966173029181e-17
|A_tgt|_2                                    2.5079660813987757e-17
|A_tgt|_inf                                  2.5079660813987754e-17
|A_tgt - A_ref|_2 / |A_ref|_2                0.006734829014505873
|A_tgt - A_ref|_inf / |A_ref|_inf            0.006737196156335815
max(|A_tgt - A_ref|_2) / max(|A_ref|_2)      1.5154858999900937e-05
max(|A_tgt - A_ref|_inf) / max(|A_ref|_inf)  1.5150706624420332e-05
```
Hence the proper relative error is ~6.7e-3, whereas the error in units of the maximum amplitude is ~1.5e-5.
### Performance

Performance of the implementation was tested using a thermal plasma (E = 0.511 keV) of density 2e23 on a grid of `1024 x 384 x 1024` cells, distributed to 4 GPUs of one V100 node according to a `2 1 2` domain distribution, with 2 particles per cell. The simulation runs for 27 time steps with radiation enabled from the start. Only step 0 is written to disk.
The reference code required `1h 41min 48sec 233msec = 6108 sec` to complete the calculation. The proposed changes reduce this to `1h 21min 10sec 663msec = 4870 sec`, i.e. a run-time reduction of ~20% and a speed-up of ~1.25x.
This comes with two observed caveats:
- Increasing the number of jobs for the reduction in the radiation kernel (`--e_radiation.numJobs 4`) will result in the mixed-precision version taking longer to complete the same calculation than the reference version:

  ```
  REF   (4 jobs): 55 % = 15 | time elapsed: 40min 38sec 168msec | avg time per step: 2min 52sec 536msec
  MPRAD (4 jobs): 55 % = 15 | time elapsed: 42min 17sec 728msec | avg time per step: 2min 59sec 703msec
  ```
- Usage of a 32 bit `amplitude_add` in `Radiation.kernel`, while working, will slow the computation down significantly:

  ```
  REF   (2 jobs):                   55 % = 15 | time elapsed: 52min 31sec 526msec | avg time per step: 3min 47sec 368msec
  MPRAD (2 jobs, 64 bit increment): 55 % = 15 | time elapsed: 42min 16sec 466msec | avg time per step: 2min 59sec 564msec
  MPRAD (2 jobs, 32 bit increment): 55 % = 15 | time elapsed: 1h 5min 57sec 342msec | avg time per step: 4min 49sec 640msec
  ```
This is likely due to the need to cast an `Amplitude` object from 32 bit to 64 bit immediately after it is created, using `precisionCast`. The latter produces a temporary `Amplitude` object with 64 bit precision of the components and constitutes an overhead.
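A self-contained sketch of the suspected pattern (the stand-in types below are hypothetical; the real code uses PIConGPU's `Amplitude` and PMacc's `precisionCast`):

```cpp
// Hypothetical stand-ins for Amplitude and precisionCast, to illustrate
// where the temporary comes from.
template<typename T>
struct Amp3
{
    T x, y, z;
};

template<typename T_To, typename T_From>
Amp3<T_To> castAmp(Amp3<T_From> const& a) // stand-in for precisionCast
{
    return {static_cast<T_To>(a.x), static_cast<T_To>(a.y), static_cast<T_To>(a.z)};
}

void accumulate(Amp3<double>& sum, Amp3<float> const& increment)
{
    // Every 32 bit increment is widened into a temporary 64 bit object
    // right after it is created -- this temporary is the suspected overhead.
    Amp3<double> const tmp = castAmp<double>(increment);
    sum.x += tmp.x;
    sum.y += tmp.y;
    sum.z += tmp.z;
}
```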
@PrometheusPi I would be grateful for input regarding further testing.
@psychocoderHPC Is it possible to parametrize the data type (here `float_64`) of the `typename PICToSplash<float_64>::type radSplashType;` in `Radiation.hpp` to always be determined by the type of the data stored in `detectorPositions` and `detectorFrequencies`? Ideally without templating the entire class. Currently I just introduced 3 different `typename` statements, which does not benefit maintainability.
### Further test for correctness

The version with 32 bit `amplitude_add` was also tested on an example of a tilted, underdense (n_b = 0.9 n_0) Gaussian bunch impacting a homogeneous plasma of density 5.2e24.
Both simulations started from the same checkpoint at time step 12461. Radiation computation starts at time step 17000, while the bunch is still in vacuo. It impacts the background plasma around time step 19961.
In the following images the reference spectrum is plotted on the left, the new spectrum is shown on the right, and their difference in between. The top color bar pertains to the intensity spectra and the bottom one to their difference (symmetric log). The black arrow in the rightmost figure signifies the detector/observer direction and the magenta arrow the direction of propagation of the bunch.

As can be seen, the difference appears significant when the detector is collinear with the bunch propagation. The differences decay as soon as observer and bunch propagation are no longer collinear.
If one considers the intensity integrated over the frequencies for each time step, the following are obtained for the observer directions shown above. Note: in this case the vertical axis is linear.
@Anton-Le I beautified your nice PR description, you can render LaTeX in GitHub: https://gist.github.com/a-rodin/fef3f543412d6e1ec5b6cf55bf197d7b
> `typename PICToSplash<float_64>::type radSplashType`

Please link the code you mentioned.
> @psychocoderHPC Is it possible to parametrize the data type (here `float_64`) of the `typename PICToSplash<float_64>::type radSplashType;` in `Radiation.hpp` to always be determined by the type of the data stored in `detectorPositions` and `detectorFrequencies`? Ideally without templating the entire class. Currently I just introduced 3 different `typename` statements, which does not benefit maintainability.
I do not understand the problem. `typename PICToSplash<MyRadType>::type radSplashType;` will be possible. You need to have `MyRadType` somewhere in a header if you need to include it within many files.
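A minimal sketch of that suggestion (the header name is hypothetical):

```cpp
// In a shared header, e.g. radiationTypes.hpp (hypothetical name):
namespace picongpu
{
    // single source of truth for the precision of the radiation output
    using MyRadType = float_64;
} // namespace picongpu

// In Radiation.hpp all splash typedefs then derive from the one alias:
typename PICToSplash<MyRadType>::type radSplashType;
```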
I think the proposed changes in this PR seem very reasonable and do not require changing that many places in the code, so nothing against it. The following is a bit theoretical, and I point out its limitations below, but just to share and float an idea. We may have a similar pattern in SAXS as well.
I think it is possible to use only single precision for summation, if the accumulator variables are updated with compensated summation like `2sum`. I now found my own very old materials on that in Russian. Basically, the idea is very simple; to copy the relevant code fragments from that document:
```cpp
// 2Sum(a, b): a + b = s + t
// https://en.wikipedia.org/wiki/2Sum
void twoSum(float a, float b, float& s, float& t)
{
    s = a + b;
    float a_ = s - b;
    float b_ = s - a;
    float da = a - a_;
    float db = b - b_;
    t = da + db;
}

float sumFloatCompensated(const float* x, int n)
{
    float sum = x[0];
    float c = 0;
    for(int i = 1; i < n; ++i)
    {
        float y = x[i] + c;
        twoSum(sum, y, sum, c);
    }
    return sum;
}
```
Accuracy-wise it should be mostly the same as a "stupid" summation with `double`, as long as each separate value is well-represented as a `float`. For an absurdly large number of terms it will actually be better, as there is no running accumulated error in this scheme (it is compensated each time), but I don't think we are anywhere near that. So practically I think it would be the same as adding everything up in `double`. Performance-wise it is worse than just using `double`, also worse register-wise, and a couple more flops (the latter very negligible in the context of this plugin). But it may save GPU device memory for the results of summations, as these are only `float`. There are two obstacles to using it right now in the radiation plugin (see the sketch after this list):

- The structure of computations there is such that each thread iterates first through all supercells, and then through its own frequencies. So with this loop order there needs to be an additional accumulator for each frequency handled by a thread, and that is not a fixed amount. If the loops were inverted (first frequencies, then supercells) that would not be a problem, as then only a single register `float` would suffice.
- It needs to be checked how it works on GPUs and with optimization flags.
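To make the first obstacle concrete, here is a schematic of the inverted loop order (the kernel structure and `amplitudeContribution` are made-up placeholders; `twoSum` is the routine from the snippet above):

```cpp
// Schematic of the inverted loop order (illustrative only; the kernel
// structure and amplitudeContribution() are made-up placeholders).
// Relies on twoSum() from the snippet above.
void accumulatePerFrequency(
    int threadFirstFreq,
    int threadLastFreq,
    int numSuperCells,
    float (*amplitudeContribution)(int superCell, int freq),
    float* result)
{
    for(int f = threadFirstFreq; f < threadLastFreq; ++f)
    {
        float sum = 0.0f; // single register accumulator per frequency
        float c = 0.0f;   // compensation term
        for(int sc = 0; sc < numSuperCells; ++sc)
        {
            float y = amplitudeContribution(sc, f) + c;
            twoSum(sum, y, sum, c); // compensated update
        }
        result[f] = sum;
    }
}
```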
> I think it is possible to use only single precision for summation, if the accumulator variables are updated with compensated summation like `2sum`. I now found my own very old materials on that in Russian.
Then there's two of us already. I originally thought we could use Kahan summation or similar. Although I was thinking more of using it in the core functions (i.e., vector/inner product) where the overhead due to the need for additional variables (hence registers) should be quite noticeable.
> There are two obstacles to using it right now in the radiation plugin: The structure of computations there is such that each thread iterates first through all supercells, and then through its own frequencies. So with this loop order there needs to be an additional accumulator for each frequency handled by a thread, and that is not a fixed amount. If the loops were inverted (first frequencies, then supercells) that would not be a problem, as then only a single register `float` would suffice.
The ideal point of application for this would be the accumulation of the amplitude increments, since each increment is contributed by a macro-particle and there are a lot of them. Reordering the loops, though, would be extremely detrimental to performance, since instead of fetching just one scalar (the frequency) in the innermost loop we would have to fetch vectors (momenta) and compute their changes (for the acceleration terms).
> It needs to be checked how it works on GPUs and with optimization flags.
I tried this approach once on the CPU for a functional renormalization group code from my alma mater. At least on CPU with `-O3` optimization but without `--fast-math` the results were more precise.
@Anton-Le Could you please rebase this PR against the dev branch.
> I think it is possible to use only single precision for summation, if the accumulator variables are updated with compensated summation like `2sum`. [...]
Let me link here also some older discussions: https://github.com/ComputationalRadiationPhysics/picongpu/issues/523#issuecomment-70630415

We should add somewhere a misc function/type which helps with the creation of stable sums.
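Such a helper could look roughly like the following sketch (not an existing PMacc type; it folds the `twoSum` update quoted above into a small struct, and it must not be compiled with fast-math, which would optimize the compensation away):

```cpp
// Sketch of a reusable stable-sum helper (not an existing PMacc type).
// Uses the 2Sum update from the snippet above; do not compile with
// fast-math, or the compensation terms get optimized away.
template<typename T>
struct CompensatedSum
{
    T sum = T(0);
    T comp = T(0); // running compensation (lost low-order bits)

    void add(T value)
    {
        T const y = value + comp; // fold in previous compensation
        T const s = sum + y;      // naive sum
        // rounding error of s = sum + y, computed via 2Sum
        comp = (sum - (s - y)) + (y - (s - sum));
        sum = s;
    }

    operator T() const
    {
        return sum;
    }
};
```

Usage would then reduce to something like `CompensatedSum<float> acc; for(...) acc.add(term); float result = acc;`.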
> @Anton-Le Could you please rebase this PR against the dev branch.
I planned to do so once I am more sure that the changes don't break things beyond repair.
@Anton-Le What is the state of this PR?
@Anton-Le What is the status of this pull request (draft)? Were you using this code in your Juwels Booster runs?
@Anton-Le What is the state of this PR? Should we close it, or could we work on bringing it into the mainline? The next release is coming soon!
I will close this PR; after 2 years without any response I assume there is no need for the changes.