McCode
McCode copied to clipboard
Interpolation-lib issue on GPU with kdtree + kdtree vs. regular differences
Something seems off between interp GPU/CPU regular/kdtree - needs further investigation. A first step is introducing both kdtree and regular runs on both CPU and GPU.
CPU:
GPU:
- so irrespective of simulated case something is off... :-)
* %Example: -n1e4 MF=flipfield.dat Detector: pol_6_I=1.0
* %Example: -n1e4 MF=constfield.dat Detector: polx_6_I=0.05
*
* %Parameters
* MF: [string] Field definition file
* zdepth: [m] Z-dimension of field
* interpol_method: [string] Choice of interpolation method "kdtree" (default on CPU) / "regular" (default on GPU)
edit: below is for flipfield.dat
, still looking at constfield.dat
...
I looked at the test, the test field has Z steps at -0.5, -0.25, -0.025, 0.025, 0.25, 0.5
meters, which is not constant_step
. The code should have rejected regular
interpolation and used kdtree
at this point, which it didn't.
https://github.com/McStasMcXtrace/McCode/blob/be6ee82680373ea8f69bdade8913bc1b99d8be6d/common/lib/share/interpolation-lib.c#L411-L421
I think there are two problems:
- Every table line gets checked against
interpolator->step[dim]
instead of every unique value. This should have caused the code to fall back tokdtree
even on regular data, however: - Line 420, the fallback only triggers if
interpolator->method
isNULL
or0
, which isn't the case if 'regular' is provided. If the data was forced to run onregular
mode, the code would have assumed every z step to be 0.25 m wide, which causes the field to be no longer symmetric about z=0 and thus the wavelength dependent precession, instead of both halves cancelling out.
constfield.dat
only has 12 data points, for regular
interpolation there should be 3 x 3 x 3 = 27 points. Unpopulated grid cells remain at their as-allocated state and are invalid, causing invalid results.
The two methods agree after filling out the missing values as such:
constfield.dat
# 12
-0.1 0.0 -0.5 0.0 0.0 -6.8e-05
0.1 0.0 -0.5 0.0 0.0 -6.8e-05
0.0 -0.1 -0.5 0.0 0.0 -6.8e-05
0.0 0.1 -0.5 0.0 0.0 -6.8e-05
-0.1 0.0 0.0 0.0 0.0 -6.8e-05
0.1 0.0 0.0 0.0 0.0 -6.8e-05
0.0 -0.1 0.0 0.0 0.0 -6.8e-05
0.0 0.1 0.0 0.0 0.0 -6.8e-05
-0.1 0.0 0.5 0.0 0.0 -6.8e-05
0.1 0.0 0.5 0.0 0.0 -6.8e-05
0.0 -0.1 0.5 0.0 0.0 -6.8e-05
0.0 0.1 0.5 0.0 0.0 -6.8e-05
0.0 0.0 -0.5 0.0 0.0 -6.8e-05
-0.1 -0.1 -0.5 0.0 0.0 -6.8e-05
-0.1 0.1 -0.5 0.0 0.0 -6.8e-05
0.1 -0.1 -0.5 0.0 0.0 -6.8e-05
0.1 0.1 -0.5 0.0 0.0 -6.8e-05
0.0 0.0 0.5 0.0 0.0 -6.8e-05
-0.1 -0.1 0.5 0.0 0.0 -6.8e-05
-0.1 0.1 0.5 0.0 0.0 -6.8e-05
0.1 -0.1 0.5 0.0 0.0 -6.8e-05
0.1 0.1 0.5 0.0 0.0 -6.8e-05
0.0 0.0 -0.0 0.0 0.0 -6.8e-05
-0.1 -0.1 -0.0 0.0 0.0 -6.8e-05
-0.1 0.1 -0.0 0.0 0.0 -6.8e-05
0.1 -0.1 -0.0 0.0 0.0 -6.8e-05
0.1 0.1 -0.0 0.0 0.0 -6.8e-05
It all makes sense and I believe you are absolutely right in your analysis, and a supplemental simulation dataset point in the same direction:
In http://tmp.mcstas.org/oneoff_2024-02-09 CPU and GPU seem to nicely agree. Tests run are now these, explicitly setting method to 'kdtree' and 'regular':
* %Example: -n1e4 interpol_method=kdtree MF=flipfield.dat Detector: pol_6_I=1.0
* %Example: -n1e4 interpol_method=kdtree MF=constfield.dat Detector: polx_6_I=0.05
* %Example: -n1e4 interpol_method=regular MF=flipfield.dat Detector: pol_6_I=1.0
* %Example: -n1e4 interpol_method=regular MF=constfield.dat Detector: polx_6_I=0.05
(Full dataset can be downloaded here http://tmp.mcstas.org/oneoff_2024-02-09.tgz ~ 2Mb for 4 tests on each "system")
As I remember, the "default" interpolator is in fact set to "something" on both CPU (kdtree) and GPU (regular) which goes against the logic of the above code that only switches if "NULL" was set.
Edit:
Yup, indeed the component has :
if (!strcmp(interpol_method,"default")) {
#ifdef OPENACC
sprintf(interpol_method,"regular");
#else
sprintf(interpol_method,"kdtree");
#endif
}
I should probably change that to "assume no default" / let things work adaptively, but again I seem to remember some specific issues in the GPU port on this point... How about you @ebknudsen ?
CPU run overview-plots, same order as in the instrument file:
GPU run overview-plots, same order as in the instrument file:
- So now they at least numerically agree, but still on the GPU it seems that the flipping is not functional... So still digging to do on my behalf.
I am now collecting old datasets from historical tests to check if the flipping actually ever worked on GPU...
- Which it did, e.g. in December 2023
(Unfortunately we've lost the nightly test system since ~ January 1st, but I believe it is safe to assume that perhaps some minor detail in the recent edits is not thread-safe... :-) I will get around to digging... But likely not until Feb 19th or so, still recovering from flu + kid is off on winter break next week.)
Edit: diff between generated code from then and now - differences are extremely small... Can't immediately spot what the issue is, but I am sure that once my brain is less "cloudy" I will nail it. :-)
4,5c4,5
< * Instrument: /home/nexmap/pkwi/TESTS/2023-12-20/McStas_8GPU_5e6/Test_Pol_Tabled/Test_Pol_Tabled.instr (Test_Pol_Tabled)
< * Date: Wed Dec 20 02:58:58 2023
---
> * Instrument: /home/nexmap/pkwi/WORK/oneoff_2024-02-09/20240209_0911_58/3.x-dev/Test_Pol_Tabled/Test_Pol_Tabled.instr (Test_Pol_Tabled)
> * Date: Fri Feb 9 09:12:01 2024
13c13
< #define MCCODE_STRING "McStas 3.x-dev - Dec. 19, 2023"
---
> #define MCCODE_STRING "McStas 3.x-dev - Feb. 08, 2024"
53d52
< long long _loopid; /* inner-loop event ID */
324c323
< # define MCCODE_STRING "McStas 3.x-dev - Dec. 19, 2023"
---
> # define MCCODE_STRING "McStas 3.x-dev - Feb. 08, 2024"
328c327
< # define MCCODE_DATE "Dec. 19, 2023"
---
> # define MCCODE_DATE "Feb. 08, 2024"
6238c6237
< char instrument_source[] = "/home/nexmap/pkwi/TESTS/2023-12-20/McStas_8GPU_5e6/Test_Pol_Tabled/Test_Pol_Tabled.instr";
---
> char instrument_source[] = "/home/nexmap/pkwi/WORK/oneoff_2024-02-09/20240209_0911_58/3.x-dev/Test_Pol_Tabled/Test_Pol_Tabled.instr";
6240c6239
< char instrument_code[] = "Instrument Test_Pol_Tabled source code /home/nexmap/pkwi/TESTS/2023-12-20/McStas_8GPU_5e6/Test_Pol_Tabled/Test_Pol_Tabled.instr is not embedded in this executable.\n Use --source option when running McStas.\n";
---
> char instrument_code[] = "Instrument Test_Pol_Tabled source code /home/nexmap/pkwi/WORK/oneoff_2024-02-09/20240209_0911_58/3.x-dev/Test_Pol_Tabled/Test_Pol_Tabled.instr is not embedded in this executable.\n Use --source option when running McStas.\n";
8147c8146
< int treek = tree->point->v[k];
---
> double treek = tree->point->v[k];
8183c8182,8193
< return ( *(double*)a - *(double*)b );
---
> if (*(double*)a > *(double*)b)
> {
> return 1;
> }
> else if (*(double*)a < *(double*)b)
> {
> return -1;
> }
> else
> {
> return 0;
> }
8307a8318
> interpolator->bin[dim] = 1;
8344a8356
> free(vector);
8383c8395
< indices[axis] = floor((x - interpolator->min[axis])/interpolator->step[axis]);
---
> indices[axis] = round((x - interpolator->min[axis])/interpolator->step[axis]);
8484c8496
< indices[axis] = (space[axis]-interpolator->min[axis])/interpolator->step[axis];
---
> indices[axis] = round((space[axis]-interpolator->min[axis])/interpolator->step[axis]);
10115c10127
< fprintf(stdout, "%d ", (int)(ncount*100.0/mcget_ncount())); fflush(stdout);
---
> fprintf(stdout, "%llu ", (unsigned long long)(ncount*100.0/mcget_ncount())); fflush(stdout);
10688c10700
< #define ABSORB0 do { DEBUG_ABSORB(); MAGNET_OFF; ABSORBED++; } while(0)
---
> #define ABSORB0 do { DEBUG_ABSORB(); MAGNET_OFF; ABSORBED++; return(ABSORBED);} while(0)
11001c11013
< particleN._uid = particleN._loopid = pidx;
---
> particleN._uid = pidx;
11074c11086
< _particle->_uid = _particle->_loopid = pidx;
---
> _particle->_uid = pidx;
@willend The new usage of round()
could change results. In the constant field case, x and y data are given at -0.1, 0.0 and 0.1, and the polarization detector is from -0.05 to 0.05 in both x and y axes. If round()
is used, all counted neutrons would fall inside the [-0.05, 0.05] bin, which is not populated in the original 12-line field file. But if the numbers are not rounded, half of the neutrons would fall into populated bins.
@sq-meng something is still iffy somewhere: Using kdtree does not yield the same results as regular in the case of the constant field - neither on CPU or GPU.
With the new file in place things look similar on GPU and CPU with "regular" interpolation - but with also with a different spin oscillation periodicity than earlier.
Regular interpolation, CPU:
Regular interpolation, GPU:
kdtree, CPU:
kdtree, GPU:
Edit: AND in fact, reverting changes to the qsort comparator function seems to partially rectify this?
Patched, CPU, regular:
Patched, CPU, kdtree:
Patched, GPU, regular:
GPU + kdtree simply doesn't work, and I suspect it might have never really...
So in conclusion, at the moment:
- kdtree does not function on GPU - less of a problem π‘
- regular provides the same output on CPU and GPU, which is good! π’
- kdtree and regular does not produce the same output on CPU - which is bad. π΄
Not judging what the "output should exactly be", given the simple constant-field datafile it should at least give the same irrespective of interpolation method - at least if we for now exclude the somewhat "experimental" GPU platform... π§ (BUT: We should probably go back with pen+paper and look at a "hand-held" propagation of neutrons of the relevant wavelengths through the relevant field to decide from a physics point of view...)
I think this indicates that either the comparator-function needs a revert / another overhaul - or at least something further needs doing on the kdtree side, also on CPU... Agreed @sq-meng ?
@willend I agree - something is still iffy. And to make things worse, I was not able to get the longer period oscillations on my machine with either kdtree or regularπ. This definitely still needs more work.
@willend Could you check if your constant field file has 27 lines? My original reply containing the field file missed last 5 lines and I edited to add these soon after. But if you got the unedited version, the z=0.0 plane would have remained unpopulated, explaining the longer period.
Right! I was missing the last block! I will rerun and see where this takes us on my side...
And you were right this updates the situation to:
- kdtree does not function on GPU - less of a problem π‘
- regular provides the same output on CPU and GPU, which is good! π’
- kdtree and regular does not produce the same output on CPU - which is good! π’
CPU regular:
CPU kdtree:
GPU regular:
Flipping the "CPU default" to "NULL" should make it adaptive - and keeping the GPU side to "regular" unless explicitly chosen otherwise makes sense for now.
One of the typical things that could give this type of difference is if some variable initially in the COMPONENT
scope (i.e. the Table object) gets modified during TRACE
- thus leading to memory/data overlap between the GPU threads... But I didn't spot that up to now...
Actual a VERY obvious candidate is the "R_SWAP"... macro.
I am afraid every neutron will either need its own copy of the interpolator - or a function with one or more #pragma acc atomic
claused need writing... (Then the global interpolator gets updated via "locks" - only access by one thread at a time...)
Can't comment on that as I have zero knowledge on GPU programming... But I reckon the tons of branches in kd-tree won't perform on GPU anyway. Glad we at least have the result mismatch situation closed!
I think my choice will in fact be to effectively disable "kdtree" on GPU. Exit with a warning if anything other than "default" / "regular" is chosen.
The other stuff will either become
a) very memory demanding, i.e. an "interpolator" copy pr. GPU thread or
b) Slow due to use of #pragma acc atomic
s (i.e. mutex'es that lock memory on acces to allow one single thread at a time...)
Heh, interestingly the size of flipfield.dat is not what it claims to be...
wc -l flipfield.dat
151 flipfield.dat
(mcstas-3.x-dev/miniconda3) data $ head -2 flipfield.dat
# 300
-0.1 -0.1 -0.5 0.0 0.0 1.0e-05
But I am not sure the # 300 header has any practical meaning in this particular case....
Ha! The remaining issue with the flipfield.dat file is apparently one of "irregular" array dimensions! Apparently there is an underlying assumption not only that the bins along each of the dimensions are uniform length, but also that Nx == Ny == Nz:
The file is 5x5x6 naturally which (forcing regular
- this occurs both on CPU and GPU) gives this output in the flip-case:
I have now written up a short Matlab (yes I know, better do it in Python ;-)) script to resample a field file: remesh.m.txt
Using this I was able to resample to e.g. 6x6x6: remeshed_6x6x6_flipfield.dat.txt
Which now performs correctly in both the CPU and GPU cases with regular
interpolation!
CPU:
GPU:
- which then both agree with
kdtree
on CPU:
I will simply replace the file with the remeshed one (and as I assumed, the header is not used for anything...)
I believe all that remains is either documenting these 'fragile' aspects or even exit fatal in the case the file properties are not "actually, fully regular" when requesting this method (meaning when forcing regular on CPU or "always" on GPU...
@sq-meng thanks for the ping-pong! :) I believe we have arrived at a good solution, so marking "ready for release". (This is my hack to help when I write a CHANGELOG on releasing.)
Please let me know if you find anything further that is off - and have a nice weekend!
@farhi this stuff is likely also important in several places in McXtrace, do try it out when you feel like it. In essence:
- @sq-meng provided bug fixes that could lead to erroneous interpolations both in the
kdtree
andregular
cases. - We found that the field-datafiles in McStas were actually not functioning for the
regular
case, they have been edited. - I have finally concluded that we will (very likely) never get
kdtree
to work on GPU and thus disabled it, so:
- On GPU the automatic
NULL
mode is mapped todefault
which meansregular
(NULL
remains the CPUdefault
) - One can still explicitly select
kdtree
on GPU, but this will lead to an error. - In
regular
mode (which may be selected explicitly or comes automatically on the GPU) there is a check for equal number of bins in the axis dimensions. Exit in error if this is not the case, with instructions to resample or usekdtree
https://github.com/McStasMcXtrace/McCode/blob/main/mcstas-comps/examples/Tests_polarization/Test_Pol_Tabled/ now has a couple of remeshing-tools and the input files alongside the instrument.
@willend Looks mostly good to me! I don't believe that regular
interpolation requires Nx == Ny == Nz though: Nsteps and step sizes can safely vary between axes, and the only requirement is that step size within each axis is consistent. The original flipfield.dat
has z-axis steps at -0.5, -0.25, -0.025, 0.025, 0.25, 0.5
, the middle 2 numbers break the 0.25 step size.
Attached is a 5x5x10 flip field which works fine here:
5-5-10_flipfield.zip
Ah, of course! I have simply overlooked the non-consistent axes bins...
(.... But of course resampling to a regular 6x6x6 worked :-) )
I will remodel the part now insisting on Nx == Ny == Nz to instead become a warning/error if regular
and inconsistent axis.(I feel leaving a solution in place that interpolates wrongly is bad customer service. )
Sure. Silently producing slightly wrong results is the worst behavior for scientific software so a warning against unexpected data is necessary.
Ended up with an actual exit(-1), which I think is for the better (here run with the original flipfield.dat):
(mcstas-3.x-dev/miniconda3) Test_Pol_Tabled-hack $ mcrun Test_Pol_Tabled.instr MF=flipfield.dat interpol_method=regular -n1e4
[Test_Pol_Tabled] Initialize
INFO: Set_pol(setpol): Setting polarization to (0.000000, 1.000000, 0.000000)
Opening input file 'flipfield.dat' (Table_Read_Offset)
interpolator_load: Axis 0: step=0.05, unique values=5, from file 'flipfield.dat'.
interpolator_load: Axis 1: step=0.05, unique values=5, from file 'flipfield.dat'.
!! interpolation-lib ERROR: !!
You are running the 'regular' interpolation scheme with a file of
non-consistent axis 'binning' along one or more axes.
This combination is not possible.
Please either resample the file to a regular grid or run with 'kdtree'
(NB: kdtree is available on CPU only)
Using default
reverts automatically to kdtree
and lets the simulation run.
BTW I also spotted yet another annoyance in the regular -> kdtree switch case logic: https://github.com/McStasMcXtrace/McCode/blob/195d18aa0bed5b06958c5923a58149aafda44ee7/common/lib/share/interpolation-lib.c#L434-L462
An if (this_step
was actually missing - so even on actual regular grids, the non-unique axis-positions from the "meshing" would trigger a switch to kdtree
:-)
All in all a good solution in place for all cases by now I believe. I thank you for contributing to this little exercise of bringing a useful but admittedly unpolished gem into much better order!