scream Re: parallel_scan in Homme's VR, with implications for min(qv) sign on the GPU

I've been looking into why on the GPU one sees small negative values in tracers essentially always, like this:

nstep=          18  time=   3.0000000000000000       [h]
 qv(  1)=   0.1926266828667951E-07  0.1826945526532773E-01  0.2901083593460567E+04
 qv(  2)=   0.4000079130015803E-03  0.1762019303117301E+01  0.2258265292179765E+05
 qv(  3)=  -0.4213884406191319E-19  0.1035167517576999E-03  0.6355898909435548E+00
 qv(  4)=  -0.3574117940616535E-18  0.9124051363824183E-04  0.1926603290487100E+00
 qv(  5)=  -0.1376520031359005E-17  0.2415166293711146E-03  0.6341793319363298E+01
 qv(  6)=  -0.3643574640815085E-18  0.2751749738338576E-03  0.2866435749020836E+01
 qv(  7)=  -0.1319412340373359E-10  0.6437255445306087E+05  0.1824205976484502E+09
 qv(  8)=  -0.5813437579250631E-08  0.6256773700560137E+07  0.3923053281685970E+11
 qv(  9)=  -0.4036988977580842E-06  0.5680618683170891E+09  0.3540619969682540E+13
 qv( 10)=  -0.3295857185017364E-21  0.1016170136320485E-06  0.3024998950507378E-03

On the CPU in E3SMv2, 0 is the lower bound. In SCREAMv1 on the CPU and with SL transport, I'm seeing subnormal numbers sneak in, permitting output of this sort:

 nstep=         144  time=   24.000000000000000       [h]
 qv(  1)=   0.1865825070438324E-06  0.1866372634564622E-01  0.2980669934269725E+04
 qv(  2)=   0.4786438684226476E-03  0.1989536332392150E+01  0.2562414453312505E+05
 qv(  3)=  -0.4940656458412465-323  0.1327043325262754E-03  0.1490110174540343E+01
 qv(  4)=  -0.4940656458412465-323  0.1281681149669015E-03  0.3428111885826745E+00
 qv(  5)=  -0.4940656458412465-323  0.1569366317677900E-03  0.4911452672364300E+01
 qv(  6)=  -0.4940656458412465-323  0.3410646197664531E-03  0.4147104385971611E+01
 qv(  7)=  -0.4940656458412465-323  0.2027537387814544E+06  0.8064498735123779E+09
 qv(  8)=  -0.4940656458412465-323  0.1657162239574792E+07  0.1615718196295061E+11
 qv(  9)=  -0.4940656458412465-323  0.5587459604028097E+09  0.4225859830242391E+13

(With Eulerian transport, more general negative values occur, masking this VR issue.) This second type of negative q value is unrelated to the first. Both cases are due to vertical remap (VR); I've confirmed in multiple ways that SL transport maintains q >= 0 always.

It turns out that the cause is due to a chunk of code I've contemplated getting rid of for a while. For simplicity I'll quote the F90 version rather than the C++ one: https://github.com/E3SM-Project/scream/blob/9915af095b7375eff2ae680e988398cf4751b3c4/components/homme/src/share/vertremap_base.F90#L608-L613 masso is then used here: https://github.com/E3SM-Project/scream/blob/9915af095b7375eff2ae680e988398cf4751b3c4/components/homme/src/share/vertremap_base.F90#L638-L644 The idea is to use accumulated mass on the old grid up to the rightmost old grid point possible, then evaluate one integral for the remaining bit of overlap. This is clever in the sense that it avoids one integral per new cell, plus a little bookkeeping. But it's perhaps not so clever in that one is accumulating large numbers and then subtracting them, leading to both truncation round-off error (in the accumulation in the first loop) and then cancellation error (in the subtraction in the second loop). For this latter set of reasons I've thought about rewriting the code to just do the straightforward integrals per new cell.

It turns out this scan has a second issue when done on GPU: a parallel_scan on finite-precision numbers in general can't assure a monotone output array for a nonnegative input array, because the order of summation differs for each output index. (In fact, I think one could implement a decent parallel_scan to avoid this, but my understanding is that efficient implementations are faster but then have this problem.) The subtraction in the second loop can then lead to a negative mass, as seen in the first output block above. This is at the machine-eps level, of course, but it would be nice to avoid it, especially since if we don't, then SCREAM will clip the values, which I don't think any of us wants. As a separate point, a scan is expensive on the GPU, so it would be nice to avoid it.

I'm interested in what others think about my reworking both the F90 and C++ VR code to avoid the scan. I'm thinking about making it an E3SMv3 non-BFB change. Alternatively, I'd make it optional and we'd use it just in SCREAM for now.

Aug 04 '22 21:08 ambrad

@mt, @oksanaguba, @bartgol I'm interested in your thoughts on the above.

Aug 04 '22 21:08 ambrad

If I follow, you would make all levels indep of each other. For each level, you would have to do 2 parabula integrations, plus potentially add mass from old cells fully contained in the new one, right? So you end up doing integration more times, but you gain in ||Ism (and roundoff precision), correct?

I think this is reasonable.

Aug 04 '22 21:08 bartgol

@bartgol, yes, exactly.

Aug 04 '22 21:08 ambrad

Btw, while at it, we might also change the impl of the parabola integration, from containing difference of squares/cubes to use factorized expressions. Not a big deal, probably, but also easy to do, so why not.

Aug 04 '22 21:08 bartgol