
Thermalized LB GPU loses mass

Open RudolfWeeber opened this issue 7 years ago • 10 comments

Density vs time (columns: time, mean node density, variance across nodes):

1010.0 0.999999602675 0.000322047627491
2010.0 0.99999923265 0.000309079308688
3010.00000002 0.999998866677 0.000354693064748
4010.00000004 0.999998498201 0.000269992543406
[...]
243010.000074 0.999889917135 0.000340308489128
244010.000075 0.999889334023 0.00027192450585
245010.000076 0.999888790131 0.000263980095021
246010.000077 0.999888206244 0.00035571453468
247010.000078 0.999887622297 0.000339340864591
248010.000079 0.999887059152 0.000274875165011
249010.00008 0.99988651675 0.000318489413579
250010.000081 0.999885953069 0.000288159735111

Since single-precision accuracy is ~1e-7, this may or may not be due to numerics. Over the run shown above (about 2.49e7 steps at time step 0.01), the mean density drops by roughly 1.1e-4, i.e. a relative loss of about 4.6e-12 per step, far below the single-precision epsilon, so a small systematic rounding bias per step would be enough to explain it.

The script was (with time step 0.01):

import sys

import numpy as np

import espressomd
from espressomd.lb import LBFluid, LBFluidGPU

l = 10

s = espressomd.System(box_l=[l, l, l])
s.cell_system.skin = 0.01
s.set_random_state_PRNG()
# s.part.add(pos=np.random.random((100, 3)))

dt = float(sys.argv[1])
if sys.argv[2] == "cpu":
    lb_class = LBFluid
elif sys.argv[2] == "gpu":
    lb_class = LBFluidGPU
else:
    raise ValueError("second argument must be 'cpu' or 'gpu'")

s.time_step = dt
lbf = lb_class(agrid=1, visc=1, dens=1, tau=dt, fric=1)
s.thermostat.set_lb(kT=1)
s.actors.add(lbf)

# equilibrate
s.integrator.run(1000)

# gather handles to all LB nodes once
nodes = []
for i in range(10):
    for j in range(10):
        for k in range(10):
            nodes.append(lbf[i, j, k])

for i in range(250):
    s.integrator.run(100000)
    v = [n.density[0] for n in nodes]
    print(s.time, np.mean(v), np.var(v))

RudolfWeeber avatar Oct 22 '18 13:10 RudolfWeeber

This has to be checked after #2321 is closed.

KaiSzuttor avatar Oct 26 '18 07:10 KaiSzuttor

This is due to floating-point precision. We will make the GPU implementation use double.

KaiSzuttor avatar Oct 29 '18 16:10 KaiSzuttor

@KaiSzuttor, please provide before/after numbers on different GPU generations. Here's a list of GPUs we have and some candidate computers: GTX 680 (wisent, impala), Titan Black (bee), GTX 780 Ti (gaur, tahr), GTX 980 (chiru, gnu), GTX 1070 (wapiti), GTX 1080 Ti (cipXX), Vega 64 (lama, requires recompiling).

The 600 and 700 series have 1/24th double-precision performance, the 900 and 1000 series have 1/32nd, and the Titan Black has 1/3rd. If we are memory-bound rather than compute-bound, we'd expect only a factor of 2 slowdown, since doubles need twice the memory bandwidth.

mkuron avatar Oct 29 '18 16:10 mkuron

Does the temperature fluctuation width issue also go away with double precision?

The places where a lot of summation takes place are mainly the mode<->population transformations. Maybe Kahan summation in a few strategic places would allow us to stay with single precision.
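
For illustration, a minimal sketch of compensated (Kahan) summation applied to a single-precision reduction; the function and the 19-entry test data below are placeholders, not ESPResSo's actual mode transformation code:

// kahan_sum.cpp -- sketch of Kahan (compensated) summation in single
// precision; the weights and populations are placeholders only.
#include <cstdio>

float kahan_sum(const float *values, const float *weights, int n) {
    float sum = 0.0f;
    float c = 0.0f;               // running compensation for lost low-order bits
    for (int i = 0; i < n; ++i) {
        float term = weights[i] * values[i] - c;
        float t = sum + term;     // low-order bits of term are lost here ...
        c = (t - sum) - term;     // ... and recovered into c for the next round
        sum = t;
    }
    return sum;
}

int main() {
    float pop[19], w[19];
    for (int i = 0; i < 19; ++i) {
        pop[i] = 1.0f / (i + 1);
        w[i] = 1.0f;
    }
    std::printf("%.9g\n", kahan_sum(pop, w, 19));
}

One caveat: aggressive compiler flags like -ffast-math can optimize the compensation term away, so such code has to be compiled with strict floating-point semantics.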

RudolfWeeber avatar Oct 29 '18 17:10 RudolfWeeber

GPU                 | Timing float [s/step] | Timing double [s/step] | Ratio float/double
GeForce GTX 780 Ti  | 6.8347299099e-05      | 0.000103395255566      | 0.6610293550207637
GeForce GTX 980     | 5.94442596436e-05     | 9.95463490486e-05      | 0.5971515802611549
GeForce GTX 1070    | 7.30076231956e-05     | 0.000126591158867      | 0.5767197634418034
GeForce GTX 1080 Ti | 3.57395396233e-05     | 5.6533952713e-05       | 0.6321783266196719

That does not seem too bad. On a short timescale (~a month), going to double on the GPU is the only fix I can personally contribute.

Script used:

from __future__ import print_function

import numpy as np
import time

import espressomd
import espressomd.lb

N_SAMPLES = 50
N_STEPS = 10000

AGRID = 0.5
DENS = 1.0
VISC = 1.0
FRIC = 1.0
TAU = 0.01
LB_PARAMETERS = {'agrid': AGRID,
                 'dens': DENS,
                 'visc': VISC,
                 'fric': FRIC,
                 'tau': TAU}

system = espressomd.System(box_l = [10.0]*3)
system.time_step = TAU
system.cell_system.skin = 0.4 * AGRID

lbf = espressomd.lb.LBFluidGPU(**LB_PARAMETERS)
system.actors.add(lbf)
system.thermostat.set_lb(kT=1.0)

timings = np.zeros(N_SAMPLES)
for i in range(N_SAMPLES):
    start = time.time()
    system.integrator.run(N_STEPS)
    stop = time.time()
    timings[i] = (stop - start) / float(N_STEPS)

print(np.mean(timings), np.var(timings))

KaiSzuttor avatar Oct 29 '18 19:10 KaiSzuttor

~~I haven't checked the temperature fluctuations yet.~~ Using doubles, the lb.py test case passes with the same precision requirements for CPU and GPU LB.

KaiSzuttor avatar Oct 29 '18 19:10 KaiSzuttor

We would need to essentially replace float with a custom Kahan-summed datatype so we can keep the correction term for the entirety of the simulation. If we only did Kahan summation inside the mode transformation and then converted back to a plain float, we would only gain 1-2 orders of magnitude in precision, which is not enough.
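
Such a datatype could look roughly like the following sketch (hypothetical, not part of the ESPResSo code base); each stored value carries its own correction term, so the compensation survives across time steps instead of being discarded after one transformation:

// compensated_float.cpp -- sketch of a float that keeps a Kahan-style
// correction term across additions; the type and test values are
// illustrative only.
#include <cstdio>

struct CompensatedFloat {
    float value = 0.0f;
    float correction = 0.0f;   // low-order bits lost in previous additions

    void add(float x) {
        float term = x - correction;
        float t = value + term;
        correction = (t - value) - term;
        value = t;
    }
};

int main() {
    CompensatedFloat rho;
    // add a small contribution many times; a plain float accumulator picks up
    // rounding error here, while the compensated one stays close to the exact 0.1
    for (int i = 0; i < 10000000; ++i) {
        rho.add(1e-8f);
    }
    std::printf("compensated sum: %.9g (exact: 0.1)\n", rho.value);
}

Note that such a type occupies as much memory as a double (two floats) and costs several extra operations per addition, which is part of why simply switching to double may end up being the simpler option.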

mkuron avatar Oct 30 '18 08:10 mkuron

Update: after eliminating the leftover floats, the situation looks quite different:

GPU                 | Timing float [s/step] | Timing double [s/step] | Ratio double/float
GeForce GTX 680     | 1.78741159439e-05     | 9.24656462669e-05      | 5.173159140128341
GeForce GTX 780 Ti  | 1.5184492588e-05      | 6.42595734596e-05      | 4.23192102648
GeForce GTX 980     | 1.41606082916e-05     | 6.81742568016e-05      | 4.81435934091
GeForce GTX 1070    | 8.32125854492e-06     | 5.98724694252e-05      | 7.19512187994
GeForce GTX 1080 Ti | 1.16311063766e-05     | 3.62498044968e-05      | 3.1166256522

KaiSzuttor avatar Nov 13 '18 09:11 KaiSzuttor

So your tests two weeks ago used double only in some places, which was sufficient to make the test case pass but only cost a factor of two in run time? There is no technical reason why we have to use double everywhere, so let's just use it in the critical places.
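
As a sketch of that mixed-precision idea (assumed layout, not the actual ESPResSo kernels): keep the lattice data stored as float for memory bandwidth, and promote to double only inside the critical accumulation:

// mixed_precision.cpp -- sketch: populations stay in single precision in
// memory, but the sensitive reduction runs in double; the 19-entry array
// is a placeholder, not the real lattice layout.
#include <cstdio>

float reduce_in_double(const float *populations, int n) {
    double acc = 0.0;                  // critical accumulation in double
    for (int i = 0; i < n; ++i) {
        acc += static_cast<double>(populations[i]);
    }
    return static_cast<float>(acc);    // stored result stays single precision
}

int main() {
    float pop[19];
    for (int i = 0; i < 19; ++i) {
        pop[i] = 1.0f / (i + 1);
    }
    std::printf("%.9g\n", reduce_in_double(pop, 19));
}

Whether this is enough depends on where exactly the drift originates; converting back to float after the accumulation still rounds the stored value once per step.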

mkuron avatar Nov 13 '18 10:11 mkuron