Thermalized LB GPU loses mass
Density vs. time (columns: time, mean node density, variance across nodes):
1010.0 0.999999602675 0.000322047627491
2010.0 0.99999923265 0.000309079308688
3010.00000002 0.999998866677 0.000354693064748
4010.00000004 0.999998498201 0.000269992543406
[...]
243010.000074 0.999889917135 0.000340308489128
244010.000075 0.999889334023 0.00027192450585
245010.000076 0.999888790131 0.000263980095021
246010.000077 0.999888206244 0.00035571453468
247010.000078 0.999887622297 0.000339340864591
248010.000079 0.999887059152 0.000274875165011
249010.00008 0.99988651675 0.000318489413579
250010.000081 0.999885953069 0.000288159735111
Since single-precision accuracy is ~1e-7, this may or may not be due to numerics.
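For intuition, here is a rough plausibility check. This is not the LB update rule, just an illustration of how float32 round-off random-walks when a conserved quantity is repeatedly perturbed and restored; the perturbation scale and step count are made up:

```python
import numpy as np

# Add and subtract the same random perturbation each step. In exact
# arithmetic the total is conserved; in float32 every step leaves an
# O(machine epsilon) rounding residual that random-walks over time.
rng = np.random.RandomState(42)
total32 = np.float32(1.0)
total64 = 1.0
for _ in range(100000):
    delta = rng.normal(scale=1e-3)
    d32 = np.float32(delta)
    total32 = (total32 + d32) - d32
    total64 = (total64 + delta) - delta
print("float32 drift:", abs(float(total32) - 1.0))
print("float64 drift:", abs(total64 - 1.0))
```

The float32 drift comes out orders of magnitude larger than the float64 one, in the same ballpark as the mass loss observed above.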
The script was (with time step 0.01):
import espressomd
from espressomd.lb import LBFluidGPU, LBFluid
import numpy as np
import sys

l = 10
s = espressomd.System(box_l=[l, l, l])
s.cell_system.skin = 0.01
s.set_random_state_PRNG()
#s.part.add(pos=np.random.random((100, 3)))
dt = float(sys.argv[1])
if sys.argv[2] == "cpu":
    lb_class = LBFluid
elif sys.argv[2] == "gpu":
    lb_class = LBFluidGPU
else:
    raise Exception("unknown LB implementation: " + repr(sys.argv[2]))
s.time_step = dt
lbf = lb_class(agrid=1, visc=1, dens=1, tau=dt, fric=1)
s.thermostat.set_lb(kT=1)
s.actors.add(lbf)
s.integrator.run(1000)

# collect handles to all LB nodes
nodes = []
for i in range(10):
    for j in range(10):
        for k in range(10):
            nodes.append(lbf[i, j, k])

# print time, mean density and variance across nodes during a long run
for i in range(250):
    s.integrator.run(100000)
    v = [n.density[0] for n in nodes]
    print(s.time, np.mean(v), np.var(v))
This has to be checked after #2321 is closed.
This is due to floating-point precision. We will make the GPU implementation use double.
@KaiSzuttor, please provide before/after numbers on different GPU generations. Here's a list of GPUs we have and some candidate computers: GTX 680 (wisent, impala), Titan Black (bee), GTX 780 Ti (gaur, tahr), GTX 980 (chiru, gnu), GTX 1070 (wapiti), GTX 1080 Ti (cipXX), Vega 64 (lama, requires recompiling).
The 600 and 700 series have 1/24th of their single-precision throughput in double precision, the 900 and 1000 series have 1/32nd, and the Titan Black has 1/3rd. If we are memory-bound rather than compute-bound, we would expect a ratio of 1/2, since double precision doubles the bytes moved per lattice node.
Does the temperature fluctuation width issue also go away with double precision?
The heavy summation happens mainly in the mode<->population transformations. Maybe Kahan summation in a few strategic places would allow us to stay with single precision; a sketch follows below.
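For reference, a minimal sketch of compensated (Kahan) summation, written in Python for brevity; the actual change would live in the CUDA mode<->population transforms, and the test values below are illustrative only:

```python
import numpy as np

def kahan_sum32(values):
    # Compensated (Kahan) summation in float32: "c" carries the low-order
    # bits that a plain accumulation would discard at every addition.
    total = np.float32(0.0)
    c = np.float32(0.0)
    for v in values:
        y = np.float32(v) - c
        t = total + y
        c = (t - total) - y
        total = t
    return total

# One large value plus many small ones: a plain float32 sum drops the
# small contributions entirely, Kahan keeps them.
values = np.float32([1.0] + [1e-8] * 100000)
naive = np.float32(0.0)
for v in values:
    naive += v
print("naive float32:", naive)                # ~1.0, small terms lost
print("Kahan float32:", kahan_sum32(values))  # ~1.001
print("float64 ref  :", values.astype(np.float64).sum())
```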
| GPU | Timing float [s/step] | Timing double [s/step] | ratio float/double |
|---|---|---|---|
| GeForce GTX 780 Ti | 6.8347299099e-05 | 0.000103395255566 | 0.6610293550207637 |
| GeForce GTX 980 | 5.94442596436e-05 | 9.95463490486e-05 | 0.5971515802611549 |
| GeForce GTX 1070 | 7.30076231956e-05 | 0.000126591158867 | 0.5767197634418034 |
| GeForce GTX 1080 Ti | 3.57395396233e-05 | 5.6533952713e-05 | 0.6321783266196719 |
That seems not too bad.
On a short timescale (~1 month), I think that going to double on the GPU is the only fix I personally can contribute.
Script used:
from __future__ import print_function
import numpy as np
import time
import espressomd
import espressomd.lb

N_SAMPLES = 50
N_STEPS = 10000
AGRID = 0.5
DENS = 1.0
VISC = 1.0
FRIC = 1.0
TAU = 0.01
LB_PARAMETERS = {'agrid': AGRID,
                 'dens': DENS,
                 'visc': VISC,
                 'fric': FRIC,
                 'tau': TAU}

system = espressomd.System(box_l=[10.0] * 3)
system.time_step = TAU
system.cell_system.skin = 0.4 * AGRID
lbf = espressomd.lb.LBFluidGPU(**LB_PARAMETERS)
system.actors.add(lbf)
system.thermostat.set_lb(kT=1.0)

# measure the wall-clock time per integration step
timings = np.zeros(N_SAMPLES)
for i in range(N_SAMPLES):
    start = time.time()
    system.integrator.run(N_STEPS)
    stop = time.time()
    timings[i] = (stop - start) / float(N_STEPS)

print(np.mean(timings), np.var(timings))
~~I haven't checked for the temperature fluctuation yet.~~
Using doubles, the lb.py test case passes with the same precision requirements for CPU and GPU LB.
We would essentially need to replace float with a custom Kahan-summed datatype so we can keep the correction term for the entirety of the simulation. If we only did Kahan summation inside the mode transformation and then converted back to a plain float, we would gain only 1-2 orders of magnitude in precision, which is not enough.
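A sketch of that idea, in Python for brevity; the name `CompensatedFloat32` is hypothetical, and the real implementation would be a CUDA struct stored alongside the populations. The point is that the correction term lives with the value for the whole simulation instead of being dropped after each transform:

```python
import numpy as np

class CompensatedFloat32:
    """A float32 value that keeps its Kahan correction term for its lifetime."""

    def __init__(self, value=0.0):
        self.value = np.float32(value)
        self.carry = np.float32(0.0)  # low-order bits lost so far

    def add(self, x):
        y = np.float32(x) - self.carry
        t = self.value + y
        self.carry = (t - self.value) - y
        self.value = t
        return self

# Usage sketch: under the add/subtract pattern that causes the mass loss
# above, the compensated value drifts far less than a plain float32.
rng = np.random.RandomState(1)
rho = CompensatedFloat32(1.0)
plain = np.float32(1.0)
for _ in range(100000):
    d = np.float32(rng.normal(scale=1e-3))
    rho.add(d).add(-d)
    plain = (plain + d) - d
print("compensated drift:", abs(float(rho.value) - 1.0))
print("plain float32 drift:", abs(float(plain) - 1.0))
```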
Update: after eliminating the leftover floats, the situation looks quite different:
| GPU | Timing float [s/step] | Timing double [s/step] | ratio double/float |
|---|---|---|---|
| GeForce GTX 680 | 1.78741159439e-05 | 9.24656462669e-05 | 5.173159140128341 |
| GeForce GTX 780 Ti | 1.5184492588e-05 | 6.42595734596e-05 | 4.23192102648 |
| GeForce GTX 980 | 1.41606082916e-05 | 6.81742568016e-05 | 4.81435934091 |
| GeForce GTX 1070 | 8.32125854492e-06 | 5.98724694252e-05 | 7.19512187994 |
| GeForce GTX 1080 Ti | 1.16311063766e-05 | 3.62498044968e-05 | 3.1166256522 |
So your tests two weeks ago used double only in some places, which was sufficient to make the test case pass but cost only a factor of two in run time? There is no technical reason why we have to use double everywhere, so let's just use it in the critical places.