first touch is broken: what to do
first touch is currently broken, as it still assumes the Devito 2.0 Operator/Propagator.
I think we can drop it, as the solution is to use MPI across sockets.
You're saying two things here:
1. first touch wasn't updated to use the new API
2. do we need it at all?
Interestingly I discovered this around the same time that you posted while trying to figure out why the checkpointing runs spent so much time executing on a single core.
When first touch is disabled we use `self.data.fill(0)` to initialise the array to 0. This seems to execute on a single core and take longer than the operator itself. Admittedly, we do need to verify this more rigorously.
If the above is true, I think the solution is to use a "first-touch-like" initialisation through our JIT engine regardless of whether first touch is required or not. Hence the real issue here would be 1. above and not 2.: the function needs to be updated to use the new API, and we should always be using it to initialise the arrays we create.
I will fix that in an upcoming PR.
The following code snippet was tested on Richter:
```python
from time import time
from devito import TimeData

u = TimeData(name='u', shape=(1000, 200, 200, 200))
start = time()
u.data[:] = 1
end = time()
print(end - start)
```
This executed in (38.38, 122.63, 31.75)s over three runs.
The snippet was modified to this:
```python
from time import time
from devito import TimeData

u = TimeData(name='u', shape=(1000, 200, 200, 200))
start = time()
u.data.fill(1)
end = time()
print(end - start)
```
Execution times: (30.50, 29.73, 29.89)s
30 seconds to initialise an array should be a cause for immediate concern
And now on the `first_touch_again` branch, this snippet:
```python
from time import time
from devito import TimeData

u = TimeData(name='u', shape=(1000, 200, 200, 200), first_touch=True)
start = time()
u.data[:] = 1
end = time()
print(end - start)
```
Execution times: (17.33, 17.62, 17.13)s
Note that in the above snippet, the line `u.data[:] = 1` still runs on a single core, so the timings above reflect a "parallel initialisation" followed by a "serial initialisation". Soon we might need to override the `[:]` operator to replace that too. But for now, I think this shows a non-trivial amount of time being wasted on single-core initialisation.
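To make the "override `[:]`" idea concrete, here is a minimal sketch (plain Python, not Devito code) of how a data wrapper could intercept full-slice assignment and route it to a fast initialisation path; `InterceptingData` and its fast path are hypothetical stand-ins, not Devito's actual API:

```python
# Sketch: intercept ``obj[:] = value`` so a full-array fill could be
# dispatched to a parallel (e.g. JIT-compiled) initialiser instead of
# a serial element-by-element loop.

class InterceptingData:
    """Wraps a flat buffer and intercepts full-slice assignment."""

    def __init__(self, size):
        self._buf = [0.0] * size
        self.parallel_fills = 0  # counts how often the fast path was taken

    def __setitem__(self, key, value):
        if key == slice(None):
            # A full slice is the case we would hand to the fast path.
            self.parallel_fills += 1
            self._buf = [float(value)] * len(self._buf)  # stand-in fill
        else:
            self._buf[key] = value  # ordinary assignment

    def __getitem__(self, key):
        return self._buf[key]


u = InterceptingData(10)
u[:] = 1      # routed through the "parallel" path
u[0] = 5.0    # ordinary element assignment
print(u.parallel_fills, u[0], u[1])  # -> 1 5.0 1.0
```

For a real `numpy.ndarray`, this would mean subclassing and overriding `__setitem__`, which is why the thread later questions whether it can be automated.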
With #332 merged, this test can be carried out and reproduced by everyone on master. Let's continue this conversation? @FabioLuporini @mlange05
Well, what conversation? "first touch is broken" is no longer true, and since this is now hopefully tested sufficiently, we can close this issue. There are a bunch of follow-on questions though:
- Parallel scalability across NUMA domains: With FT, can we reliably scale to more than 12 cores on a 24-core/2-domain machine like richter? Is there a penalty? How does this play with AT?
- FT for faster data initialisation: Probably, but should we make it the default? For that we need to assess the impact of our current forced re-compilation, and then probably find a clever way of doing generic operator caching, i.e. issue #186; or should we cache some dedicated initialisation operators (per shape, maybe)?
Then again, any decent HPC Python distribution, such as Intel's, will have optimized basic numpy operators, so that your `np.fill(0.)` is just as fast and doesn't need our toolchain. In fact, we probably should benchmark this again with the Intel Python distro if somebody has some time to burn.
As for overriding the `[:]` operator, I don't think that can be automated, since the `:` operator is defined by the numpy array that is `u.data`. We can and should, of course, write complex initialisation expressions as operators wherever we can, but that is pretty much a "best practices" item for users.
> 30 seconds to initialise an array should be a cause for immediate concern
Actually, that also depends on context, since 30s can be very little or a lot. What's the total runtime of the operator for that data size? For example, is a ~15s speed-up worth it if my overall runtime is 6min?
> For example, is a ~15s speed-up worth it if my overall runtime is 6min?
As the numbers above show, that is not the case here. What takes ~15s with numpy takes ~4 seconds with an `Operator`.
Not my point. What I mean is, what percentage of the overall runtime of my application/PDE operator does the allocation take? A 10x speedup is still useless if the component I speed up only takes 0.01 of my overall runtime (generically speaking).
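The point being made here is just Amdahl's law: the overall speedup is bounded by the fraction of runtime the accelerated component occupies. A quick worked example (the fractions are illustrative, not measured Devito numbers):

```python
def overall_speedup(fraction, component_speedup):
    """Amdahl's law: total speedup when only `fraction` of the
    runtime is accelerated by a factor of `component_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


# A 10x speedup on a component that is 1% of the runtime: ~1.009x overall.
print(round(overall_speedup(0.01, 10.0), 4))  # -> 1.0091

# The same 10x speedup when the component is half the runtime: ~1.82x overall.
print(round(overall_speedup(0.5, 10.0), 4))   # -> 1.8182
```

So whether the initialisation speedup matters hinges on how large a share of the total run it is, which is exactly what the next exchange tries to measure.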
> what percentage of the overall runtime of my application/PDE operator does the allocation take
You have to define "the application" better, for an answer to that. Is it a forward run or a full forward-reverse? Or is it an even bigger adjoint/gradient test? Is it checkpointed? Tell me which one you're interested in and I'll measure it because I'm fairly certain I noticed a big drop in total runtime in the big runs after this.
Ok, let's start with a simple forward-reverse; what's the cost of each operator and the cost of the sequential/parallel initialisation for a given (large) grid size/number of timesteps?
Size: (1264 (time), 260, 260, 260). Best of 3 runs:
- Gradient test forward run: 39.602s (pure Operator time, as reported by the in-built profiler)
- Reverse run: 40.65s
- Double initialisation script as above, with `first_touch=True`: 31.29s
- Same script with `first_touch=False`: 80.97s
First touch is currently broken due to #320
Since we have been discussing #320 a bit lately, and the lack of first touch is proving to be a killer for KNL, it's worth pointing out that first touch is actually broken for what I think may be a deeper reason.
This is the code that does it:
```python
def first_touch(array):
    """
    Uses an Operator to initialize the given array in the same pattern that
    would later be used to access it.
    """
    devito.Operator(Eq(array, 0.))()
```
and this is an example of such an operator:
```
In [2]: grid = Grid(shape=(100, 100))

In [3]: u = TimeFunction(name='u', grid=grid)

In [4]: Operator(Eq(u, 0))
Out[4]:
Function[Kernel]<int; float,int,int,void*,int,int,int,int,int,int>::
float (*restrict u)[x_size + 1 + 1][y_size + 1 + 1] __attribute__((aligned(64))) = (float (*)[x_size + 1 + 1][y_size + 1 + 1]) u_vec;
struct profile *timers = (struct profile*) _timers;
/* Flush denormal numbers to zero in hardware */
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
for (int time = time_m, t0 = (time)%(1); time <= time_M; time += 1, t0 = (time)%(1))
{
  struct timeval start_section_0, end_section_0;
  gettimeofday(&start_section_0, NULL);
  for (int x = x_m; x <= x_M; x += 1)
  {
    #pragma omp simd
    for (int y = y_m; y <= y_M; y += 1)
    {
      u[t0][x + 1][y + 1] = 0;
    }
  }
  gettimeofday(&end_section_0, NULL);
  timers->section_0 += (double)(end_section_0.tv_sec-start_section_0.tv_sec)+(double)(end_section_0.tv_usec-start_section_0.tv_usec)/1000000;
}

In [5]: u.shape
Out[5]: (2, 100, 100)
```
So even if this would compile, e.g. by making the time loop not parallel, there are still at least two issues:
- We would run from `time_m` to `time_M` to zero the array, which is obviously way more than necessary in the `save=None` case.
- We actually only zero half the array anyway: `t0 = (time)%(1)` is always 0.
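The second issue is plain modular arithmetic: anything mod 1 is zero, so only time-buffer slot 0 is ever written. A tiny check contrasting this with the modulus the generated code presumably should have used (the number of time slots, here 2 for the double-buffered `TimeFunction` above):

```python
time_slots = 2  # u.shape[0] for the double-buffered TimeFunction above

# t0 = time % 1 is identically zero: only slot 0 is ever touched.
print(all(t % 1 == 0 for t in range(1000)))  # -> True

# t0 = time % time_slots cycles over every slot, touching the whole buffer.
print({t % time_slots for t in range(1000)})  # -> {0, 1}
```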
> the lack of first touch is proving to be a killer for KNL
I should probably elaborate here. There is typically no NUMA locality issue (at least not one resolved by first touch) on KNL. The killer is the low performance of a single thread faulting and zeroing all the pages.
```python
def first_touch(array):
    """
    Uses an Operator to initialize the given array in the same pattern that
    would later be used to access it.
    """
    if array.indices[0].is_Stepping:
        devito.Operator(Eq(array.forward, array))(time_M=array.time_order)
    else:
        devito.Operator(Eq(array, 0.))()
```
This seems to work fine
> `devito.Operator(Eq(array.forward, array))`
Well, that copies the first timestep to the later ones; it presumably still relies on the first timestep being initialised to zero. This may already be guaranteed by some of the allocators. Certainly anything based on `mmap()` ought to guarantee zeroed memory.
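The zero-fill guarantee for anonymous mappings can be checked directly from Python with the standard-library `mmap` module (this demonstrates the OS behaviour, not Devito's allocators):

```python
import mmap

# Anonymous mmap'd memory is delivered zero-filled by the OS, so data
# backed by such a mapping starts out zeroed without an explicit fill.
buf = mmap.mmap(-1, 4096)  # anonymous mapping of 4096 bytes
print(buf[:8] == b"\x00" * 8)  # -> True
buf.close()
```

Note, however, that the zero pages are typically populated lazily on first write, so this guarantee says nothing about *where* (on which NUMA domain) the pages end up, which is the part first touch is meant to control.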
@tjb900 how do you do this on KNL nowadays?
Might the following just work?
```python
from devito import *

grid = Grid(shape=(10, 10, 10))
f = TimeFunction(name='f', grid=grid)

op = Operator([Eq(f[(i,) + grid.dimensions], 0) for i in range(f.time_order)])
op.apply()
```
Note that this should/could be provided as a builtin in `builtins.py`. We already have `assign`. This seems to be a relaxed version for `TimeFunction` (relaxed because we zero out `time_order` slots, instead of `f._time_size`).
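To illustrate what "relaxed" means here, a plain-Python model (not Devito internals; `zero_leading_slots` is a hypothetical name) of zeroing only the first `time_order` slots of the time buffer while leaving the remaining `time_size - time_order` slots untouched:

```python
# Illustrative model: a TimeFunction's data has `time_size` buffer slots
# along the leading (time) dimension, but the relaxed builtin only needs
# to zero the first `time_order` of them.

def zero_leading_slots(data, time_order):
    """Zero only the first `time_order` time slots of `data`."""
    for i in range(time_order):
        data[i] = [0.0] * len(data[i])
    return data


time_size, time_order, nx = 3, 2, 4            # e.g. a 2nd-order-in-time buffer
data = [[9.9] * nx for _ in range(time_size)]  # stand-in for uninitialised memory

zero_leading_slots(data, time_order)
print(data[0], data[1], data[2])  # first two slots zeroed, last left untouched
```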
Can someone try this and, if happy, produce a patch?
@FabioLuporini I've tried this and it seems to be working, and it looks like it is faster than using `fill(0.)`.
Thanks! OK, then this should go into master in some form. Anybody willing to write a patch?
Can we close this? Don't we always want to avoid threaded+NUMA anyway?
More than first touch per se, the issue is "parallel initialization". Do our `assign` or `initialize_function` builtins suffice? Do they handle `TimeFunction` correctly? @rhodrin @mloubout
closing as now irrelevant