first touch is broken: what to do
first touch is currently broken, as it still assumes the Devito 2.0 Operator/Propagator.
I think we can drop it, as the solution is to use MPI across sockets.
You're saying two things here:
1. first touch wasn't updated to use the new API
2. do we need it at all?
Interestingly I discovered this around the same time that you posted while trying to figure out why the checkpointing runs spent so much time executing on a single core.
When first touch is disabled we use `self.data.fill(0)` to initialise the array to 0. This seems to execute on a single core and take longer than the operator itself. Admittedly, we do need to verify this more rigorously.
If the above is true, I think the solution is to use a "first-touch-like" initialisation through our JIT engine regardless of whether first touch is required or not. Hence the real issue here would be 1. above and not 2.: the function needs to be updated to use the new API, and we should always be using it to initialise the arrays we create.
I will fix that in an upcoming PR.
The following code snippet was tested on Richter:
```python
from time import time
from devito import TimeData

u = TimeData(name='u', shape=(1000, 200, 200, 200))
start = time()
u.data[:] = 1
end = time()
print(end - start)
```
This executed in (38.38, 122.63, 31.75)s over three runs.
The snippet was modified to this:
```python
from time import time
from devito import TimeData

u = TimeData(name='u', shape=(1000, 200, 200, 200))
start = time()
u.data.fill(1)
end = time()
print(end - start)
```
Execution times: (30.50, 29.73, 29.89)s
30 seconds to initialise an array should be a cause for immediate concern
And now on the `first_touch_again` branch, this snippet:
```python
from time import time
from devito import TimeData

u = TimeData(name='u', shape=(1000, 200, 200, 200), first_touch=True)
start = time()
u.data[:] = 1
end = time()
print(end - start)
```
Execution times: (17.33, 17.62, 17.13)s
Note that in the above snippet, the line `u.data[:] = 1` still runs on a single core, so the timings above reflect a "parallel initialisation" followed by a "serial initialisation". Soon we might need to override the `[:]` operator to replace that too. But for now, I think this shows a non-trivial amount of time being wasted on single-core initialisation.
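To make the "override `[:]`" idea concrete, here is a minimal sketch (plain Python, not Devito code) of how a data wrapper could intercept full-slice assignment and route it to a fast initialisation path; `InterceptingData` and its fast path are hypothetical stand-ins, not Devito's actual API:

```python
# Sketch: intercept ``obj[:] = value`` so a full-array fill could be
# dispatched to a parallel (e.g. JIT-compiled) initialiser instead of
# a serial element-by-element loop.

class InterceptingData:
    """Wraps a flat buffer and intercepts full-slice assignment."""

    def __init__(self, size):
        self._buf = [0.0] * size
        self.parallel_fills = 0  # counts how often the fast path was taken

    def __setitem__(self, key, value):
        if key == slice(None):
            # A full slice is the case we would hand to the fast path.
            self.parallel_fills += 1
            self._buf = [float(value)] * len(self._buf)  # stand-in fill
        else:
            self._buf[key] = value  # ordinary assignment

    def __getitem__(self, key):
        return self._buf[key]


u = InterceptingData(10)
u[:] = 1      # routed through the "parallel" path
u[0] = 5.0    # ordinary element assignment
print(u.parallel_fills, u[0], u[1])  # -> 1 5.0 1.0
```

For a real `numpy.ndarray`, this would mean subclassing and overriding `__setitem__`, which is why the thread later questions whether it can be automated.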
With #332 merged, this test can be carried out and reproduced by everyone on master. Let's continue this conversation? @FabioLuporini @mlange05
Well, what conversation? "first touch is broken" is no longer true, and since this is now hopefully tested sufficiently, we can close this issue. There are a bunch of follow-on questions though:
- Parallel scalability across NUMA domains: With FT, can we reliably scale to more than 12 cores on a 24-core/2-domain machine like richter? Is there a penalty? How does this play with AT?
- FT for faster data initialisation: Probably, but should we make it the default? For that we need to assess the impact of our current forced re-compilation, and then probably find a clever way of doing generic operator caching, i.e. issue #186; or should we cache some dedicated initialisation operators (per shape, maybe)?
Then again, any decent HPC Python distribution, such as Intel's, will have optimized basic numpy operators, so that your `np.fill(0.)` is just as fast and doesn't need our toolchain. In fact, we probably should benchmark this again with the Intel Python distro if somebody has some time to burn.
As for overriding the `[:]` operator, I don't think that can be automated, since the `:` operator is defined by the numpy array that is `u.data`. We can and should, of course, write complex initialisation expressions as operators wherever we can, but that is pretty much a "best practices" item for users.
> 30 seconds to initialise an array should be a cause for immediate concern
Actually, that also depends on context, since 30s can be very little or a lot. What's the total runtime of the operator for that data size? For example, is a ~15s speed-up worth it if my overall runtime is 6min?
> For example, is a ~15s speed-up worth it if my overall runtime is 6min?
As the numbers above show, that is not the case here. What takes ~15s with numpy takes ~4 seconds with an `Operator`.
Not my point. What I mean is, what percentage of the overall runtime of my application/PDE operator does the allocation take? A 10x speedup is still useless if the component I speed up only takes 0.01 of my overall runtime (generically speaking).
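The point being made here is just Amdahl's law: the overall speedup is bounded by the fraction of runtime the accelerated component occupies. A quick worked example (the fractions are illustrative, not measured Devito numbers):

```python
def overall_speedup(fraction, component_speedup):
    """Amdahl's law: total speedup when only `fraction` of the
    runtime is accelerated by a factor of `component_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


# A 10x speedup on a component that is 1% of the runtime: ~1.009x overall.
print(round(overall_speedup(0.01, 10.0), 4))  # -> 1.0091

# The same 10x speedup when the component is half the runtime: ~1.82x overall.
print(round(overall_speedup(0.5, 10.0), 4))   # -> 1.8182
```

So whether the initialisation speedup matters hinges on how large a share of the total run it is, which is exactly what the next exchange tries to measure.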
> what percentage of the overall runtime of my application/PDE operator does the allocation take
You have to define "the application" better, for an answer to that. Is it a forward run or a full forward-reverse? Or is it an even bigger adjoint/gradient test? Is it checkpointed? Tell me which one you're interested in and I'll measure it because I'm fairly certain I noticed a big drop in total runtime in the big runs after this.
Ok, let's start with a simple forward-reverse; what's the cost of each operator and the cost of the sequential/parallel initialisation for a given (large) grid size/number of timesteps?
Size: (1264 (time), 260, 260, 260). Best of 3 runs:
- Gradient test forward run: 39.602s (pure Operator time, as reported by the in-built profiler)
- Reverse run: 40.65s
- Double initialisation script as above, with `first_touch=True`: 31.29s
- Same script with `first_touch=False`: 80.97s
First touch is currently broken due to #320
Since we have been discussing #320 a bit lately, and the lack of first touch is proving to be a killer for KNL, it's worth pointing out that first touch is actually broken for what I think may be a deeper reason.
This is the code that does it:
```python
def first_touch(array):
    """
    Uses an Operator to initialize the given array in the same pattern that
    would later be used to access it.
    """
    devito.Operator(Eq(array, 0.))()
```
and this is an example of such an operator:
```
In [2]: grid = Grid(shape=(100, 100))

In [3]: u = TimeFunction(name='u', grid=grid)

In [4]: Operator(Eq(u, 0))
Out[4]:
Function[Kernel]<int; float,int,int,void*,int,int,int,int,int,int>::
float (*restrict u)[x_size + 1 + 1][y_size + 1 + 1] __attribute__((aligned(64))) = (float (*)[x_size + 1 + 1][y_size + 1 + 1]) u_vec;
struct profile *timers = (struct profile*) _timers;
/* Flush denormal numbers to zero in hardware */
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
for (int time = time_m, t0 = (time)%(1); time <= time_M; time += 1, t0 = (time)%(1))
{
  struct timeval start_section_0, end_section_0;
  gettimeofday(&start_section_0, NULL);
  for (int x = x_m; x <= x_M; x += 1)
  {
    #pragma omp simd
    for (int y = y_m; y <= y_M; y += 1)
    {
      u[t0][x + 1][y + 1] = 0;
    }
  }
  gettimeofday(&end_section_0, NULL);
  timers->section_0 += (double)(end_section_0.tv_sec-start_section_0.tv_sec)+(double)(end_section_0.tv_usec-start_section_0.tv_usec)/1000000;
}

In [5]: u.shape
Out[5]: (2, 100, 100)
```
So even if this would compile, e.g. by making the time loop not parallel, there are still at least two issues:
- We would run from `time_m` to `time_M` to zero the array, which is obviously way more than necessary in the `save=None` case.
- We actually only zero half the array anyway: `t0 = (time)%(1)` is always 0.
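The second issue is plain modular arithmetic: anything mod 1 is zero, so only time-buffer slot 0 is ever written. A tiny check contrasting this with the modulus the generated code presumably should have used (the number of time slots, here 2 for the double-buffered `TimeFunction` above):

```python
time_slots = 2  # u.shape[0] for the double-buffered TimeFunction above

# t0 = time % 1 is identically zero: only slot 0 is ever touched.
print(all(t % 1 == 0 for t in range(1000)))  # -> True

# t0 = time % time_slots cycles over every slot, touching the whole buffer.
print({t % time_slots for t in range(1000)})  # -> {0, 1}
```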
> the lack of first touch is proving to be a killer for KNL
I should probably elaborate here. There is typically no NUMA locality issue (at least not one resolved by first touch) on KNL. The killer is the low performance of a single thread faulting and zeroing all the pages.
```python
def first_touch(array):
    """
    Uses an Operator to initialize the given array in the same pattern that
    would later be used to access it.
    """
    if array.indices[0].is_Stepping:
        devito.Operator(Eq(array.forward, array))(time_M=array.time_order)
    else:
        devito.Operator(Eq(array, 0.))()
```
This seems to work fine
> `devito.Operator(Eq(array.forward, array))`
Well, that copies the first timestep to the later ones; it presumably still relies on the first timestep being initialised to zero. This may already be guaranteed by some of the allocators. Certainly anything based on `mmap()` ought to guarantee zeroed memory.
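The zero-fill guarantee for anonymous mappings can be checked directly from Python with the standard-library `mmap` module (this demonstrates the OS behaviour, not Devito's allocators):

```python
import mmap

# Anonymous mmap'd memory is delivered zero-filled by the OS, so data
# backed by such a mapping starts out zeroed without an explicit fill.
buf = mmap.mmap(-1, 4096)  # anonymous mapping of 4096 bytes
print(buf[:8] == b"\x00" * 8)  # -> True
buf.close()
```

Note, however, that the zero pages are typically populated lazily on first write, so this guarantee says nothing about *where* (on which NUMA domain) the pages end up, which is the part first touch is meant to control.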
@tjb900 how do you do this on KNL nowadays?
Might the following just work?
```python
from devito import *

grid = Grid(shape=(10, 10, 10))
f = TimeFunction(name='f', grid=grid)

op = Operator([Eq(f[(i,) + grid.dimensions], 0) for i in range(f.time_order)])
op.apply()
```
Note that this should/could be provided as a builtin in `builtins.py`. We already have `assign`. This seems to be a relaxed version for `TimeFunction` (relaxed because we zero out `time_order` slots, instead of `f._time_size`).
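To illustrate what "relaxed" means here, a plain-Python model (not Devito internals; `zero_leading_slots` is a hypothetical name) of zeroing only the first `time_order` slots of the time buffer while leaving the remaining `time_size - time_order` slots untouched:

```python
# Illustrative model: a TimeFunction's data has `time_size` buffer slots
# along the leading (time) dimension, but the relaxed builtin only needs
# to zero the first `time_order` of them.

def zero_leading_slots(data, time_order):
    """Zero only the first `time_order` time slots of `data`."""
    for i in range(time_order):
        data[i] = [0.0] * len(data[i])
    return data


time_size, time_order, nx = 3, 2, 4            # e.g. a 2nd-order-in-time buffer
data = [[9.9] * nx for _ in range(time_size)]  # stand-in for uninitialised memory

zero_leading_slots(data, time_order)
print(data[0], data[1], data[2])  # first two slots zeroed, last left untouched
```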
Can someone try this and, if happy, produce a patch?
@FabioLuporini I've tried this and it seems to be working, and it looks like it is faster than using `fill(0.)`.
Thanks! OK, then this should go into master in some form. Anybody willing to write a patch?
Can we close this? Don't we always want to avoid threaded+NUMA anyway?
More than first touch per se, the issue is "parallel initialization". Do our `assign` or `initialize_function` builtins suffice? Do they handle `TimeFunction` correctly? @rhodrin @mloubout
closing as now irrelevant