devito icon indicating copy to clipboard operation
devito copied to clipboard

first touch is broken: what to do

Open FabioLuporini opened this issue 6 years ago • 22 comments

first touch is currently broken, as it still assumes the Devito2.0 Operator/Propagator.

I think we can drop it, as the solution is to use MPI across sockets.

FabioLuporini avatar Aug 11 '17 12:08 FabioLuporini

You're saying two things here:

  1. first touch wasn't updated to use the new API
  2. Do we need it?

Interestingly I discovered this around the same time that you posted while trying to figure out why the checkpointing runs spent so much time executing on a single core.

In case first touch is disabled we use self.data.fill(0) to initialise the array to 0. This seems to be executing on a single core and taking longer than the operator itself. Admittedly, we do need to verify this more rigorously.

If the above is true, I think the solution is to use a "first-touch-like" initialisation using our JIT engine regardless of whether first touch is required or not. Hence, the real issue here would be 1. above and not 2. That the function needs to be updated to use the new API - and we should always be using this to initialise arrays we create.

I will fix that in an upcoming PR.

navjotk avatar Aug 11 '17 14:08 navjotk

The following code snippet was tested on Richter:

from time import time
from devito import TimeData

u = TimeData(name='u', shape=(1000, 200, 200, 200))
start = time()
u.data[:]=1
end = time()
print(end-start)

This executed in (38.38, 122.63, 31.75)s

The snippet was modified to this:

from time import time
from devito import TimeData

u = TimeData(name='u', shape=(1000, 200, 200, 200))
start = time()
u.data.fill(1)
end = time()
print(end-start)

Execution times: (30.50, 29.73, 29.89)s

30 seconds to initialise an array should be a cause for immediate concern

navjotk avatar Aug 16 '17 16:08 navjotk

And now on the first_touch_again branch, this snippet:

from time import time
from devito import TimeData

u = TimeData(name='u', shape=(1000, 200, 200, 200), first_touch=True)
start = time()
u.data[:]=1
end = time()
print(end-start)

Execution times: (17.33, 17.62, 17.13)s Note that in the above snippet, the line u.data[:]=1 is still run on a single-core so the timings above reflect a "parallel initialisation" followed by a "serial initialisation". Soon we might need to override the [:] operator to replace that too. But for now, I think this shows a non-trivial amount of time being wasted doing single-core initialisation.

navjotk avatar Aug 17 '17 17:08 navjotk

With #332 in master, this test can be carried out and reproduced by everyone here on master. Let's continue this conversation? @FabioLuporini @mlange05

navjotk avatar Aug 18 '17 14:08 navjotk

Well, what conversation? "first touch is broken" is not longer true, and since this is now hopefully tested sufficiently, we can close this issue. There are a bunch of follow-on questions though:

  • Parallel scalability across NUMA domains: With FT, can we reliably scale to more than 12 cores on a 24-core/2-domain machine, like richter? Is there a penalty? How does this play with AT?
  • FT for faster data initialisation: Probably, but should we make it default? For that we need to assess the impact of our current forced re-compilation - and then probably find a clever way of doing generic operator caching, ie. issue #186; or should we cache some dedicated initialisation operators (per shape, maybe)?

Then again, any decent HPC Python distribution, such as Intel's, will have optimized basic numpy operators, so that your np.fill(0.) is just as fast and doesn't need our toolchain. In fact, we probably should benchmark this again with the Intel Python distro if somebody has some time to burn.

As for overriding the [:] operator, I don't think that can be automated, since the : operator is defined by the numpy array that is u.data. We can and should, of course, write complex initialisation expressions as operators wherever we can, but that is pretty much a "best practices" item for users.

mlange05 avatar Aug 18 '17 15:08 mlange05

30 seconds to initialise an array should be a cause for immediate concern

Actually, that also depends on context, since 30s can be very little or a lot. What's the total runtime of the operator for that data size? For example, is a ~15s speed-up worth it if my overall runtime is 6min?

mlange05 avatar Aug 18 '17 15:08 mlange05

For example, is a ~15s speed-up worth it if my overall runtime is 6min?

As the numbers above show, that is not the case here. What takes ~ 15s with numpy takes ~ 4 seconds with an Operator.

navjotk avatar Aug 18 '17 15:08 navjotk

Not my point. What I mean is, what percentage of overall runtime of my application/PDE operator does the allocation take? A 10x speedup is still useless if the component I speed up only takes, 0.01 of my overall runtime (generically speaking).

mlange05 avatar Aug 18 '17 15:08 mlange05

what percentage of overall runtime of my application/PDE operator does the allocation take

You have to define "the application" better, for an answer to that. Is it a forward run or a full forward-reverse? Or is it an even bigger adjoint/gradient test? Is it checkpointed? Tell me which one you're interested in and I'll measure it because I'm fairly certain I noticed a big drop in total runtime in the big runs after this.

navjotk avatar Aug 18 '17 15:08 navjotk

Ok, let's start with a simple forward-reverse; what's the cost of each operator and the cost of the sequential/parallel initialisation for a given (large) grid size/number of timesteps?

mlange05 avatar Aug 18 '17 15:08 mlange05

Size: (1264 (time), 260, 260, 260) Best of 3 runs: Gradient test forward run: 39.602s (pure Operator time, as reported by in-built profiler) Reverse run: 40.65s

Double initialisation script as above, with first_touch=True: 31.29s with first_touch=False: 80.97s

navjotk avatar Aug 21 '17 14:08 navjotk

First touch is currently broken due to #320

navjotk avatar Mar 19 '18 16:03 navjotk

Since we have been discussing #320 a bit lately, and the lack of first touch is proving to be a killer for KNL, it's worth pointing out that first touch is actually broken for what I think may be a deeper reason.

This is the code that does it:

def first_touch(array):                                                                                                                                                                                         │···········································
    """                                                                                                                                                                                                         │···········································
    Uses an Operator to initialize the given array in the same pattern that                                                                                                                                     │···········································
    would later be used to access it.                                                                                                                                                                           │···········································
    """                                                                                                                                                                                                         │···········································
    devito.Operator(Eq(array, 0.))()

and this is an example of such an operator:

In [2]: grid = Grid(shape=(100,100))                                                                                                                                                                            │···········································
                                                                                                                                                                                                                │···········································
In [3]: u = TimeFunction(name='u', grid=grid)                                                                                                                                                                   │···········································
                                                                                                                                                                                                                │···········································
In [4]: Operator(Eq(u,0))                                                                                                                                                                                       │···········································
Out[4]:                                                                                                                                                                                                         │···········································
Function[Kernel]<int; float,int,int,void*,int,int,int,int,int,int>::                                                                                                                                            │···········································
        float (*restrict u)[x_size + 1 + 1][y_size + 1 + 1] __attribute__((aligned(64))) = (float (*)[x_size + 1 + 1][y_size + 1 + 1]) u_vec;                                                                   │···········································
struct profile *timers = (struct profile*) _timers;                                                                                                                                                             │···········································
/* Flush denormal numbers to zero in hardware */                                                                                                                                                                │···········································
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);                                                                                                                                                             │···········································
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);                                                                                                                                                                     │···········································
for (int time = time_m, t0 = (time)%(1); time <= time_M; time += 1, t0 = (time)%(1))                                                                                                                            │···········································
{                                                                                                                                                                                                               │···········································
  struct timeval start_section_0, end_section_0;                                                                                                                                                                │···········································
  gettimeofday(&start_section_0, NULL);                                                                                                                                                                         │···········································
  for (int x = x_m; x <= x_M; x += 1)                                                                                                                                                                           │···········································
  {                                                                                                                                                                                                             │···········································
    #pragma omp simd                                                                                                                                                                                            │···········································
    for (int y = y_m; y <= y_M; y += 1)                                                                                                                                                                         │···········································
    {                                                                                                                                                                                                           │···········································
      u[t0][x + 1][y + 1] = 0;                                                                                                                                                                                  │···········································
    }                                                                                                                                                                                                           │···········································
  }                                                                                                                                                                                                             │···········································
  gettimeofday(&end_section_0, NULL);                                                                                                                                                                           │···········································
  timers->section_0 += (double)(end_section_0.tv_sec-start_section_0.tv_sec)+(double)(end_section_0.tv_usec-start_section_0.tv_usec)/1000000;                                                                   │···········································
}
In [5]: u.shape                                                                                                                                                                                                 │···········································
Out[5]: (2, 100, 100)

So even if this would compile, e.g. by making the time loop not parallel, there are still at least two issues:

  1. We would run from time_m to time_M to 0 the array, which is obviously way more than necessary in the save=None case
  2. We actually only zero half the array anyway. t0=(time)%(1) is always 0.

tjb900 avatar Apr 25 '18 14:04 tjb900

the lack of first touch is proving to be a killer for KNL

I should probably elaborate here. There is typically no NUMA locality issue (at least not one resolved by first touch) on KNL. The killer is the low performance of a single thread faulting and zeroing all the pages.

tjb900 avatar Apr 25 '18 14:04 tjb900

def first_touch(array):
    """
    Uses an Operator to initialize the given array in the same pattern that
    would later be used to access it.
    """
    if array.indices[0].is_Stepping:
        devito.Operator(Eq(array.forward, array))(time_M=array.time_order)
    else:
        devito.Operator(Eq(array, 0.))()

This seems to work fine

mloubout avatar May 01 '18 16:05 mloubout

    devito.Operator(Eq(array.forward, array))

Well, that copies the first timestep to the later ones; it presumably still relies on the first timestep being initialised to zero. This may already be guaranteed by some of the allocators. Certainly anything based on mmap() ought to guarantee zeroed memory.

tjb900 avatar May 02 '18 08:05 tjb900

@tjb900 how do you do this on KNL nowadays ?

Might the following just work?

from devito import *                                                                                                                                                       
grid = Grid(shape=(10, 10, 10))                                                                                                                                                                            f = TimeFunction(name='f', grid=grid)
op = Operator([Eq(f[(i,) + grid.dimensions], 0) for i in range(f.time_order)])
op.apply()

Note that this should/could be provided as a builtin in builtins.py. We already have assign. This seems to be a relaxed version for TimeFunction (relaxed because we zero-out time_order slots, instead of f._time_size)

Can someone try this and, if happy, produce a patch?

FabioLuporini avatar Mar 22 '19 10:03 FabioLuporini

@FabioLuporini I've tried this and seems to be working and looks like it is faster than using fill(0.)

OldCarlosCueto avatar Mar 27 '19 12:03 OldCarlosCueto

Thanks! ok then this should somewhat go in master

FabioLuporini avatar Mar 27 '19 12:03 FabioLuporini

anybody willing to write a patch?

FabioLuporini avatar Apr 16 '19 13:04 FabioLuporini

Can we close this? Doesn't we always want to avoid threaded+NUMA.

ggorman avatar Jan 23 '20 01:01 ggorman

more than first touch per se, the issue is "parallel initialization". Do our assign or initialize_function builtins suffice ? do they handle TimeFunction correctly? @rhodrin @mloubout

FabioLuporini avatar Jan 24 '20 07:01 FabioLuporini

closing as now irrelevant

FabioLuporini avatar Mar 07 '23 10:03 FabioLuporini