
Strange performance differences between access to DASH local memory and locally malloc'd memory

Open fuerlinger opened this issue 6 years ago • 17 comments

When investigating the performance of the DASH Cowichan implementation compared to TBB and Cilk I noticed that, strangely, access to DASH local memory (via .lbegin()) appears to be significantly slower than access to regular locally allocated memory via malloc. Here is an example that demonstrates the behavior:

#include <unistd.h>
#include <iostream>
#include <cstddef>
#include <cstdlib>   // malloc/free, EXIT_SUCCESS
#include <sstream>

#include <libdash.h>

using namespace std;

#include <sys/time.h>
#include <time.h>

#define MYTIMEVAL( tv_ )                        \
  ((tv_.tv_sec)+(tv_.tv_usec)*1.0e-6)

#define TIMESTAMP( time_ )                                              \
  {                                                                     \
    static struct timeval tv;                                           \
    gettimeofday( &tv, NULL );                                          \
    time_=MYTIMEVAL(tv);                                                \
  }

//
// do some work and measure how long it takes
//
double do_work(int *beg, int nelem, int repeat)
{
  const int LCG_A = 1664525, LCG_C = 1013904223;
  
  int seed = 31337;    
  double start, end;

  TIMESTAMP(start);
  for( int j=0; j<repeat; j++ ) {
    for( int i=0; i<nelem; ++i ) {
      seed = LCG_A * seed + LCG_C;
      beg[i] = ((unsigned)seed) %100;
    }
  }
  TIMESTAMP(end);

  return end-start;
}

int main(int argc, char* argv[])
{
  dash::init(&argc, &argv);

  dash::Array<int> arr(100000000);

  int nelem = arr.local.size();
  
  int *mem = (int*) malloc(sizeof(int)*nelem);
  
  double dur1 = do_work(arr.lbegin(), nelem, 1);
  double dur2 = do_work(mem,          nelem, 1);
  
  cerr << "Unit " << dash::myid()
       << " DASH mem: " << dur1 << " secs"
       << " Local mem: " << dur2 << " secs" << endl;
  
  free(mem);

  dash::finalize();

  return EXIT_SUCCESS;
}

On my machine, when run with two units, I get the following significant performance differences:

Unit 1 DASH mem: 0.346513 secs Local mem: 0.234078 secs
Unit 0 DASH mem: 0.35398 secs Local mem: 0.232012 secs

The difference appears to vanish if the repeat factor is increased, but this is of no help in the context of the Cowichan problems.

I'm at a loss at the moment about the root cause of this difference. Alignment appears to play no role. @devreal: any idea what could be behind it and what could make access to window memory slower? Maybe different MPI window creation options? I'll try to investigate with hardware counters in the coming days. NUMA and memory paging are two possible culprits. I don't see how NUMA should have an influence in this context; with paging I could imagine that the window allocation causes the pages to be pinned and thus changes the access characteristics of the memory.
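As a starting point for the counter investigation, running the units under Linux perf would at least show whether the run incurs unusually many page faults or TLB misses (a sketch; the binary name is a placeholder and the events are the generic perf aliases):

mpirun -n 2 perf stat -e page-faults,minor-faults,dTLB-load-misses,cache-misses ./dash_vs_malloc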

fuerlinger avatar May 14 '18 13:05 fuerlinger

That is a strange thing indeed. I did some testing locally and can confirm this issue. I will try to boil it down to an MPI-only reproducer and if the problem persists contact the (Open) MPI people.
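Such a reproducer would essentially replace the dash::Array by an MPI shared-memory window (a sketch, reusing do_work and the timing macros from the example above; whether DASH's local allocation actually maps to MPI_Win_allocate_shared depends on the build configuration):

// sketch: compare writes to MPI_Win_allocate_shared memory vs. malloc'd memory
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  // node-local communicator, required for shared-memory windows
  MPI_Comm shmcomm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &shmcomm);

  const int nelem = 50000000;
  int *win_mem;
  MPI_Win win;
  MPI_Win_allocate_shared((MPI_Aint)nelem * sizeof(int), sizeof(int),
                          MPI_INFO_NULL, shmcomm, &win_mem, &win);

  int *heap_mem = (int*) malloc(sizeof(int) * nelem);

  double dur1 = do_work(win_mem,  nelem, 1);  // same LCG kernel as above
  double dur2 = do_work(heap_mem, nelem, 1);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("Rank %d window mem: %f secs, malloc mem: %f secs\n", rank, dur1, dur2);

  free(heap_mem);
  MPI_Win_free(&win);
  MPI_Comm_free(&shmcomm);
  MPI_Finalize();
  return EXIT_SUCCESS;
}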

What I tried so far:

  • Pre-initializing the memory using memset (no effect)
  • Looking at the memory alignment (8-byte with Open MPI, 16-byte for malloc). Adjusting the alignment of the window memory does not seem to make a difference.
  • Verifying that the compiler generates a single version of do_work, and not two copies of which one bakes in assumptions about the behavior of malloc

The memory for both allocations is pretty close together, so I am not sure what could make the difference.

@fuerlinger: for reference, which MPI implementation and compiler did you use?

devreal avatar May 14 '18 14:05 devreal

An update for this from the SuperMUC environment with Intel MPI 2018 and ICC 2018:

14 Units, Build Type Release.

Unit 4 DASH mem: 0.130188 secs Local mem: 0.023855 secs
Unit 9 DASH mem: 0.130548 secs Local mem: 0.023509 secs
Unit 5 DASH mem: 0.130306 secs Local mem: 0.023953 secs
Unit 7 DASH mem: 0.130623 secs Local mem: 0.023561 secs
Unit 8 DASH mem: 0.13067 secs Local mem: 0.023586 secs
Unit 2 DASH mem: 0.130733 secs Local mem: 0.023689 secs
Unit 3 DASH mem: 0.130747 secs Local mem: 0.0236819 secs
Unit 6 DASH mem: 0.13038 secs Local mem: 0.024066 secs
Unit 1 DASH mem: 0.130759 secs Local mem: 0.023735 secs
Unit 0 DASH mem: 0.130914 secs Local mem: 0.024004 secs
Unit 10 DASH mem: 0.130967 secs Local mem: 0.024343 secs
Unit 11 DASH mem: 0.130945 secs Local mem: 0.024339 secs
Unit 12 DASH mem: 0.130952 secs Local mem: 0.024312 secs
Unit 13 DASH mem: 0.130967 secs Local mem: 0.0244589 secs

rkowalewski avatar May 14 '18 14:05 rkowalewski

Can you try having only one process write to the memory? If multiple processes on the same node write to the MPI window memory, you might run into false-sharing effects, since the memory these processes share may be allocated as one contiguous chunk (to facilitate shared-memory access between the processes running on the same node).
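A quick way to check this with the example above would be to run the kernels on unit 0 only (a sketch of the change in main; the other units just wait at the barrier):

  // sketch: only unit 0 touches its memory, ruling out false sharing between
  // neighbouring units in the node-local shared segment
  if (dash::myid() == 0) {
    double dur1 = do_work(arr.lbegin(), nelem, 1);
    double dur2 = do_work(mem,          nelem, 1);
    cerr << "Unit 0 DASH mem: " << dur1 << " secs"
         << " Local mem: "      << dur2 << " secs" << endl;
  }
  dash::barrier();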

devreal avatar May 14 '18 14:05 devreal

Actually I would not expect many cache misses or false-sharing issues, since no process ever accesses another process's memory region. It may happen near the boundaries between processes, but that affects only a small fraction of the accesses.

However, it does indeed make a big difference. I modified the example with an outer loop to see the output for each repetition.
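The modification is roughly the following (a sketch around the measurement code in main; REPEAT is a new constant):

  // sketch: repeat the measurement to separate first-touch cost from the
  // steady state
  const int REPEAT = 10;
  for (int r = 0; r < REPEAT; ++r) {
    double dur1 = do_work(arr.lbegin(), nelem, 1);
    double dur2 = do_work(mem,          nelem, 1);
    cerr << "Unit " << dash::myid()
         << " DASH mem: "  << dur1 << " secs"
         << " Local mem: " << dur2 << " secs" << endl;
  }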

Unit 0 DASH mem: 0.339293 secs Local mem: 0.334556 secs
Unit 0 DASH mem: 0.227669 secs Local mem: 0.228129 secs
Unit 0 DASH mem: 0.227565 secs Local mem: 0.227628 secs
Unit 0 DASH mem: 0.232663 secs Local mem: 0.227732 secs
Unit 0 DASH mem: 0.227623 secs Local mem: 0.227708 secs
Unit 0 DASH mem: 0.227756 secs Local mem: 0.22784 secs
Unit 0 DASH mem: 0.227751 secs Local mem: 0.22782 secs
Unit 0 DASH mem: 0.227683 secs Local mem: 0.228276 secs
Unit 0 DASH mem: 0.227909 secs Local mem: 0.228004 secs
Unit 0 DASH mem: 0.2278 secs Local mem: 0.228169 secs

EDIT: This example is not really representative, as it mostly shows the effect of the first-touch policy. From the second iteration on everything is in the cache, and the stable measurements confirm that there are almost no cache misses.

rkowalewski avatar May 14 '18 15:05 rkowalewski

Shooting from the hip: maybe the effect we're seeing is caused by cache associativity? You're right of course that false sharing only occurs at the edges.

devreal avatar May 14 '18 15:05 devreal

I modified the build again and disabled MPI shared memory windows. These measurements suggest that MPI shared memory windows are anything but performant.

Unit 7 DASH mem: 0.023131 secs Local mem: 0.022855 secs
Unit 8 DASH mem: 0.02312 secs Local mem: 0.022841 secs
Unit 9 DASH mem: 0.023099 secs Local mem: 0.022871 secs
Unit 3 DASH mem: 0.023236 secs Local mem: 0.022923 secs
Unit 2 DASH mem: 0.023372 secs Local mem: 0.0230701 secs
Unit 1 DASH mem: 0.023386 secs Local mem: 0.023077 secs
Unit 0 DASH mem: 0.023544 secs Local mem: 0.0231681 secs
Unit 4 DASH mem: 0.023538 secs Local mem: 0.0231829 secs
Unit 5 DASH mem: 0.023564 secs Local mem: 0.0232041 secs
Unit 6 DASH mem: 0.023673 secs Local mem: 0.0232768 secs
Unit 11 DASH mem: 0.023936 secs Local mem: 0.0234411 secs
Unit 12 DASH mem: 0.023897 secs Local mem: 0.0234079 secs
Unit 10 DASH mem: 0.023958 secs Local mem: 0.0235031 secs
Unit 13 DASH mem: 0.024013 secs Local mem: 0.023515 secs

EDIT: I am not sure how shared memory windows are implemented; however, whether it is POSIX or System V SHMEM, it may cause additional overhead. The "Using Advanced MPI" book by Gropp et al. states as much.

rkowalewski avatar May 14 '18 15:05 rkowalewski

Nice! Maybe we should disable them by default and print a huge warning that the user is about to shoot himself in the foot if he enables them... It is worth reporting to the MPI folks though.

@fuerlinger Does that solve your performance problem in the cowichan problems?

devreal avatar May 14 '18 15:05 devreal

Maybe we should show a bluescreen of death if the user enables shared memory windows?

Actually, before disabling them by default we should run more detailed benchmarks, at least GUPS and STREAM. That should help us better understand the overhead of MPI shared memory.
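For STREAM it would be enough to run a triad-style kernel once over window memory and once over malloc'd memory of the same size (a sketch, reusing the TIMESTAMP macro from the example above; the scalar is arbitrary):

  // sketch of a STREAM-triad kernel; returns the achieved bandwidth in GB/s
  double stream_triad(double *a, const double *b, const double *c,
                      int nelem, double scalar)
  {
    double start, end;
    TIMESTAMP(start);
    for (int i = 0; i < nelem; ++i) {
      a[i] = b[i] + scalar * c[i];
    }
    TIMESTAMP(end);
    // two arrays read, one written: 3 * nelem doubles moved
    return (3.0 * nelem * sizeof(double)) / (end - start) / 1.0e9;
  }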

rkowalewski avatar May 14 '18 15:05 rkowalewski

I agree, I will look into that next week (unless someone else wants to volunteer ^^)

devreal avatar May 14 '18 15:05 devreal

After a short hands-on session with @fuerlinger: disabling shared memory windows and enabling dynamic windows solved the problem. However, let's keep this issue open as a reminder to investigate it in more detail if the schedule permits.

rkowalewski avatar May 16 '18 11:05 rkowalewski

Thanks for checking back on this! I'm very interested in this issue and will look at it in more detail next week to see how big the impact is on our systems.

devreal avatar May 16 '18 11:05 devreal

It's the same behaviour I noticed in issue #520. With shared windows on, the computation on the local elements took way longer.

dhinf avatar May 16 '18 11:05 dhinf

Good point, @dhinf! I didn't see the parallels there.

devreal avatar May 16 '18 11:05 devreal

Indeed, disabling shared windows and enabling dynamic windows is required to make the performance differences go away. I see the effect with all MPI implementations I tested, including Open MPI, MPICH, and Intel MPI.

CFLAGS+=-DDART_MPI_DISABLE_SHARED_WINDOWS
CFLAGS+=-DDART_MPI_ENABLE_DYNAMIC_WINDOWS
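At the MPI level, dynamic windows roughly mean attaching locally allocated memory to a window created with MPI_Win_create_dynamic instead of letting MPI allocate a (possibly shared) segment. A sketch of that pattern, not the actual DART code:

  // sketch: dynamic window with locally malloc'd memory attached to it
  const MPI_Aint nbytes = 1000000 * sizeof(int);

  MPI_Win win;
  MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  int *mem = (int*) malloc(nbytes);
  MPI_Win_attach(win, mem, nbytes);

  // ... local access goes through 'mem', RMA access through the attached range ...

  MPI_Win_detach(win, mem);
  free(mem);
  MPI_Win_free(&win);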

fuerlinger avatar May 16 '18 18:05 fuerlinger

@fuerlinger What happens if you disable dynamic windows as well (i.e., using regular allocated windows)? Does that trigger the performance issue too?

devreal avatar May 17 '18 06:05 devreal

I did some investigation into this on the MPI level and it appears that there are two things at work here:

  1. Initialization latencies: if the memory is initialized with memset prior to the computation (and NUMA effects are avoided, see below), the access latencies are comparable to those of local memory. The initial cost is probably due to mmap-ing the memory into the process address space.
  2. NUMA effects: when running on two sockets, I see performance fluctuate significantly (by a factor of 1.5-10x), even though the memory has been properly initialized.

The current state is that (a) avoiding NUMA effects by using a single socket and (b) properly initializing the memory before the measurements yields comparable performance. That is by no means ideal, and I will go ahead and report this to the Open MPI people to see whether it is a known issue.
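For the example above, the pre-initialization part amounts to touching both buffers once before the timed runs (a sketch; requires <cstring> for memset):

  // sketch: map and place all pages before the measurement to factor out
  // first-touch and mmap costs
  memset(arr.lbegin(), 0, sizeof(int) * nelem);
  memset(mem,          0, sizeof(int) * nelem);

  double dur1 = do_work(arr.lbegin(), nelem, 1);
  double dur2 = do_work(mem,          nelem, 1);

Pinning the run to one socket can then be done with numactl or the binding options of the MPI launcher.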

devreal avatar May 23 '18 08:05 devreal

Short update: after my request on the Open MPI user mailing list I was able to pin the issue down to the way the shared memory backing the window is allocated. On the Bull cluster, /tmp is mounted on a disk partition, and incidentally this is where Open MPI places the shmem backing file in its session directory. Using an alternative session directory in tmpfs or increasing the priority of the POSIX shm implementation solves the problem. I was not able to reproduce the performance problem with Intel MPI on that machine, though; things look good there.

@fuerlinger Can you try running with Open MPI and passing the parameter --mca shmem_posix_priority 100 to mpirun on the system you're running on? I'm not sure what causes the performance problems with Intel MPI / MPICH, though. Maybe there are similar options for Intel MPI (although I believe MPICH at least uses POSIX shmem by default)?
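For reference, the invocation would look along these lines (process count and binary name are placeholders):

mpirun --mca shmem_posix_priority 100 -n 2 ./dash_app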

devreal avatar May 24 '18 12:05 devreal