
Change OpenMP scheduling from `dynamic` to `static`

Open manodeep opened this issue 3 years ago • 12 comments

General information

  • Corrfunc version: 2.4.0
  • platform: linux
  • installation method (pip/source/other?): from source with python -m pip install -e .

Issue description

An OpenMP-enabled install with gcc/9.2.0 produces incorrect results when invoked from Python. This was originally reported in #197 and then rediscovered in #258.

Expected behaviour

The results with OpenMP should be correct when invoked through Python.

Actual behaviour

The pair counts are nthreads times larger when using the Python wrappers. The results are correct when using the C static library directly.

What have you tried so far?

Compiling without OpenMP makes this problem go away.

Changing the OpenMP scheduling to static makes the problem less severe - i.e., the pair counts are off by a small amount rather than by a factor of nthreads. As an added benefit, the code also appears to run marginally faster.
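For reference, the scheduling change I have in mind amounts roughly to the following (a standalone dummy counting loop, not the actual Corrfunc kernel):

#include <stdio.h>

/* Dummy standalone illustration of the proposed schedule change; the real
 * pair-counting loop over cell-pairs lives in the Corrfunc kernels. */
int main(void)
{
    const long N = 1000000;
    long npairs = 0;

    /* Current behaviour: iterations are handed out to threads on demand. */
#pragma omp parallel for schedule(dynamic) reduction(+:npairs)
    for(long i = 0; i < N; i++) {
        npairs++;
    }
    printf("dynamic schedule: npairs = %ld\n", npairs);

    /* Proposed behaviour: iterations are divided among threads up front. */
    npairs = 0;
#pragma omp parallel for schedule(static) reduction(+:npairs)
    for(long i = 0; i < N; i++) {
        npairs++;
    }
    printf("static schedule:  npairs = %ld\n", npairs);
    return 0;
}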

Minimal failing example

After installing with the changes in #258, the following fails:

python -m Corrfunc.tests 


manodeep avatar Nov 17 '21 23:11 manodeep

@lgarrison I was thinking of implementing this change asap - what do you think?

manodeep avatar Nov 17 '21 23:11 manodeep

Not sure... I'm worried that static scheduling will hurt the performance for more clustered cases. I'm also not sure "hiding" the problem with static scheduling is a good idea, even as a stop-gap measure, because we don't really understand all the ways this problem can manifest. I think keeping the problem highly visible and obviously wrong, as it is now, might be preferable.

lgarrison avatar Nov 17 '21 23:11 lgarrison

Yeah, that's the choice - would it be better to be disastrously wrong or just slightly wrong? Disastrously wrong has a greater chance of being noticed, but for anyone who doesn't notice, the results will still be wrong.

I will close this then, and we can detect/solve the race condition as we make progress on #258.

manodeep avatar Nov 17 '21 23:11 manodeep

Speaking of detecting the race condition, maybe it's worth trying the GCC 9.2 thread sanitizer on the Python extension? The memory/address sanitizers were always hard to run because of all the false positives from Python, but perhaps the thread sanitizer will behave better?
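As a sanity check for the sanitizer itself, a deliberately racy standalone example like this (hypothetical, not Corrfunc code) should get flagged when built with something like gcc -fopenmp -fsanitize=thread -g:

#include <stdio.h>

/* Deliberate data race: all threads increment the shared counter without any
 * synchronization, so ThreadSanitizer should report a race here. Note that an
 * uninstrumented libgomp may also produce extra reports of its own. */
int main(void)
{
    long counter = 0;
#pragma omp parallel for
    for(long i = 0; i < 1000000; i++) {
        counter++;    /* unsynchronized read-modify-write on a shared variable */
    }
    printf("counter = %ld (expected 1000000)\n", counter);
    return 0;
}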

lgarrison avatar Nov 17 '21 23:11 lgarrison

Ohh good idea - I will try that later today

manodeep avatar Nov 17 '21 23:11 manodeep

Another idea: maybe worth trying OMP_NESTED=true and OMP_NESTED=false? I don't think we're (intentionally) using any nested parallelism, but if somehow we are, it could mess with the thread IDs...

An even simpler version of this train of thought: can we print the thread IDs and check that they are unique?
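Something like this minimal standalone check is what I have in mind (a sketch only, not the Corrfunc code path):

#include <stdio.h>
#include <omp.h>

/* Print each thread's ID once; with a healthy runtime every ID from
 * 0 to omp_get_num_threads()-1 should appear exactly once. */
int main(void)
{
#pragma omp parallel
    {
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}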

lgarrison avatar Nov 23 '21 23:11 lgarrison

I tried omp_set_nested(0) but that did not change anything. Good idea with printing the thread IDs - let me try that, and I will report back.

manodeep avatar Nov 23 '21 23:11 manodeep

Nothing stands out with the printed thread IDs. However, setting OMP_DISPLAY_ENV=verbose does show two OpenMP environments from Python but only one from C.

Output from the C test (make tests_bin), which passes:
======================================

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '4'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
  GOMP_CPU_AFFINITY = ''
  GOMP_STACKSIZE = '0'
  GOMP_SPINCOUNT = '300000'
OPENMP DISPLAY ENVIRONMENT END

Output from the Python test (python -m Corrfunc.tests), which fails:
=============================================

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '4'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
  GOMP_CPU_AFFINITY = ''
  GOMP_STACKSIZE = '0'
  GOMP_SPINCOUNT = '300000'
OPENMP DISPLAY ENVIRONMENT END


OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201611'
  [host] KMP_ABORT_DELAY='0'
  [host] KMP_ADAPTIVE_LOCK_PROPS='1,1024'
  [host] KMP_ALIGN_ALLOC='64'
  [host] KMP_ALL_THREADPRIVATE='144'
  [host] KMP_ATOMIC_MODE='2'
  [host] KMP_BLOCKTIME='200'
  [host] KMP_CPUINFO_FILE: value is not defined
  [host] KMP_DETERMINISTIC_REDUCTION='FALSE'
  [host] KMP_DEVICE_THREAD_LIMIT='2147483647'
  [host] KMP_DISP_HAND_THREAD='FALSE'
  [host] KMP_DISP_NUM_BUFFERS='7'
  [host] KMP_DUPLICATE_LIB_OK='FALSE'
  [host] KMP_ENABLE_TASK_THROTTLING='TRUE'
  [host] KMP_FORCE_REDUCTION: value is not defined
  [host] KMP_FOREIGN_THREADS_THREADPRIVATE='TRUE'
  [host] KMP_FORKJOIN_BARRIER='2,2'
  [host] KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
  [host] KMP_FORKJOIN_FRAMES='TRUE'
  [host] KMP_FORKJOIN_FRAMES_MODE='3'
  [host] KMP_GTID_MODE='3'
  [host] KMP_HANDLE_SIGNALS='FALSE'
  [host] KMP_HOT_TEAMS_MAX_LEVEL='1'
  [host] KMP_HOT_TEAMS_MODE='0'
  [host] KMP_INIT_AT_FORK='TRUE'
  [host] KMP_ITT_PREPARE_DELAY='0'
  [host] KMP_LIBRARY='throughput'
  [host] KMP_LOCK_KIND='queuing'
  [host] KMP_MALLOC_POOL_INCR='1M'
  [host] KMP_MWAIT_HINTS='0'
  [host] KMP_NUM_LOCKS_IN_BLOCK='1'
  [host] KMP_PLAIN_BARRIER='2,2'
  [host] KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
  [host] KMP_REDUCTION_BARRIER='1,1'
  [host] KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
  [host] KMP_SCHEDULE='static,balanced;guided,iterative'
  [host] KMP_SETTINGS='FALSE'
  [host] KMP_SPIN_BACKOFF_PARAMS='4096,100'
  [host] KMP_STACKOFFSET='64'
  [host] KMP_STACKPAD='0'
  [host] KMP_STACKSIZE='8M'
  [host] KMP_STORAGE_MAP='FALSE'
  [host] KMP_TASKING='2'
  [host] KMP_TASKLOOP_MIN_TASKS='0'
  [host] KMP_TASK_STEALING_CONSTRAINT='1'
  [host] KMP_TEAMS_THREAD_LIMIT='36'
  [host] KMP_TOPOLOGY_METHOD='all'
  [host] KMP_USER_LEVEL_MWAIT='FALSE'
  [host] KMP_USE_YIELD='1'
  [host] KMP_VERSION='FALSE'
  [host] KMP_WARNINGS='TRUE'
  [host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
  [host] OMP_ALLOCATOR='omp_default_mem_alloc'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEBUG='disabled'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_AFFINITY='FALSE'
  [host] OMP_DISPLAY_ENV='VERBOSE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS: value is not defined
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='8M'
  [host] OMP_TARGET_OFFLOAD=DEFAULT
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_TOOL='enabled'
  [host] OMP_TOOL_LIBRARIES: value is not defined
  [host] OMP_WAIT_POLICY='PASSIVE'
  [host] KMP_AFFINITY='noverbose,warnings,respect,granularity=core,none'
OPENMP DISPLAY ENVIRONMENT END

Another weird thing I noticed: even though make tests_bin and python -m Corrfunc.tests use the same number of bins, the total number of cell-pairs created is different -- 359833 cell-pairs when called from Python, and 360051 from C (through make). This could be why static scheduling only fails mildly, rather than spectacularly as with dynamic scheduling.

I have also discovered that there is a race condition with numdone that results in numdone > num_cell_pairs. This race condition persists even if I change the numdone update from atomic to critical.

JEEZ
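
For reference, the two styles of numdone update being compared look roughly like this (a standalone sketch with a stand-in counter, not the actual Corrfunc code path):

#include <stdio.h>

/* With a single, healthy OpenMP runtime, both variants should finish with
 * numdone == num_cell_pairs. */
int main(void)
{
    const long num_cell_pairs = 100000;
    long numdone;

    /* Variant 1: atomic update */
    numdone = 0;
#pragma omp parallel for schedule(dynamic)
    for(long i = 0; i < num_cell_pairs; i++) {
#pragma omp atomic
        numdone++;
    }
    printf("atomic:   numdone = %ld (expected %ld)\n", numdone, num_cell_pairs);

    /* Variant 2: critical section */
    numdone = 0;
#pragma omp parallel for schedule(dynamic)
    for(long i = 0; i < num_cell_pairs; i++) {
#pragma omp critical
        numdone++;
    }
    printf("critical: numdone = %ld (expected %ld)\n", numdone, num_cell_pairs);
    return 0;
}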

manodeep avatar Nov 24 '21 03:11 manodeep

> Another weird thing I noticed: even though make tests_bin and python -m Corrfunc.tests use the same number of bins, the total number of cell-pairs created is different -- 359833 cell-pairs when called from Python, and 360051 from C (through make). This could be why static scheduling only fails mildly, rather than spectacularly as with dynamic scheduling.

Solved this one - the C tests do not set enable_min_sep_opt whereas the Python tests do. When I set options.enable_min_sep_opt = 1 within test_periodic_lin.c, I get the same 359833 cell-pairs.

manodeep avatar Nov 24 '21 04:11 manodeep

I did some further reading (can't find the reference - the name looked familiar, so probably someone in the open-source Python ecosystem), and it is possible that the OpenMP issues occur because OpenMP is linked in when the static libraries are generated (i.e., libcountpairs.a etc.), and those static libraries are then linked again into the dynamic C-extension shared library. It might be worthwhile to create a dynamic library for the Corrfunc pair-counters in the first place (or to use a dynamic library when linking the C extensions).

manodeep avatar Jun 14 '23 12:06 manodeep

Creating shared libraries (instead of static libraries) did not resolve the issue of multiple OpenMP runtime libraries (libgomp from compiling with GCC, and libiomp5 from MKL within numpy). However, setting "export MKL_THREADING_LAYER=GNU" resolved the problem entirely.

Source - https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md (This isn't the article I was referring to in the previous comment)
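
For what it's worth, one rough way to check which runtime an omp_* symbol actually resolves to is something like the following standalone sketch (assumes Linux/glibc dladdr; to diagnose the real problem the same check would need to run from inside the C extension):

#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>
#include <omp.h>

/* Print the shared library that provides omp_get_max_threads in this process,
 * e.g. libgomp.so vs libiomp5.so. Build with: gcc -fopenmp thisfile.c -ldl */
int main(void)
{
    Dl_info info;
    if(dladdr((void *) &omp_get_max_threads, &info) != 0 && info.dli_fname != NULL) {
        printf("omp_get_max_threads resolved from: %s\n", info.dli_fname);
    } else {
        printf("Could not resolve the OpenMP runtime (statically linked?)\n");
    }
    return 0;
}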

manodeep avatar Aug 30 '23 12:08 manodeep

Setting that environment variable with os.environ does not work, though I am not sure I understand why, since the documentation says all child processes should inherit the new environment variable. I added a small snippet of code to DD to test whether the OMP runtime environment was working correctly:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>

#if defined(_OPENMP)
#include <omp.h>

/* Sum 0..N-1 serially and with an OpenMP parallel reduction, and report
 * whether the two results agree. */
int test_omp_duplicate_runtime_lib_DOUBLE(const int64_t N);

int test_omp_duplicate_runtime_lib_DOUBLE(const int64_t N)
{
    int64_t correct_sum = 0;
    for(int64_t i=0;i<N;i++) {
        correct_sum += i;
    }

    int64_t omp_sum = 0;
    /* With duplicate OpenMP runtimes loaded, this reduction gives wrong results
     * under schedule(dynamic) but appears fine under schedule(static). */
#pragma omp parallel for schedule(dynamic) reduction(+:omp_sum)
    for(int64_t i=0;i<N;i++) {
        omp_sum += i;
    }
    if(omp_sum != correct_sum) {
        fprintf(stderr,"Basic OMP reduction is causing errors. Expected sum = %"PRId64" sum with OpenMP + reduction = %"PRId64"\n",
                correct_sum, omp_sum);
        return EXIT_FAILURE;
    }
    fprintf(stderr,"Correct result returned\n");
    return EXIT_SUCCESS;
}
#endif

For the case with duplicate OMP libraries, this test code errors with schedule(dynamic) but seems to work fine with schedule(static).

Separately, setting MKL_THREADING_LAYER=GNU in the terminal before invoking Corrfunc (python call_correlation_functions.py) still seems to produce slightly (but not wildly) incorrect results for the final 2-3 bins.

manodeep avatar Aug 31 '23 23:08 manodeep