CoreNeuron icon indicating copy to clipboard operation
CoreNeuron copied to clipboard

GPU-enabled builds should not call GPU APIs without --gpu

Open olupton opened this issue 3 years ago • 5 comments

#595 closes #345, because GPU-enabled builds using Random123 can now be executed on CPU.

When a GPU-enabled build is executed without --gpu but on a machine that does have a GPU, various data transfers and synchronisation calls are made, which can be seen e.g. with nvprof.

We should fix these and make sure all GPU/OpenACC operations are conditional on nt->compute_gpu.

olupton avatar Jul 23 '21 10:07 olupton

A more problematic aspect of this is that __managed__ is not safe to use on nodes without GPUs, as demonstrated by

#include <iostream>
__managed__ int x = 42;
int main() {
  std::cout << "x = " << x << std::endl;
  return 0;
}

which will segfault on a non-GPU node and print 42 on a node with an NVIDIA GPU. The real-world example of this is

(gdb) bt
#0  0x0000000000570a65 in coreneuron::nrnran123_set_globalindex(unsigned int) ()
#1  0x0000000000496a8d in coreneuron::set_globals (path=<optimized out>, cli_global_seed=0 '\000', cli_global_seed_value=-1) at ../external/coreneuron/coreneuron/io/global_vars.cpp:134
#2  0x0000000000487c2a in coreneuron::nrn_init_and_load_data (argc=<optimized out>, argv=<optimized out>, checkPoints=0x7fffffffd7c0, is_mapping_needed=0 '\000', run_setup_cleanup=1 '\001')
    at ../external/coreneuron/coreneuron/apps/main1.cpp:210
#3  0x000000000046e68c in run_solve_core (argc=<optimized out>, argv=<optimized out>) at ../external/coreneuron/coreneuron/apps/main1.cpp:496
#4  0x000000000040eac0 in solve_core (argc=3, argv=0x7fffffffda58) at /gpfs/bbp.cscs.ch/home/olupton/nrn/build_ollimod2c/share/coreneuron/enginemech.cpp:49
#5  0x000000000040ea99 in main (argc=10867584, argv=0x57f240 <cudart::createGlobalState()>) at /gpfs/bbp.cscs.ch/home/olupton/nrn/build_ollimod2c/share/coreneuron/coreneuron.cpp:14

so a follow-up to https://github.com/BlueBrain/CoreNeuron/pull/595 will be needed to enable non-GPU execution in GPU builds on nodes without GPUs.

olupton avatar Jul 28 '21 15:07 olupton

Checked right now, all #pragma acc that support if clause (following specification of openacc) have one.

alkino avatar Oct 13 '21 14:10 alkino

Just to comment: one can check where GPU calls still happening using NVCOMPILER_ACC_TIME=1 i.e. in the absence of --gpu argument we don't expect any GPU activity. For example, running test/coreneuron/test_direct.py without GPU gives:

NVCOMPILER_ACC_TIME=1 CORENRN_ENABLE_GPU=0 ./x86_64/special -python test/coreneuron/test_direct.py
INFO : Using neuron-gpu-nightly Package (Alpha Developer Version)

NEURON -- VERSION 8.0a-721-g501beb599 epic/gpu_wheel (501beb599) 2021-10-22
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2021
See http://neuron.yale.edu/neuron/credits

Additional mechanisms from files
 "test/coreneuron/mod//fornetcon.mod" "test/coreneuron/mod//invlfire.mod" "test/coreneuron/mod//netmove.mod" "test/coreneuron/mod//sample.mod" "test/coreneuron/mod//watchrange.mod"
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
CUPTI ERROR: cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL) returned: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES,
	 at ../../src-cupti/prof_cuda_cupti.c:297.
in test_hoc_event() at t=0
in test_hoc_event() at t=1
in test_hoc_event() at t=2
in test_hoc_event() at t=0
WARNING: CVode.event(...) for delivery at time step nearest 0.99375 discarded. CoreNEURON cannot presently handle interpreter events (rank 0, thread 0).

Accelerator Kernel Timing data
/home/kumbhar/nrn/x86_64/corenrn/mod2c/netmove.cpp
  _ZN54_INTERNAL_32_x86_64_corenrn_mod2c_netmove_cpp_39acfb1d10coreneuron10_initlistsEv  NVIDIA  devicenum=0
    time(us): 15
    678: data region reached 2 times
        678: data copyin transfers: 4
             device time(us): total=15 max=5 min=3 avg=3
/home/kumbhar/nrn/x86_64/corenrn/mod2c/hh.cpp
  _ZN49_INTERNAL_27_x86_64_corenrn_mod2c_hh_cpp_4898369610coreneuron19_acc_globals_updateEv  NVIDIA  devicenum=0
    time(us): 85
    192: update directive reached 13 times
        192: data copyin transfers: 13
             device time(us): total=85 max=9 min=6 avg=6
/home/kumbhar/nrn/x86_64/corenrn/mod2c/hh.cpp
  _ZN49_INTERNAL_27_x86_64_corenrn_mod2c_hh_cpp_4898369610coreneuron10_initlistsEv  NVIDIA  devicenum=0
    time(us): 17
    618: data region reached 2 times
        618: data copyin transfers: 4
             device time(us): total=17 max=5 min=4 avg=4
/home/kumbhar/nrn/x86_64/corenrn/mod2c/expsyn.cpp
  _ZN53_INTERNAL_31_x86_64_corenrn_mod2c_expsyn_cpp_0127273610coreneuron10_initlistsEv  NVIDIA  devicenum=0
    time(us): 15
    606: data region reached 2 times
        606: data copyin transfers: 4
             device time(us): total=15 max=5 min=3 avg=3
/home/kumbhar/nrn/x86_64/corenrn/mod2c/exp2syn.cpp
  _ZN54_INTERNAL_32_x86_64_corenrn_mod2c_exp2syn_cpp_25c7f1d810coreneuron10_initlistsEv  NVIDIA  devicenum=0
    time(us): 91
    639: data region reached 2 times
        639: data copyin transfers: 4
             device time(us): total=18 max=6 min=4 avg=4
    639: upload reached 16 times
        639: data copyin transfers: 16
             device time(us): total=73 max=12 min=4 avg=4

We shouldn't have Accelerator Kernel Timing data. So we should check for missing if clauses.

pramodk avatar Oct 22 '21 06:10 pramodk

A more problematic aspect of this is that managed is not safe to use on nodes without GPUs, as demonstrated by so a follow-up to #595 will be needed to enable non-GPU execution in GPU builds on nodes without GPUs.

@olupton : do we have any usage of __managed__ now? i.e. is there anything still needed for the above?

kumbhar@bbd-fp8lt73:~/nrn$ grep -r "__managed__" external/coreneuron/*
external/coreneuron/coreneuron/utils/randoms/nrnran123.cu:// __managed__ here, but unfortunately that does not work on machines that do

pramodk avatar Oct 22 '21 06:10 pramodk

I think we removed all usage of __managed__.

When I last checked, it seemed that the cases left are:

  • #pragma acc wait, which doesn't support an if clause (but we can just do if(compute_gpu) { _Pragma("acc wait") })
  • Clauses in initialisation methods and for global variables, where there is no NrnThread::compute_gpu available

(I did not check that carefully, so there may be other cases too)

olupton avatar Oct 22 '21 07:10 olupton

I think https://github.com/BlueBrain/CoreNeuron/pull/795 et al. improve the situation here, I just checked in an NMODL + OpenACC build and did not see any GPU activity in non-GPU tests.

olupton avatar Aug 25 '22 07:08 olupton

I see the same in a MOD2C build with those branches, so I will mark this issue as closed by #795.

olupton avatar Aug 25 '22 08:08 olupton