CoreNeuron
CoreNeuron copied to clipboard
GPU-enabled builds should not call GPU APIs without --gpu
#595 closes #345, because GPU-enabled builds using Random123 can now be executed on CPU.
When a GPU-enabled build is executed without --gpu
but on a machine that does have a GPU, various data transfers and synchronisation calls are made, which can be seen e.g. with nvprof
.
We should fix these and make sure all GPU/OpenACC operations are conditional on nt->compute_gpu
.
A more problematic aspect of this is that __managed__
is not safe to use on nodes without GPUs, as demonstrated by
#include <iostream>
__managed__ int x = 42;
int main() {
std::cout << "x = " << x << std::endl;
return 0;
}
which will segfault on a non-GPU node and print 42 on a node with an NVIDIA GPU. The real-world example of this is
(gdb) bt
#0 0x0000000000570a65 in coreneuron::nrnran123_set_globalindex(unsigned int) ()
#1 0x0000000000496a8d in coreneuron::set_globals (path=<optimized out>, cli_global_seed=0 '\000', cli_global_seed_value=-1) at ../external/coreneuron/coreneuron/io/global_vars.cpp:134
#2 0x0000000000487c2a in coreneuron::nrn_init_and_load_data (argc=<optimized out>, argv=<optimized out>, checkPoints=0x7fffffffd7c0, is_mapping_needed=0 '\000', run_setup_cleanup=1 '\001')
at ../external/coreneuron/coreneuron/apps/main1.cpp:210
#3 0x000000000046e68c in run_solve_core (argc=<optimized out>, argv=<optimized out>) at ../external/coreneuron/coreneuron/apps/main1.cpp:496
#4 0x000000000040eac0 in solve_core (argc=3, argv=0x7fffffffda58) at /gpfs/bbp.cscs.ch/home/olupton/nrn/build_ollimod2c/share/coreneuron/enginemech.cpp:49
#5 0x000000000040ea99 in main (argc=10867584, argv=0x57f240 <cudart::createGlobalState()>) at /gpfs/bbp.cscs.ch/home/olupton/nrn/build_ollimod2c/share/coreneuron/coreneuron.cpp:14
so a follow-up to https://github.com/BlueBrain/CoreNeuron/pull/595 will be needed to enable non-GPU execution in GPU builds on nodes without GPUs.
Checked right now, all #pragma acc
that support if
clause (following specification of openacc) have one.
Just to comment: one can check where GPU calls still happening using NVCOMPILER_ACC_TIME=1
i.e. in the absence of --gpu
argument we don't expect any GPU activity. For example, running test/coreneuron/test_direct.py
without GPU gives:
NVCOMPILER_ACC_TIME=1 CORENRN_ENABLE_GPU=0 ./x86_64/special -python test/coreneuron/test_direct.py
INFO : Using neuron-gpu-nightly Package (Alpha Developer Version)
NEURON -- VERSION 8.0a-721-g501beb599 epic/gpu_wheel (501beb599) 2021-10-22
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2021
See http://neuron.yale.edu/neuron/credits
Additional mechanisms from files
"test/coreneuron/mod//fornetcon.mod" "test/coreneuron/mod//invlfire.mod" "test/coreneuron/mod//netmove.mod" "test/coreneuron/mod//sample.mod" "test/coreneuron/mod//watchrange.mod"
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
CUPTI ERROR: cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL) returned: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES,
at ../../src-cupti/prof_cuda_cupti.c:297.
in test_hoc_event() at t=0
in test_hoc_event() at t=1
in test_hoc_event() at t=2
in test_hoc_event() at t=0
WARNING: CVode.event(...) for delivery at time step nearest 0.99375 discarded. CoreNEURON cannot presently handle interpreter events (rank 0, thread 0).
Accelerator Kernel Timing data
/home/kumbhar/nrn/x86_64/corenrn/mod2c/netmove.cpp
_ZN54_INTERNAL_32_x86_64_corenrn_mod2c_netmove_cpp_39acfb1d10coreneuron10_initlistsEv NVIDIA devicenum=0
time(us): 15
678: data region reached 2 times
678: data copyin transfers: 4
device time(us): total=15 max=5 min=3 avg=3
/home/kumbhar/nrn/x86_64/corenrn/mod2c/hh.cpp
_ZN49_INTERNAL_27_x86_64_corenrn_mod2c_hh_cpp_4898369610coreneuron19_acc_globals_updateEv NVIDIA devicenum=0
time(us): 85
192: update directive reached 13 times
192: data copyin transfers: 13
device time(us): total=85 max=9 min=6 avg=6
/home/kumbhar/nrn/x86_64/corenrn/mod2c/hh.cpp
_ZN49_INTERNAL_27_x86_64_corenrn_mod2c_hh_cpp_4898369610coreneuron10_initlistsEv NVIDIA devicenum=0
time(us): 17
618: data region reached 2 times
618: data copyin transfers: 4
device time(us): total=17 max=5 min=4 avg=4
/home/kumbhar/nrn/x86_64/corenrn/mod2c/expsyn.cpp
_ZN53_INTERNAL_31_x86_64_corenrn_mod2c_expsyn_cpp_0127273610coreneuron10_initlistsEv NVIDIA devicenum=0
time(us): 15
606: data region reached 2 times
606: data copyin transfers: 4
device time(us): total=15 max=5 min=3 avg=3
/home/kumbhar/nrn/x86_64/corenrn/mod2c/exp2syn.cpp
_ZN54_INTERNAL_32_x86_64_corenrn_mod2c_exp2syn_cpp_25c7f1d810coreneuron10_initlistsEv NVIDIA devicenum=0
time(us): 91
639: data region reached 2 times
639: data copyin transfers: 4
device time(us): total=18 max=6 min=4 avg=4
639: upload reached 16 times
639: data copyin transfers: 16
device time(us): total=73 max=12 min=4 avg=4
We shouldn't have Accelerator Kernel Timing data
. So we should check for missing if
clauses.
A more problematic aspect of this is that managed is not safe to use on nodes without GPUs, as demonstrated by so a follow-up to #595 will be needed to enable non-GPU execution in GPU builds on nodes without GPUs.
@olupton : do we have any usage of __managed__
now? i.e. is there anything still needed for the above?
kumbhar@bbd-fp8lt73:~/nrn$ grep -r "__managed__" external/coreneuron/*
external/coreneuron/coreneuron/utils/randoms/nrnran123.cu:// __managed__ here, but unfortunately that does not work on machines that do
I think we removed all usage of __managed__
.
When I last checked, it seemed that the cases left are:
-
#pragma acc wait
, which doesn't support anif
clause (but we can just doif(compute_gpu) { _Pragma("acc wait") }
) - Clauses in initialisation methods and for global variables, where there is no
NrnThread::compute_gpu
available
(I did not check that carefully, so there may be other cases too)
I think https://github.com/BlueBrain/CoreNeuron/pull/795 et al. improve the situation here, I just checked in an NMODL + OpenACC build and did not see any GPU activity in non-GPU tests.
I see the same in a MOD2C build with those branches, so I will mark this issue as closed by #795.