libCEED
libCEED copied to clipboard
OCCA SEGV when passing ROCm device memory
@dmed256 Perhaps you can identify what's going wrong here. I pass a device pointer here (and it works with all the other /gpu/hip/* backends). The self->modeMemory->dtype_ is NULL so we get a SEGV here. I'm using occa-1.1.0 and we're on ROCm-4.0, though this looks to all be plain C++ code.
Thread 1 "petsc-bps" received signal SIGSEGV, Segmentation fault.
0x00007fffecee20e0 in occa::dtype_t::self (this=0x0) at /projects/occa/include/occa/dtype/dtype.hpp:52
52 return ref ? *ref : *this;
(gdb) bt
#0 0x00007fffecee20e0 in occa::dtype_t::self (this=0x0) at /projects/occa/include/occa/dtype/dtype.hpp:52
#1 0x00007fffecf031aa in occa::dtype_t::bytes (this=0x0) at /projects/occa/src/dtype/dtype.cpp:91
#2 0x00007fffecedf057 in occa::memory::slice (this=0x7fffffffbc30, offset=0, count=-1) at /projects/occa/src/core/memory.cpp:385
#3 0x00007fffecee1b85 in occa::memory::as (this=0x7fffffffbc30, dtype_=...) at /projects/occa/src/core/memory.cpp:536
#4 0x00007ffff4be413a in ceed::occa::arrayToMemory<double> (array=array@entry=0x7ffeb080f000) at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.hpp:29
#5 0x00007ffff4be2e88 in ceed::occa::Vector::useArrayPointer (this=0x555555fc9ac0, mtype=<optimized out>, array=array@entry=0x7ffeb080f000)
at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.cpp:199
#6 0x00007ffff4be356c in ceed::occa::Vector::setArray (this=0x555555fc9ac0, mtype=mtype@entry=CEED_MEM_DEVICE, cmode=cmode@entry=CEED_USE_POINTER, array=array@entry=0x7ffeb080f000)
at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.cpp:119
#7 0x00007ffff4be35ce in ceed::occa::Vector::ceedSetArray (vec=<optimized out>, mtype=CEED_MEM_DEVICE, cmode=CEED_USE_POINTER, array=0x7ffeb080f000)
at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.cpp:366
#8 0x00007ffff4bb23b3 in CeedVectorSetArray (vec=0x555555fee7f0, mtype=<optimized out>, cmode=cmode@entry=CEED_USE_POINTER, array=<optimized out>) at /home/jeka2967/libceed/interface/ceed-vector.c:200
#9 0x000055555555c436 in RunWithDM (ceedresource=<optimized out>, dm=0x5555559ca330, rp=0x5555558ce510) at /home/jeka2967/libceed/examples/petsc/bps.c:196
#10 Run (rp=0x5555558ce510, num_resources=1, ceedresources=ceedresources@entry=0x7fffffffc120, num_bpchoices=<optimized out>, bpchoices=bpchoices@entry=0x7fffffffc020)
at /home/jeka2967/libceed/examples/petsc/bps.c:421
#11 0x0000555555560009 in main (argc=<optimized out>, argv=<optimized out>) at /home/jeka2967/libceed/examples/petsc/bps.c:588
(gdb) up
#1 0x00007fffecf031aa in occa::dtype_t::bytes (this=0x0) at /projects/occa/src/dtype/dtype.cpp:91
91 return self().bytes_;
(gdb)
#2 0x00007fffecedf057 in occa::memory::slice (this=0x7fffffffbc30, offset=0, count=-1) at /projects/occa/src/core/memory.cpp:385
385 const int dtypeSize = modeMemory->dtype_->bytes();
(gdb) p modeMemory->dtype_
$1 = (const occa::dtype_t *) 0x0
(gdb) up
#3 0x00007fffecee1b85 in occa::memory::as (this=0x7fffffffbc30, dtype_=...) at /projects/occa/src/core/memory.cpp:536
536 occa::memory mem = slice(0);
(gdb)
#4 0x00007ffff4be413a in ceed::occa::arrayToMemory<double> (array=array@entry=0x7ffeb080f000) at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.hpp:29
29 return mem.as(::occa::dtype::get<TM>());
(gdb) l
24 namespace occa {
25 template <class TM>
26 ::occa::memory arrayToMemory(const TM *array) {
27 if (array) {
28 ::occa::memory mem((::occa::modeMemory_t*) array);
29 return mem.as(::occa::dtype::get<TM>());
30 }
31 return ::occa::null;
32 }
33
I guess I'm confused about how the cast on line 28 is supposed to work. This array is just a pointer to (floating point) data on the device so I don't follow how it can be cast to modeMemory_t* and then dereference the dtype_ field.
I think this is the same issue that we have in MFEM trying to use OCCA's HIP backend from libCEED. For example, running Example 1 with
./ex1 -pa -d ceed-hip:/gpu/hip/occa
produces a segfault with the following backtrace:
#0 0x00002aaaabc9d6ce in occa::modeMemory_t::addMemoryRef(occa::memory*) ()
from /<deleted-path>/occa/lib/libocca.so
#1 0x00002aaaac0cb993 in ceed::occa::Vector::useArrayPointer(CeedMemType, double*) ()
from /<deleted-path>/libceed/lib/libceed.so
#2 0x00002aaaac0cc3c8 in ceed::occa::Vector::setArray(CeedMemType, CeedCopyMode, double*) ()
from /<deleted-path>/libceed/lib/libceed.so
#3 0x00002aaaac0cc458 in ceed::occa::Vector::ceedSetArray(CeedVector_private*, CeedMemType, CeedCopyMode, double*) ()
from /<deleted-path>/libceed/lib/libceed.so
#4 0x00002aaaac08dd50 in CeedVectorSetArray ()
from /<deleted-path>/libceed/lib/libceed.so
#5 0x0000000000782dec in mfem::InitCeedVector(mfem::Vector const&, CeedVector_private*&) ()
#6 0x000000000078315e in mfem::CeedPAAssemble(mfem::CeedPAOperator const&, mfem::CeedData&) ()
#7 0x00000000009df141 in mfem::CeedPADiffusionAssemble(mfem::FiniteElementSpace const&, mfem::IntegrationRule const&, mfem::CeedData&) ()
#8 0x000000000070ddb0 in mfem::PABilinearFormExtension::Assemble() ()
#9 0x000000000040c4b9 in main ()
I think we discussed this briefly with @dmed256 before the MFEM v4.2 release (that's when I first noticed this issue) but we did not pursue it further at the time. Basically, I think that ceed::occa::Vector::ceedSetArray does not work correctly when the CeedCopyMode is CEED_USE_POINTER. I think the external double* pointer needs to be wrapped as occa::memory using the backend-specific methods like occa::cuda::wrapMemory.
I misunderstood what the getter and setter methods were for. I thought it was a way to get a handle to the data for the given backend (e.g. occa::modeMemory_t*) and not the underlying backend (e.g. CUDA pointer).
@jedbrown can you confirm we want to pass/retrieve C/C++/CUDA/HIP native pointers and not backend-specific pointers for these methods:
ceedSetArray
ceedGetArray
Yup, that's what we want. The use case in current 'main' is that we're getting the array (as a host, CUDA, or ROCm device pointer) from PETSc and having the CeedVector wrap it for application of an operator. I believe the implementations in CeedVectorSetArrayDevice_{Cuda,Hip} are correct (and they work in all cases we've considered). We could consider something different, but this mode is useful and flexible.
Cool, thank you for clarifying. Making it easy to wrap a native pointer has been a feature I've wanted to add so it helps everyone :) (:bird: :bird: :curling_stone:)
Great, I hope this isn't too hard and we'll look forward to activating this usage with the OCCA backends.
I think this isn't fixed yet since libCEED will need an update, though it should be simple.
Oh weird, I think Github auto-closed this ticket when I merged the PR

Yeah, that's clearly what happened. I wonder if it would have done that if you didn't have write permission to this repo.
Are you able to update backends/occa/ or would you like us to give it a try?
I started #688 but will need to re-tag the OCCA versions along with the PR changes. I'll tag 1.1.1 after these tests work on a fixed main commit.
Actually, I'm not sure where I can set the version/commit :thinking:
For testing, it's set in .gitlab-ci.yml
For documentation it should be on the README, I think
Thank you!
I have not checked, but I imagine this is fixed now