libCEED icon indicating copy to clipboard operation
libCEED copied to clipboard

OCCA SEGV when passing ROCm device memory

Open jedbrown opened this issue 4 years ago • 13 comments

@dmed256 Perhaps you can identify what's going wrong here. I pass a device pointer here (and it works with all the other /gpu/hip/* backends). The self->modeMemory->dtype_ is NULL so we get a SEGV here. I'm using occa-1.1.0 and we're on ROCm-4.0, though this looks to all be plain C++ code.

Thread 1 "petsc-bps" received signal SIGSEGV, Segmentation fault.
0x00007fffecee20e0 in occa::dtype_t::self (this=0x0) at /projects/occa/include/occa/dtype/dtype.hpp:52
52            return ref ? *ref : *this;
(gdb) bt
#0  0x00007fffecee20e0 in occa::dtype_t::self (this=0x0) at /projects/occa/include/occa/dtype/dtype.hpp:52
#1  0x00007fffecf031aa in occa::dtype_t::bytes (this=0x0) at /projects/occa/src/dtype/dtype.cpp:91
#2  0x00007fffecedf057 in occa::memory::slice (this=0x7fffffffbc30, offset=0, count=-1) at /projects/occa/src/core/memory.cpp:385
#3  0x00007fffecee1b85 in occa::memory::as (this=0x7fffffffbc30, dtype_=...) at /projects/occa/src/core/memory.cpp:536
#4  0x00007ffff4be413a in ceed::occa::arrayToMemory<double> (array=array@entry=0x7ffeb080f000) at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.hpp:29
#5  0x00007ffff4be2e88 in ceed::occa::Vector::useArrayPointer (this=0x555555fc9ac0, mtype=<optimized out>, array=array@entry=0x7ffeb080f000)
    at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.cpp:199
#6  0x00007ffff4be356c in ceed::occa::Vector::setArray (this=0x555555fc9ac0, mtype=mtype@entry=CEED_MEM_DEVICE, cmode=cmode@entry=CEED_USE_POINTER, array=array@entry=0x7ffeb080f000)
    at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.cpp:119
#7  0x00007ffff4be35ce in ceed::occa::Vector::ceedSetArray (vec=<optimized out>, mtype=CEED_MEM_DEVICE, cmode=CEED_USE_POINTER, array=0x7ffeb080f000)
    at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.cpp:366
#8  0x00007ffff4bb23b3 in CeedVectorSetArray (vec=0x555555fee7f0, mtype=<optimized out>, cmode=cmode@entry=CEED_USE_POINTER, array=<optimized out>) at /home/jeka2967/libceed/interface/ceed-vector.c:200
#9  0x000055555555c436 in RunWithDM (ceedresource=<optimized out>, dm=0x5555559ca330, rp=0x5555558ce510) at /home/jeka2967/libceed/examples/petsc/bps.c:196
#10 Run (rp=0x5555558ce510, num_resources=1, ceedresources=ceedresources@entry=0x7fffffffc120, num_bpchoices=<optimized out>, bpchoices=bpchoices@entry=0x7fffffffc020)
    at /home/jeka2967/libceed/examples/petsc/bps.c:421
#11 0x0000555555560009 in main (argc=<optimized out>, argv=<optimized out>) at /home/jeka2967/libceed/examples/petsc/bps.c:588
(gdb) up
#1  0x00007fffecf031aa in occa::dtype_t::bytes (this=0x0) at /projects/occa/src/dtype/dtype.cpp:91
91          return self().bytes_;
(gdb)
#2  0x00007fffecedf057 in occa::memory::slice (this=0x7fffffffbc30, offset=0, count=-1) at /projects/occa/src/core/memory.cpp:385
385         const int dtypeSize = modeMemory->dtype_->bytes();
(gdb) p modeMemory->dtype_
$1 = (const occa::dtype_t *) 0x0
(gdb) up
#3  0x00007fffecee1b85 in occa::memory::as (this=0x7fffffffbc30, dtype_=...) at /projects/occa/src/core/memory.cpp:536
536         occa::memory mem = slice(0);
(gdb)
#4  0x00007ffff4be413a in ceed::occa::arrayToMemory<double> (array=array@entry=0x7ffeb080f000) at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.hpp:29
29              return mem.as(::occa::dtype::get<TM>());
(gdb) l
24        namespace occa {
25          template <class TM>
26          ::occa::memory arrayToMemory(const TM *array) {
27            if (array) {
28              ::occa::memory mem((::occa::modeMemory_t*) array);
29              return mem.as(::occa::dtype::get<TM>());
30            }
31            return ::occa::null;
32          }
33

I guess I'm confused about how the cast on line 28 is supposed to work. This array is just a pointer to (floating point) data on the device so I don't follow how it can be cast to modeMemory_t* and then dereference the dtype_ field.

jedbrown avatar Dec 23 '20 16:12 jedbrown

I think this is the same issue that we have in MFEM trying to use OCCA's HIP backend from libCEED. For example, running Example 1 with

./ex1 -pa -d ceed-hip:/gpu/hip/occa

produces a segfault with the following backtrace:

#0  0x00002aaaabc9d6ce in occa::modeMemory_t::addMemoryRef(occa::memory*) ()
   from /<deleted-path>/occa/lib/libocca.so
#1  0x00002aaaac0cb993 in ceed::occa::Vector::useArrayPointer(CeedMemType, double*) ()
   from /<deleted-path>/libceed/lib/libceed.so
#2  0x00002aaaac0cc3c8 in ceed::occa::Vector::setArray(CeedMemType, CeedCopyMode, double*) ()
   from /<deleted-path>/libceed/lib/libceed.so
#3  0x00002aaaac0cc458 in ceed::occa::Vector::ceedSetArray(CeedVector_private*, CeedMemType, CeedCopyMode, double*) ()
   from /<deleted-path>/libceed/lib/libceed.so
#4  0x00002aaaac08dd50 in CeedVectorSetArray ()
   from /<deleted-path>/libceed/lib/libceed.so
#5  0x0000000000782dec in mfem::InitCeedVector(mfem::Vector const&, CeedVector_private*&) ()
#6  0x000000000078315e in mfem::CeedPAAssemble(mfem::CeedPAOperator const&, mfem::CeedData&) ()
#7  0x00000000009df141 in mfem::CeedPADiffusionAssemble(mfem::FiniteElementSpace const&, mfem::IntegrationRule const&, mfem::CeedData&) ()
#8  0x000000000070ddb0 in mfem::PABilinearFormExtension::Assemble() ()
#9  0x000000000040c4b9 in main ()

I think we discussed this briefly with @dmed256 before the MFEM v4.2 release (that's when I first noticed this issue) but we did not pursue it further at the time. Basically, I think that ceed::occa::Vector::ceedSetArray does not work correctly when the CeedCopyMode is CEED_USE_POINTER. I think the external double* pointer needs to be wrapped as occa::memory using the backend-specific methods like occa::cuda::wrapMemory.

v-dobrev avatar Dec 24 '20 22:12 v-dobrev

I misunderstood what the getter and setter methods were for. I thought it was a way to get a handle to the data for the given backend (e.g. occa::modeMemory_t*) and not the underlying backend (e.g. CUDA pointer).

@jedbrown can you confirm we want to pass/retrieve C/C++/CUDA/HIP native pointers and not backend-specific pointers for these methods:

ceedSetArray
ceedGetArray

dmed256 avatar Dec 29 '20 16:12 dmed256

Yup, that's what we want. The use case in current 'main' is that we're getting the array (as a host, CUDA, or ROCm device pointer) from PETSc and having the CeedVector wrap it for application of an operator. I believe the implementations in CeedVectorSetArrayDevice_{Cuda,Hip} are correct (and they work in all cases we've considered). We could consider something different, but this mode is useful and flexible.

jedbrown avatar Dec 29 '20 16:12 jedbrown

Cool, thank you for clarifying. Making it easy to wrap a native pointer has been a feature I've wanted to add so it helps everyone :) (:bird: :bird: :curling_stone:)

dmed256 avatar Dec 29 '20 18:12 dmed256

Great, I hope this isn't too hard and we'll look forward to activating this usage with the OCCA backends.

jedbrown avatar Dec 29 '20 22:12 jedbrown

I think this isn't fixed yet since libCEED will need an update, though it should be simple.

jedbrown avatar Jan 02 '21 01:01 jedbrown

Oh weird, I think Github auto-closed this ticket when I merged the PR

image

dmed256 avatar Jan 02 '21 02:01 dmed256

Yeah, that's clearly what happened. I wonder if it would have done that if you didn't have write permission to this repo.

Are you able to update backends/occa/ or would you like us to give it a try?

jedbrown avatar Jan 02 '21 02:01 jedbrown

I started #688 but will need to re-tag the OCCA versions along with the PR changes. I'll tag 1.1.1 after these tests work on a fixed main commit.

dmed256 avatar Jan 02 '21 03:01 dmed256

Actually, I'm not sure where I can set the version/commit :thinking:

dmed256 avatar Jan 02 '21 03:01 dmed256

For testing, it's set in .gitlab-ci.yml

For documentation it should be on the README, I think

jeremylt avatar Jan 02 '21 03:01 jeremylt

Thank you!

dmed256 avatar Jan 02 '21 03:01 dmed256

I have not checked, but I imagine this is fixed now

jeremylt avatar Oct 13 '22 17:10 jeremylt