dash::copy not working between containers in different teams

Open knuedd opened this issue 7 years ago • 4 comments

dash::copy segfaults, in both global-to-global and global-to-local mode, when copying between containers that are associated with different teams.

An example that reproduces this is in dash-apps, at multigrid/multigrid3d_elastic.cpp. It currently still needs the feat-halo branch.

We talked about this at the project meeting last week. If you need more details, I'll be happy to provide them.

Thanks, Andreas

knuedd avatar Oct 16 '17 11:10 knuedd

Andreas,

Thanks for opening a ticket, that helps track the issue. It's still not clear what is going wrong here. Before I start debugging this, do you happen to have a stack trace at hand?

devreal avatar Oct 16 '17 11:10 devreal

==== backtrace ====
 2 0x00000000000575cc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u7-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.8.0-gcc-OFED-3.18-redhat6.7-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:641
 3 0x000000000005773c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u7-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.8.0-gcc-OFED-3.18-redhat6.7-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:616
 4 0x0000003afca32510 killpg()  ??:0
 5 0x0000003afca89782 memcpy()  ??:0
 6 0x000000000041262f _ZN4dash4copyIdNS_8GlobIterIdNS_12BlockPatternILi3ELNS_10MemArrangeE1ElEENS_13GlobStaticMemIdNS_9allocator18SymmetricAllocatorIdEEEENS_7GlobPtrIdS9_EENS_7GlobRefIdEEEEEEPT_T0_SH_SG_()  /sw/taurus/libraries/dash/dash-feat-halo_14-09-2017/include/dash/algorithm/Copy.h:878
 7 0x000000000040af83 _Z15transfertofewerR5LevelS0_()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:611
 8 0x000000000040c358 _Z7v_cycleN9__gnu_cxx17__normal_iteratorIPKP5LevelSt6vectorIS2_SaIS2_EEEES8_jd()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:852
 9 0x000000000040c79b _Z7v_cycleN9__gnu_cxx17__normal_iteratorIPKP5LevelSt6vectorIS2_SaIS2_EEEES8_jd()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:903
10 0x000000000040c79b _Z7v_cycleN9__gnu_cxx17__normal_iteratorIPKP5LevelSt6vectorIS2_SaIS2_EEEES8_jd()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:903
11 0x000000000040c79b _Z7v_cycleN9__gnu_cxx17__normal_iteratorIPKP5LevelSt6vectorIS2_SaIS2_EEEES8_jd()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:903
12 0x000000000040da6b main()  /home/knuepfe/prog/dash-apps/multigrid/multigrid3d_elastic.cpp:1158
13 0x0000003afca1ed1d __libc_start_main()  ??:0
14 0x0000000000407771 _start()  ??:0
===================

knuedd avatar Oct 16 '17 12:10 knuedd

This appears to be a bug somewhere in the pattern code. Here is what I have so far:

dash::copy first assumes that the copy is all local because the range returned by dash::local_index_range(in_first, in_last) has the length of the total_copy_elem. However, the call to in_first.local() returns nullptr because _pattern->local(idx) claims that the values are located on another unit.
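
To make the mismatch concrete, here is a minimal, self-contained sketch of the two checks disagreeing. It does not use DASH: MockGlobIter, MockIndexRange, CopyPath and copy_checked are hypothetical stand-ins, and the guard is only an assumption about how the fast path could be protected, not the actual logic in Copy.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a global iterator; NOT the DASH GlobIter API.
// local() mimics a call that yields nullptr when the pattern reports
// the element as living on another unit.
struct MockGlobIter {
    double* local_ptr;
    double* local() const { return local_ptr; }
};

// Hypothetical stand-in for the result of a local index range query.
struct MockIndexRange {
    std::size_t begin;
    std::size_t end;
};

enum class CopyPath { LocalMemcpy, NeedsRemoteGet };

// Takes the memcpy fast path only if BOTH checks agree that the data is
// local: the local range spans all elements and local() is non-null.
CopyPath copy_checked(const MockGlobIter& in_first,
                      const MockIndexRange& local_range,
                      std::size_t total_copy_elem,
                      double* out_first) {
    assert(out_first != nullptr);
    const bool range_says_local =
        (local_range.end - local_range.begin) == total_copy_elem;
    const double* src = in_first.local();
    if (range_says_local && src != nullptr) {
        std::memcpy(out_first, src, total_copy_elem * sizeof(double));
        return CopyPath::LocalMemcpy;
    }
    // The two checks disagree (or the data really is remote): do not
    // dereference a null pointer; fall back to a remote transfer instead.
    return CopyPath::NeedsRemoteGet;
}
```

In this toy model, the situation described above corresponds to a range of full length combined with a null local() result. Without the extra null check, memcpy would read from nullptr, which matches the crash in frame 5 of the backtrace.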

I'm afraid that unless I spend a significant amount of time paging through the pattern code, I won't be of much help. I think this is a job for @fuchsto

devreal avatar Oct 16 '17 15:10 devreal

@devreal Aye!

fuchsto avatar Nov 02 '17 22:11 fuchsto