ISAAC plugin exits with segmentation fault

Open benjha opened this issue 4 years ago • 12 comments

Hi @PrometheusPi @psychocoderHPC,

After several unsuccessful attempts to get traces out with TAU, I ran PIConGPU & ISAAC in a default configuration (profiling off, dumping viz. frames to Alpine, 1000 steps with checkpoint.restart.loop=3, using the /etc/picongpu/8_isaac.cfg file) and noticed that the simulation breaks with the following errors at the end of its execution, which is why TAU can't generate the traces:

[h09n09:151879] *** Process received signal ***
[h09n09:151879] Signal: Segmentation fault (11)
[h09n09:151879] Signal code: Address not mapped (1)
[h09n09:151879] Failing at address: 0x3be700000008
[h09n09:151879] [ 0] [d22n15:170622] *** Process received signal ***
[d22n15:170622] Signal: Segmentation fault (11)
[d22n15:170622] Signal code: Address not mapped (1)
[d22n15:170622] Failing at address: 0x19f800000008
[h09n09:151880] *** Process received signal ***
[h09n09:151880] Signal: Segmentation fault (11)
[h09n09:151880] Signal code: Address not mapped (1)
[h09n09:151880] Failing at address: 0x19fe00000008
[h09n09:151880] [ 0] [0x2000000504d8]
[h09n09:151880] [ 1] [0x2000000504d8]
[h09n09:151879] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[h09n09:151879] [ 2] [d22n15:170623] *** Process received signal ***
[d22n15:170623] Signal: Segmentation fault (11)
[d22n15:170623] Signal code: Address not mapped (1)
[d22n15:170623] Failing at address: 0x3be800000008
/gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[h09n09:151880] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[h09n09:151880] [ 3] [d22n15:170622] [ 0] [0x2000000504d8]
[d22n15:170622] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[h09n09:151879] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[h09n09:151879] [ 4] [d22n15:170623] [ 0] [0x2000000504d8]
[d22n15:170623] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[d22n15:170623] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x98)[0x1043e4d8]
[d22n15:170622] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[d22n15:170622] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[d22n15:170622] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[h09n09:151880] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[h09n09:151880] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[h09n09:151880] [ 6] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE12pluginUnloadEv+0xb8)[0x10369338]
[d22n15:170623] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_000119dd_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x1030b514]
[d22n15:170623] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[d22n15:170623] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[d22n15:170623] [ 6] /gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[h09n09:151879] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[h09n09:151879] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[h09n09:151879] *** End of error message ***
/gpfs/alpine/proj-shared/csc434/benjha/picongpu-simulations/LWFA_ISAAC_perf/lwfa_isaac_1280x720/input/bin/picongpu(main+0x1c)[0x102f931c]
[d22n15:170622] [ 5] /lib64/libc.so.6(+0x25200)[0x200001095200]
[d22n15:170622] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[d22n15:170622] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[h09n09:151880] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xc4)[0x2000010953f4]
[d22n15:170623] *** End of error message ***
ERROR:  One or more process (first noticed rank 6) terminated with signal 11 (core dumped)

It looks like the issue is in the pluginUnload() method of IsaacPlugin.hpp, which in turn calls the IsaacVisualization destructor.
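The frames in the backtrace are mangled C++ symbol names. As a minimal sketch, assuming a GCC- or Clang-based toolchain (Itanium C++ ABI), they can be made readable with `abi::__cxa_demangle`. The helper name `demangle` is only illustrative and is not part of PIConGPU or ISAAC.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>   // std::free
#include <string>

// Illustrative helper: returns the demangled form of a symbol name,
// or the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled)
{
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with malloc.
    char* result = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || result == nullptr)
        return mangled;
    std::string readable(result);
    std::free(result);  // the caller owns the malloc'ed buffer
    return readable;
}
```

Calling `demangle("_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv")` yields `picongpu::isaacP::IsaacPlugin::pluginUnload()`, which matches the frame the crash is attributed to. The `c++filt` command-line tool does the same for whole log files.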

Can you reproduce this error ?

benjha avatar Feb 01 '21 23:02 benjha

@benjha Thanks for reporting the error. Since we are currently pushing out new versions of our software, could you please specify which version you are using that creates the error:

  • PIConGPU
  • ISAAC
  • alpaka

Then we can quickly check whether we are able to reproduce the error on hemera as well.

PrometheusPi avatar Feb 02 '21 09:02 PrometheusPi

PIConGPU is from the dev branch as of Nov. 2020, with its own bundled Alpaka distribution:

commit 84e03980f2a56c7aea24d88bc3be9eb43f1a3197
Merge: aa86f2d c5208f4
Author: Sergei Bastrakov <[email protected]>
Date:   Wed Nov 25 10:50:46 2020 +0100

ISAAC:

commit 47c475ddd3fcd732964f5ce22edfe2fbcfae2b14
Merge: 3186666 74ab372
Author: René Widera <[email protected]>
Date:   Fri Nov 6 13:30:40 2020 +0100

    Merge pull request #118 from ComputationalRadiationPhysics/dev
    
    Merge json-rodarae file to latetest release cadidate

benjha avatar Feb 02 '21 14:02 benjha

@benjha Thanks for providing the details. I will see whether I can reproduce this bug.

PrometheusPi avatar Feb 02 '21 15:02 PrometheusPi

Hi @PrometheusPi

I am installing the current PIConGPU dev branch with ISAAC 1.5.2 to verify whether they work properly for this case.

I am getting a list of errors like these:

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(112): error #135: namespace "alpaka" has no member "Dev"

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(112): error #65: expected a ";"

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(113): error #135: namespace "alpaka" has no member "DimInt"

/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/gcc_6.4.0/include/isaac.hpp(113): error #65: expected a ";"

which is likely an Alpaka version mismatch between the version PIConGPU dev uses and the one ISAAC uses.

Were there any changes in the way compilation works?

benjha avatar Feb 08 '21 21:02 benjha

Are you sure you used release 1.5.2 and not the current dev branch? The dev branch of ISAAC is currently incompatible with the PIConGPU dev branch. There is a PR https://github.com/ComputationalRadiationPhysics/picongpu/pull/3498 in PIConGPU to fix it, but we first need to switch our PIConGPU CI to the ISAAC dev branch.

The release 1.5.2 is currently checked together with PIConGPU dev.

psychocoderHPC avatar Feb 09 '21 19:02 psychocoderHPC

@FelixTUD Could you please test the current dev of PIConGPU together with the release 1.5.2?

psychocoderHPC avatar Feb 09 '21 20:02 psychocoderHPC

I've rechecked dependencies and fixed the Alpaka mismatch issue.

With the current PIConGPU dev branch and ISAAC 1.5.2, using the following configuration:

#################################
## Section: Required Variables ##
#################################

TBG_wallTime="0:30:00"

TBG_devices_x=2
TBG_devices_y=2
TBG_devices_z=2

TBG_gridSize="192 2048 160"
TBG_steps="4000"

TBG_restartLoop="--checkpoint.restart.loop 1"


#################################
## Section: Optional Variables ##
#################################

TBG_isaac="--isaac.width 1280 --isaac.height 720 --isaac.period 1  --isaac.name !TBG_jobName  --isaac.url apps.marble.ccs.ornl.gov  --isaac.port 30167"


TBG_plugins="!TBG_isaac"

#################################
## Section: Program Parameters ##
#################################

TBG_deviceDist="!TBG_devices_x !TBG_devices_y !TBG_devices_z"

TBG_programParams="-d !TBG_deviceDist \
                   -g !TBG_gridSize   \
                   -s !TBG_steps      \
                   !TBG_restartLoop  \
                   !TBG_plugins      \
                   --versionOnce"

# TOTAL number of devices
TBG_tasks="$(( TBG_devices_x * TBG_devices_y * TBG_devices_z ))"

"$TBG_cfgPath"/submitAction.sh

PIConGPU throws the following errors:

$ cat stderr.725795
[a02n05:79941] *** Process received signal ***
[a02n05:79941] Signal: Segmentation fault (11)
[a02n05:79941] Signal code: Address not mapped (1)
[a02n05:79941] Failing at address: 0x12a000000008
[a18n18:153800] *** Process received signal ***
[a18n18:153800] Signal: Segmentation fault (11)
[a18n18:153800] Signal code: Address not mapped (1)
[a18n18:153800] Failing at address: 0x25a900000008
[a18n18:153800] [ 0] [0x2000000504d8]
[a18n18:153800] [ 1] [a02n05:79944] *** Process received signal ***
[a02n05:79944] Signal: Segmentation fault (11)
[a02n05:79944] Signal code: Address not mapped (1)
[a02n05:79944] Failing at address: 0x4bea00000008
/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a18n18:153800] [ 2] [a02n05:79943] *** Process received signal ***
[a02n05:79943] Signal: Segmentation fault (11)
[a02n05:79943] Signal code: Address not mapped (1)
[a02n05:79943] Failing at address: 0x38e400000008
[a02n05:79943] [ 0] [0x2000000504d8]
[a02n05:79943] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a02n05:79943] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a18n18:153800] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a18n18:153800] [ 4] [a02n05:79941] [ 0] [0x2000000504d8]
[a02n05:79941] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a02n05:79941] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a02n05:79941] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a18n18:153800] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a18n18:153800] [ 6] /lib64/libc.so.6(+0x25200)[0x200000e75200]
[a18n18:153800] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a18n18:153800] *** End of error message ***
[a02n05:79944] [ 0] [0x2000000504d8]
[a02n05:79944] [ 1] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN5isaac18IsaacVisualizationIN6alpaka6DevCpuENS1_12AccGpuCudaRtISt17integral_constantImLm3EEjEENS1_32QueueUniformCudaHipRtNonBlockingES5_N4mpl_4int_ILi3EEEN5boost6fusion4consIN8picongpu6isaacP14ParticleSourceINSE_9ParticlesIN5pmacc4meta6StringIJLc101ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0ELc0EEEENSB_3mpl6v_itemINSE_11chargeRatioINSE_20ChargeRatioElectronsENSI_13pmacc_isAliasEEENSN_INSE_9massRatioINSE_18MassRatioElectronsESQ_EENSN_INSE_7currentINSE_13currentSolver9EsirkepovINSE_9particles6shapes3TSCENSW_8strategy16CachedSupercellsELj3EEESQ_EENSN_INSE_13interpolationINSE_28FieldToParticleInterpolationIS10_NSE_30AssignedTrilinearInterpolationEEESQ_EENSN_INSE_5shapeIS10_SQ_EENSN_INSE_14particlePusherINSY_6pusher5BorisESQ_EENSM_7vector0INS8_2naEEELi0EEELi0EEELi0EEELi0EEELi0EEELi0EEENSN_INSE_9weightingENSN_INSE_8momentumENSN_INSE_8positionINSE_12position_picESQ_EES1I_Li0EEELi0EEELi0EEEEEEENSC_4nil_EEENSD_INSF_12TFieldSourceINSE_6FieldEEEENSD_INS21_INSE_6FieldBEEENSD_INS21_INSE_6FieldJEEENSD_INS21_INSE_17FieldTmpOperationINSY_14particleToGrid24ComputeGridValuePerFrameIS10_NS29_17derivedAttributes7DensityEEES1X_EEEES1Z_EEEEEEEENSI_9DataSpaceILj3EEELj1024ENSI_4math6VectorIfLi3ENS2M_16StandardAccessorENS2M_17StandardNavigatorENS2M_6detail17Vector_componentsEEENS_17DefaultControllerENS_17DefaultCompositorEED2Ev+0x58)[0x10398658]
[a02n05:79944] [ 2] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a02n05:79944] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a02n05:79941] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a02n05:79941] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a02n05:79941] [ 6] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a02n05:79944] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a02n05:79944] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a02n05:79944] [ 6] /lib64/libc.so.6(+0x25200)[0x200000e75200]
[a02n05:79944] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a02n05:79944] *** End of error message ***
/gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu6isaacP11IsaacPlugin12pluginUnloadEv+0x40)[0x1041b1f0]
[a02n05:79943] [ 3] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_10SimulationEE12pluginUnloadEv+0xb8)[0x10355a28]
[a02n05:79943] [ 4] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(_ZN63_GLOBAL__N__39_tmpxft_0001329e_00000000_6_main_cpp1_ii_5586f50813runSimulationEiPPc+0x664)[0x102fa874]
[a02n05:79943] [ 5] /gpfs/alpine/proj-shared/csc434/benjha/src/picongpu_02082021/simulations/LWFA_ISAAC_perf/lwfa_1280x720/input/bin/picongpu(main+0x1c)[0x102eb4ac]
[a02n05:79943] [ 6] /lib64/libc.so.6(+0x25200)[0x200000e75200]
[a02n05:79943] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a02n05:79943] *** End of error message ***
/lib64/libc.so.6(+0x25200)[0x200000e75200]
[a02n05:79941] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x200000e753f4]
[a02n05:79941] *** End of error message ***
ERROR:  One or more process (first noticed rank 7) terminated with signal 11 (core dumped)

This is the output from the ISAAC server:

$  isaac --dump /gpfs/alpine/proj-shared/csc434/PIConGPU_ISAAC_SLATE_output &
[1] 15
sh-4.2$ Using web_port=2459, tcp_port=2458 and sim_port=2460

Running ISAAC Master
Starting insitu plugin listener
Launching WebSocketDataConnector
Launching TCPDataConnector
Launching SaveFileImageConnector
Launching JPEG_URI_Stream
New connection, giving id 0 (control)
Group complete, sending to connected interfaces
sh-4.2$ Connection 0 closed.
Removed group 0

For now, I will be dumping the ISAAC timers into files, but it would be great to get more insight by using a profiler.

benjha avatar Feb 09 '21 20:02 benjha

@psychocoderHPC I'm looking into it. An LWFA setup compiles without a problem on hemera with PIConGPU dev and ISAAC 1.5.2.

FelixTUD avatar Feb 09 '21 21:02 FelixTUD

I can reproduce an identical error with an MPI execution of the example. This should help me track down the problem.

FelixTUD avatar Feb 09 '21 21:02 FelixTUD

@benjha I might have found the error. As a hotfix, you can try removing the line https://github.com/ComputationalRadiationPhysics/isaac/blob/c7e9ff9bafe9e65811fc116fe06d5db8a51f7c5e/lib/isaac.hpp#L3465. I need to take a more detailed look at it later, but it seems that json_init_root is only initialized on the master node, which is why destruction throws a segfault on all other nodes. Let me know if this fixes it for now.
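The failure mode described here (a pointer that is set on one rank only but released on every rank) can be illustrated with a small, self-contained sketch. This is not ISAAC's code and not necessarily how the eventual fix works. `Node`, `release`, `releasedCount`, and `Visualization` are hypothetical stand-ins, and the rank is passed in as a plain integer rather than queried from MPI.

```cpp
#include <cstddef>

// Hypothetical stand-in for a reference-counted JSON node (not a real library type).
struct Node { std::size_t refcount = 1; };

// Hypothetical counter so the cleanup can be observed from outside.
inline int& releasedCount() { static int n = 0; return n; }

// Drops one reference. It dereferences the pointer, so the pointer must be valid.
inline void release(Node* n)
{
    if (--n->refcount == 0) { delete n; ++releasedCount(); }
}

class Visualization
{
public:
    explicit Visualization(int rank)
    {
        // Only the master rank allocates the root; on all other ranks it stays nullptr.
        if (rank == 0)
            initRoot = new Node{};
    }
    ~Visualization()
    {
        // Guard: without this check, non-master ranks would call release()
        // on a pointer that was never set up, which is undefined behaviour.
        if (initRoot != nullptr)
            release(initRoot);
    }
    // Copying would release the same node twice, so it is disabled.
    Visualization(const Visualization&) = delete;
    Visualization& operator=(const Visualization&) = delete;

    bool hasRoot() const { return initRoot != nullptr; }

private:
    Node* initRoot = nullptr;  // default-initialised so the guard is reliable
};
```

Constructing `Visualization` with rank 0 allocates and later frees the node, while any other rank skips both steps. The key points in this sketch are that the member has a defined default value and that the destructor mirrors the condition used in the constructor.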

FelixTUD avatar Feb 09 '21 21:02 FelixTUD

Thanks @FelixTUD It worked.

I am testing further...

benjha avatar Feb 10 '21 19:02 benjha

This should be fixed with #132

FelixTUD avatar Mar 09 '21 18:03 FelixTUD