
Bad memory accesses in springCleaning()

Open trquinn opened this issue 1 year ago • 3 comments

Running ChaNGa under valgrind reports that CkArray::springCleaning() is accessing freed memory. This is with ChaNGa version 3.5 (commit v3.5-11-gc7ba57c0) and charm version v7.1.0-devel-321-g606459e74, built on an AMD/InfiniBand machine as mpi-linux-x86_64-smp with gcc v11.2.0 and mvapich2 2.3.6.

Soon after writing an output (using CkIO), valgrind reports errors like:

==3563697== Invalid read of size 8
==3563697==    at 0x7DC2DB: CkArray::staticSpringCleaning(void*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88A663: CcdRaiseCondition (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88AD15: CcdCallBacks (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88EE0B: CsdScheduleForever (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88F234: CsdScheduler (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD46: ConverseRunPE(int) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD8A: call_startfn(void*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x50A91CE: start_thread (in /usr/lib64/libpthread-2.28.so)
==3563697==    by 0x6CDBE72: clone (in /usr/lib64/libc-2.28.so)
==3563697==  Address 0x848f768 is 1,016 bytes inside a block of size 1,024 free'd
==3563697==    at 0x4C4AB30: free (in /apps/spack/anvil/apps/valgrind/3.15.0-gcc-11.2.0-u7tvx2t/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3563697==    by 0x7DF84B: CkIndex_CkArray::_call_ckDestroy_void(void*, void*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x7D4ABF: CkDeliverMessageFree (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x7D8194: _processHandler(void*, CkCoreState*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88EDCB: CsdScheduleForever (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88F234: CsdScheduler (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD46: ConverseRunPE(int) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD8A: call_startfn(void*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x50A91CE: start_thread (in /usr/lib64/libpthread-2.28.so)
==3563697==    by 0x6CDBE72: clone (in /usr/lib64/libc-2.28.so)
==3563697==  Block was alloc'd at
==3563697==    at 0x4C495ED: malloc (in /apps/spack/anvil/apps/valgrind/3.15.0-gcc-11.2.0-u7tvx2t/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3563697==    by 0x7D5BCD: CkCreateLocalGroup (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x7D858E: _processHandler(void*, CkCoreState*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88EDCB: CsdScheduleForever (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x88F234: CsdScheduler (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD46: ConverseRunPE(int) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x8DDD8A: call_startfn(void*) (in /anvil/scratch/x-trq/testdata/ChaNGa.smp)
==3563697==    by 0x50A91CE: start_thread (in /usr/lib64/libpthread-2.28.so)
==3563697==    by 0x6CDBE72: clone (in /usr/lib64/libc-2.28.so)
==3563697== 

trquinn avatar Aug 25 '23 03:08 trquinn

I discovered this while investigating #3678, so it may be related.

trquinn avatar Aug 25 '23 04:08 trquinn

Mathew, I am adding you to this issue simply because you are familiar with CkIO. But the issue (probably) has to do with the "spring cleaning" garbage collection scheme for broadcasts being applied to deleted chare arrays when it should not be.
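
To make the suspected failure mode concrete, here is a minimal, self-contained C++ sketch of the general pattern: an object registers a periodic callback that captures a pointer to itself, and the callback must be cancelled when the object is destroyed or it will later touch freed memory. This is not Charm++ code and not a proposed patch. `PeriodicRegistry`, `ArrayManager`, and `run_demo` are hypothetical names invented for illustration, and the registry is only a stand-in for a periodic-callback mechanism, not the actual Converse API.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Hypothetical stand-in for a periodic-callback registry (NOT the Converse API).
class PeriodicRegistry {
public:
    using Handle = int;

    // Register a callback and return a handle that can be used to cancel it.
    Handle add(std::function<void()> fn) {
        Handle h = next_++;
        callbacks_[h] = std::move(fn);
        return h;
    }

    // Remove a callback so it is never invoked again.
    void cancel(Handle h) { callbacks_.erase(h); }

    // Invoke every callback that is still registered, as one scheduler tick would.
    void raise() {
        auto snapshot = callbacks_;  // copy so callbacks may cancel during the tick
        for (const auto& entry : snapshot) {
            if (callbacks_.count(entry.first) != 0) {
                entry.second();
            }
        }
    }

private:
    Handle next_ = 0;
    std::map<Handle, std::function<void()>> callbacks_;
};

// Hypothetical array manager that wants periodic cleanup of its own state.
class ArrayManager {
public:
    ArrayManager(PeriodicRegistry& reg, int& cleanups)
        : reg_(reg), cleanups_(cleanups) {
        // The callback captures `this`, so it is only valid while the object lives.
        handle_ = reg_.add([this] { springCleaning(); });
    }

    // Cancelling here is the point of the sketch: without it, the next raise()
    // would call springCleaning() on a destroyed object (a use-after-free).
    ~ArrayManager() { reg_.cancel(handle_); }

    ArrayManager(const ArrayManager&) = delete;
    ArrayManager& operator=(const ArrayManager&) = delete;

private:
    void springCleaning() { ++cleanups_; }

    PeriodicRegistry& reg_;
    int& cleanups_;
    PeriodicRegistry::Handle handle_ = 0;
};

// Returns how many times the cleanup ran across two ticks.
int run_demo() {
    PeriodicRegistry reg;
    int cleanups = 0;
    {
        ArrayManager mgr(reg, cleanups);
        reg.raise();  // manager alive: cleanup runs
    }
    reg.raise();      // manager destroyed and callback cancelled: nothing runs
    return cleanups;
}
```

Calling `run_demo()` exercises both ticks. If the destructor's `cancel` call were removed, the second `raise()` would invoke the callback on a destroyed object, which resembles the kind of access the valgrind trace above reports. Whether the actual bug has this shape is an assumption based on the trace and the comment above.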

lvkale avatar Sep 22 '23 01:09 lvkale

This crash can be reproduced on stampede3 by compiling ChaNGa, changing to the "teststep" directory, and running:

../ChaNGa.smp ++ppn 12 -n 1000 -oi 10 +setcpuaffinity +commap 0,1 +pemap 2-46:2,3-47:2 -binout 6 test_pg.param

The program will run for about 800 seconds before crashing.

trquinn avatar Apr 08 '24 03:04 trquinn