
*multiThreadedpow2AlignedAlloc/disjoint_w_params* tests fail sporadically

Open ldorau opened this issue 4 months ago • 8 comments

*multiThreadedpow2AlignedAlloc/disjoint_w_params* tests:

  • mallocPoolTest/umfPoolTest.multiThreadedpow2AlignedAlloc/disjoint_w_params_2_umf_ba_global (test_memoryPool) and
  • disjointPoolTests/umfPoolTest.multiThreadedpow2AlignedAlloc/disjoint_w_params_0_umf_ba_global (test_disjoint_pool)

fail sporadically in the following way: https://github.com/oneapi-src/unified-memory-framework/actions/runs/16843892970/job/47720079405

[ RUN      ] mallocPoolTest/umfPoolTest.multiThreadedpow2AlignedAlloc/disjoint_w_params_2_umf_ba_global
/home/runner/work/unified-memory-framework/unified-memory-framework/test/poolFixtures.hpp:221: Failure
Expected: (ptr) != (nullptr), actual: NULL vs (nullptr)

or: https://github.com/ldorau/unified-memory-framework/actions/runs/16845396177/job/47724161570

[ RUN      ] disjointPoolTests/umfPoolTest.multiThreadedpow2AlignedAlloc/disjoint_w_params_0_umf_ba_global
/home/testuser/test/poolFixtures.hpp:221: Failure
Expected: (ptr) != (nullptr), actual: NULL vs (nullptr)

Environment Information

  • UMF version (hash commit or a tag): cc0565d6ca4628c78b9ab16d42e122d610e9c7e2
  • OS(es) version(s): Linux

Please provide a reproduction of the bug:

$ while ./test/test_memoryPool --gtest_filter="*multiThreadedpow2AlignedAlloc/disjoint_w_params*" > ./log.txt 2>&1 && ./test/test_disjoint_pool --gtest_filter="*multiThreadedpow2AlignedAlloc/disjoint_w_params*" > ./log.txt 2>&1 ; do date ; done

How often the bug is revealed:

rarely

Details

The root cause is the pool_register_slab error "register failed because the address is already registered!", which in turn makes bucket_create_slab fail:

[PID:1835396 TID:1835401 ERROR UMF] pool_register_slab: register failed because the address is already registered!
[PID:1835396 TID:1835401 ERROR UMF] bucket_create_slab: slab_reg failed!

More logs:

$ grep -e "ERROR UMF" -e Failure -e 0x7fd07f81e008 ./log.txt
[PID:1835396 TID:1835401 DEBUG UMF] umfMemoryTrackerAddAtLevel: memory region is added, tracker=0x7fd07fe40068, level=0, pool=0x7fd07fe40268, ptr=0x7fd07f81e008, size=4096
[PID:1835396 TID:1835401 DEBUG UMF] pool_register_slab: slab: 0x7fd07fe493e8, start: 0x7fd07f81e008
[PID:1835396 TID:1835400 DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7fd07fe40068, level=0, pool=0x7fd07fe40268, ptr=0x7fd07f81e008, size=4096
[PID:1835396 TID:1835401 DEBUG UMF] umfMemoryTrackerAddAtLevel: memory region is added, tracker=0x7fd07fe40068, level=0, pool=0x7fd07fe40268, ptr=0x7fd07f81e008, size=4096
[PID:1835396 TID:1835401 DEBUG UMF] pool_register_slab: slab: 0x7fd07fe496e8, start: 0x7fd07f81e008
[PID:1835396 TID:1835400 DEBUG UMF] pool_unregister_slab: slab: 0x7fd07fe493e8, start: 0x7fd07f81e008
[PID:1835396 TID:1835401 ERROR UMF] pool_register_slab: register failed because the address is already registered! (slab: 0x7fd07fe496e8, start: 0x7fd07f81e008)
[PID:1835396 TID:1835401 ERROR UMF] bucket_create_slab: slab_reg failed!
[PID:1835396 TID:1835401 DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7fd07fe40068, level=0, pool=0x7fd07fe40268, ptr=0x7fd07f81e008, size=4096
[PID:1835396 TID:1835399 DEBUG UMF] umfMemoryTrackerAddAtLevel: memory region is added, tracker=0x7fd07fe40068, level=0, pool=0x7fd07fe40268, ptr=0x7fd07f81e008, size=4096
[PID:1835396 TID:1835399 DEBUG UMF] pool_register_slab: slab: 0x7fd07fe495e8, start: 0x7fd07f81e008
/home/ldorau/work/unified-memory-framework/test/poolFixtures.hpp:221: Failure

and:

$ grep -e "ERROR UMF" -e Failure -e 0x7f15c647f008 ./log.txt
[PID:772    TID:776    DEBUG UMF] umfMemoryTrackerAddAtLevel: memory region is added, tracker=0x7f15c64b8068, level=0, pool=0x7f15c64b8268, ptr=0x7f15c647f008, size=4096
[PID:772    TID:776    DEBUG UMF] pool_register_slab: slab: 0x7f15c64c1468, start: 0x7f15c647f008
[PID:772    TID:776    DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7f15c64b8068, level=0, pool=0x7f15c64b8268, ptr=0x7f15c647f008, size=4096
[PID:772    TID:773    DEBUG UMF] umfMemoryTrackerAddAtLevel: memory region is added, tracker=0x7f15c64b8068, level=0, pool=0x7f15c64b8268, ptr=0x7f15c647f008, size=4096
[PID:772    TID:773    DEBUG UMF] pool_register_slab: slab: 0x7f15c64c17e8, start: 0x7f15c647f008
[PID:772    TID:773    ERROR UMF] pool_register_slab: register failed because the address is already registered! (slab: 0x7f15c64c17e8, start: 0x7f15c647f008)
[PID:772    TID:773    ERROR UMF] bucket_create_slab: slab_reg failed!
[PID:772    TID:773    DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7f15c64b8068, level=0, pool=0x7f15c64b8268, ptr=0x7f15c647f008, size=4096
[PID:772    TID:776    DEBUG UMF] pool_unregister_slab: slab: 0x7f15c64c1468, start: 0x7f15c647f008
[PID:772    TID:775    DEBUG UMF] umfMemoryTrackerAddAtLevel: memory region is added, tracker=0x7f15c64b8068, level=0, pool=0x7f15c64b8268, ptr=0x7f15c647f008, size=4096
[PID:772    TID:775    DEBUG UMF] pool_register_slab: slab: 0x7f15c64c1668, start: 0x7f15c647f008
/home/ldorau/work/unified-memory-framework/test/poolFixtures.hpp:221: Failure
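
One way to read the two excerpts above: in both, a second pool_register_slab for the same start address is logged before the pool_unregister_slab of the earlier slab, even though the tracker has already removed and re-added the region. The sketch below is a hypothetical log-scanning helper (not part of UMF) that flags this pattern in a log file. It assumes the log line format shown above, and it assumes that the order of lines in the log approximates the order of events, which is not guaranteed across threads. The sample lines are copied from the second excerpt.

```python
import re

# Matches only the DEBUG line logged for a registration attempt; the ERROR
# line starts with "pool_register_slab: register failed" and does not match.
REGISTER = re.compile(
    r"TID:(\d+)\s.*pool_register_slab: slab: (0x[0-9a-f]+), start: (0x[0-9a-f]+)")
UNREGISTER = re.compile(
    r"TID:(\d+)\s.*pool_unregister_slab: slab: (0x[0-9a-f]+), start: (0x[0-9a-f]+)")


def find_overlapping_registrations(lines):
    """Return registration attempts made while the same start address
    was still registered by an earlier slab (in log order)."""
    registered = {}  # start address -> (tid, slab) of the current holder
    overlaps = []
    for line in lines:
        m = REGISTER.search(line)
        if m:
            tid, slab, start = m.groups()
            if start in registered:
                overlaps.append({"start": start,
                                 "holder": registered[start],
                                 "attempt": (tid, slab)})
            else:
                registered[start] = (tid, slab)
            continue
        m = UNREGISTER.search(line)
        if m:
            registered.pop(m.group(3), None)
    return overlaps


# Lines taken from the second log excerpt in this issue.
SAMPLE = [
    "[PID:772    TID:776    DEBUG UMF] pool_register_slab: slab: 0x7f15c64c1468, start: 0x7f15c647f008",
    "[PID:772    TID:773    DEBUG UMF] pool_register_slab: slab: 0x7f15c64c17e8, start: 0x7f15c647f008",
    "[PID:772    TID:773    ERROR UMF] pool_register_slab: register failed because the address is already registered! (slab: 0x7f15c64c17e8, start: 0x7f15c647f008)",
    "[PID:772    TID:776    DEBUG UMF] pool_unregister_slab: slab: 0x7f15c64c1468, start: 0x7f15c647f008",
    "[PID:772    TID:775    DEBUG UMF] pool_register_slab: slab: 0x7f15c64c1668, start: 0x7f15c647f008",
]

for hit in find_overlapping_registrations(SAMPLE):
    print(hit)
```

To scan a full log, the same function can be called with an open file object, for example find_overlapping_registrations(open("log.txt")).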

ldorau avatar Aug 11 '25 12:08 ldorau

Logs attached:

log1.zip log2.zip

ldorau avatar Aug 11 '25 12:08 ldorau

The culprit is (found by git-bisect):

7930e59d71a3bf21d747c539c310b9825e7ddee0 is the first bad commit
commit 7930e59d71a3bf21d747c539c310b9825e7ddee0
Author: Rafal Rudnicki <[email protected]>
Date:   Mon Jul 21 14:26:48 2025 +0000

    implement umfPoolTrimMemory
---
bisect found first bad commit
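
Because the failure is rare, a single passing run does not prove a commit good. The issue does not say how the bisect was driven; the sketch below is one hypothetical way to do it, a predicate script for git bisect run that repeats a test command and reports "bad" on the first failure. The script name, the helper names, and the number of attempts are assumptions, and the script does not rebuild the project between bisect steps.

```python
import subprocess
import sys


def fails_within(cmd, attempts):
    """Return True if cmd exits non-zero in any of `attempts` runs."""
    for _ in range(attempts):
        result = subprocess.run(cmd,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            return True
    return False


def bisect_exit_code(cmd, attempts):
    """Map the outcome to the exit codes git bisect run expects:
    0 marks the commit good, 1 marks it bad."""
    return 1 if fails_within(cmd, attempts) else 0


# Hypothetical usage: python3 flaky_bisect.py <attempts> <test command...>
if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(bisect_exit_code(sys.argv[2:], int(sys.argv[1])))
```

A possible invocation, with an arbitrary attempt count, would be: git bisect run python3 flaky_bisect.py 500 ./test/test_disjoint_pool --gtest_filter="*multiThreadedpow2AlignedAlloc/disjoint_w_params*". A "good" verdict from such a script is only probabilistic, so the attempt count should be large relative to the observed failure rate.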

ldorau avatar Aug 12 '25 07:08 ldorau

The latest failure: https://github.com/ldorau/unified-memory-framework/actions/runs/16902790459/job/47885775331

ldorau avatar Aug 12 '25 08:08 ldorau

5 of 6 weekly CI builds failed because of this issue: https://github.com/oneapi-src/unified-memory-framework/actions/runs/16843892970

ldorau avatar Aug 12 '25 09:08 ldorau

Next failure: https://github.com/ldorau/unified-memory-framework/actions/runs/16955859410/job/48057824688

ldorau avatar Aug 14 '25 07:08 ldorau

This may be related to: https://github.com/oneapi-src/unified-memory-framework/issues/1492

ldorau avatar Aug 14 '25 08:08 ldorau

Next failure: https://github.com/oneapi-src/unified-memory-framework/actions/runs/16962600184/job/48079262217

ldorau avatar Aug 14 '25 10:08 ldorau

The culprit is (found by git-bisect):

7930e59d71a3bf21d747c539c310b9825e7ddee0 is the first bad commit
commit 7930e59d71a3bf21d747c539c310b9825e7ddee0
Author: Rafal Rudnicki <[email protected]>
Date:   Mon Jul 21 14:26:48 2025 +0000

    implement umfPoolTrimMemory
---
bisect found first bad commit

  1. CI builds from the last good commit (https://github.com/oneapi-src/unified-memory-framework/commit/b56f6909276566084d11aa3847fa1ce5e39d0698):
       • Weekly: https://github.com/ldorau/unified-memory-framework/actions/runs/18220204171
       • PR/push: https://github.com/ldorau/unified-memory-framework/actions/runs/18220204229
       • Nightly: https://github.com/ldorau/unified-memory-framework/actions/runs/18220204201

  2. CI builds from the first bad commit (https://github.com/oneapi-src/unified-memory-framework/commit/7930e59d71a3bf21d747c539c310b9825e7ddee0):
       • Weekly: https://github.com/ldorau/unified-memory-framework/actions/runs/18220232952
       • PR/push: https://github.com/ldorau/unified-memory-framework/actions/runs/18220232917
       • Nightly: https://github.com/ldorau/unified-memory-framework/actions/runs/18220232918

ldorau avatar Oct 03 '25 10:10 ldorau