OSX + Python 3.13 CI crashes

Open AlexanderWells-diamond opened this issue 5 months ago • 14 comments

When trying to add CI for Python 3.13 to all platforms as part of PR #170, OSX hangs with the following error in the "Native PY Build" stage:

testValue (p4p.test.test_client_raw.TestRequest.testValue) ... ok
test_empty (p4p.test.test_gw.TestGC.test_empty) ... 2025-07-14T11:23:30.137659000 WARN pvxs.client.setup
     ignoring EPICS_PVA_BROADCAST_PORT=0
ok
test_server (p4p.test.test_gw.TestGC.test_server) ... 2025-07-14T11:23:30.151199000 WARN pvxs.client.setup
     ignoring EPICS_PVA_BROADCAST_PORT=0
ok
test_get (p4p.test.test_gw.TestHighLevel.test_get) ... pthread_attr_destroy  ERROR Invalid argument
free_threadInfoThread non-EPICS_6223228928 (0x600001b543c0) can't proceed, suspending.
Dumping a stack trace of thread 'non-EPICS_6223228928':
[       0x104cfec68]: /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/epicscorelibs/lib/libCom.7.0.7.99.1.dylib(epicsStackTrace+0x74)
[       0x104d15ed0]: /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/epicscorelibs/lib/libCom.7.0.7.99.1.dylib(cantProceed+0x44)
[       0x104d023b8]: /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/epicscorelibs/lib/libCom.7.0.7.99.1.dylib(free_threadInfo+0x14c)
[       0x18ed68870]: /usr/lib/system/libsystem_pthread.dylib(_pthread_tsd_cleanup+0x1e8)
[       0x18ed6b684]: /usr/lib/system/libsystem_pthread.dylib(_pthread_exit+0x54)
[       0x18ed6afa0]: /usr/lib/system/libsystem_pthread.dylib(_pthread_start+0x94)
[       0x18ed65d34]: /usr/lib/system/libsystem_pthread.dylib(thread_start+0x8)

Unfortunately there's no more immediately available information; the ci-core-dumper doesn't print any extra information. An example of a failed run is here.

I removed this platform from the matrix in order to test and roll out an alpha release for this project + the other related EPICS Python projects.

AlexanderWells-diamond avatar Jul 14 '25 12:07 AlexanderWells-diamond

Well, this is strange. As far as I can tell, the only other mention of this error is #156. That this is coming from a non-EPICS thread seems significant.

IMO this seems likely to be a Base bug. At a glance, it looks like epicsThreadOSD::attr is not explicitly initialized by createImplicit(). This will leave it as zeroed by calloc(). I would naively think that this was ok, but perhaps not.

@anjohnson @simon-ess Could one of you try to manually run nonEpicsThreadPriorityTest on OSX?

mdavidsaver avatar Jul 14 '25 17:07 mdavidsaver

Could one of you try to manually run nonEpicsThreadPriorityTest on OSX?

The Makefile is explicitly only building that test on Linux.

ifeq ($(OS_CLASS),Linux)
ifeq ($(USE_POSIX_THREAD_PRIORITY_SCHEDULING),YES)
TESTPROD_HOST += nonEpicsThreadPriorityTest
nonEpicsThreadPriorityTest_SRCS += nonEpicsThreadPriorityTest.cpp
nonEpicsThreadPriorityTest_SYS_LIBS += $(POSIX_LDLIBS:-l%=%)
testHarness_SRCS += nonEpicsThreadPriorityTest.cpp
epicsRunLibComTests_CFLAGS += -DHAVE_PTHREAD_PRIORITY_SCHEDULING
TESTS += nonEpicsThreadPriorityTest
endif
endif

Removing that build condition gives this on MacOS:

Kato$ ./nonEpicsThreadPriorityTest 
1..2
ok  1 # SKIP nonEpicsThreadPriorityTest must be executed with privileges to use SCHED_FIFO
ok  2 - epicsEventWait returned 0

    Results
    =======
       Tests: 2  
      Passed:   2 = 100.00%
     Skipped:   1 = 50.00%

Apparently MacOS doesn't support priority scheduling; running this under sudo:

Kato$ sudo $PWD/epicsThreadPriorityTest
Password:
1..7
ok  1 - task 0x6000032f40c0 epicsEventWait returned 0
ok  2 - epicsEventWaitWithTimeout returned 0
ok  3 - task 0x6000032f40c0 epicsEventWait returned 0
ok  4 - epicsEventWaitWithTimeout returned 0
# No strict priority scheduler
ok  5 - task 0x6000032f40c0 epicsEventWait returned 0
ok  6 - epicsEventWaitWithTimeout returned 0
ok  7 - epicsEventWait returned 0

    Results
    =======
       Tests: 7  
      Passed:   7 = 100.00%

anjohnson avatar Jul 14 '25 18:07 anjohnson

Removing that build condition gives this on MacOS:

Kato$ ./nonEpicsThreadPriorityTest 
1..2
ok  1 # SKIP nonEpicsThreadPriorityTest must be executed with privileges to use SCHED_FIFO

Well, so much for the easy route... This suggests to me that some cpython 3.13 regression is involved. Perhaps related to changes around the "free threading" feature?

I removed this platform from the matrix in order to test and roll out an alpha release for this project + the other related EPICS Python projects.

This seems reasonable to me. Unless (until?) some related issue is found on Linux, I think that OSX support with py >= 3.13 will need to wait for some interested person to run the P4P tests locally and investigate this situation.

mdavidsaver avatar Jul 20 '25 21:07 mdavidsaver

I have started looking at this issue, but I am struggling a little bit with it. The OS X issue happens during the gateway teardown, specifically the call to self.tearDownGW() in tearDown() (in the class TestHighLevel). The behaviour of this line is what differs between Linux and OS X: https://github.com/epics-base/epics-base/blob/9ce921f4fdc954205fbafdd2173345c2b8b090a2/modules/libcom/src/osi/os/posix/osdThread.c#L234

static void free_threadInfo(void* raw)
{
    epicsThreadOSD *pthreadInfo = raw;
    int status;

    if(epicsAtomicDecrIntT(&pthreadInfo->refcnt) > 0) return;

    status = mutexLock(&listLock);
    checkStatusQuit(status,"pthread_mutex_lock","free_threadInfo");
    if(pthreadInfo->isOnThreadList) ellDelete(&pthreadList,&pthreadInfo->node);
    status = pthread_mutex_unlock(&listLock);
    checkStatusQuit(status,"pthread_mutex_unlock","free_threadInfo");
    epicsEventDestroy(pthreadInfo->suspendEvent);
    status = pthread_attr_destroy(&pthreadInfo->attr);                       // <--------------- here
    checkStatusQuit(status,"pthread_attr_destroy","free_threadInfo");
    free(pthreadInfo);
}

On OS X, we get a status of 22 (EINVAL, the "Invalid argument" seen in the CI log) here. Unfortunately, there isn't much other information:

(lldb) bt
* thread #12, stop reason = breakpoint 3.1
  * frame #0: 0x0000000105bd4ab4 libCom.3.24.0.dylib`cantProceed(msg="free_threadInfo") at cantProceed.c:57:5
    frame #1: 0x0000000105be4d04 libCom.3.24.0.dylib`free_threadInfo(raw=0x000000014cf51d10) at osdThread.c:235:5
    frame #2: 0x0000000183874870 libsystem_pthread.dylib`_pthread_tsd_cleanup + 488
    frame #3: 0x0000000183877684 libsystem_pthread.dylib`_pthread_exit + 84
    frame #4: 0x0000000183876fa0 libsystem_pthread.dylib`_pthread_start + 148

The contents of the pthreadInfo->attr object aren't that illuminating: on darwin-aarch64:

(lldb) print pthreadInfo->attr
(pthread_attr_t)  (__sig = 0, __opaque = "")

On linux:

(gdb) print pthreadInfo->attr
$1 = {__size = '\000' <repeats 55 times>, __align = 0}

simon-ess avatar Sep 01 '25 14:09 simon-ess

The contents of the pthreadInfo->attr object aren't that illuminating: on darwin-aarch64:

Have you compared with an "epics" thread? Or with the pcommonAttr->attr global in osdThread.c?

I also wonder if the following change to Base would affect the situation:

diff --git a/modules/libcom/src/osi/os/posix/osdThread.c b/modules/libcom/src/osi/os/posix/osdThread.c
index db043dff6..9d46cd204 100644
--- a/modules/libcom/src/osi/os/posix/osdThread.c
+++ b/modules/libcom/src/osi/os/posix/osdThread.c
@@ -666,6 +666,8 @@ static epicsThreadOSD *createImplicit(void)
     pthreadInfo->tid = tid;
     pthreadInfo->osiPriority = 0;
     pthreadInfo->isOkToBlock = 1;
+    status = pthread_attr_init(&pthreadInfo->attr);
+    checkStatusOnce(status,"pthread_attr_init");
 
 #if defined(_POSIX_THREAD_PRIORITY_SCHEDULING) && _POSIX_THREAD_PRIORITY_SCHEDULING > 0
     if(pthread_getschedparam(tid,&pthreadInfo->schedPolicy,&pthreadInfo->schedParam) == 0) {

mdavidsaver avatar Sep 01 '25 23:09 mdavidsaver

@mdavidsaver - I can confirm that this change addresses this issue and stops the test from hanging on python 3.13.

simon-ess avatar Sep 03 '25 10:09 simon-ess

So the cause is an uninitialized pthread_attr_t. It is not clear to me why this had not caused a failure until now. Presumably some behind-the-scenes change by Apple? It probably does not matter, as the fix is clear.

mdavidsaver avatar Sep 03 '25 14:09 mdavidsaver

The surprising thing is that it appears uninitialised for both linux and os x; it's just that only os x seems to care about this, and only in python 3.13 and later does it hang. See for example this build log: https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=1320791&view=logs&j=c8b8640e-759f-5af3-65b5-809c078c71b1&t=afb740aa-a99a-548c-edee-7cc314e30345

You can see that we get the same exception on cleanup on many test cases, but they are silently ignored.

With this change the object is initialised properly for everyone and everything seems to be happy.

simon-ess avatar Sep 03 '25 14:09 simon-ess

... uninitialised for both linux and os x; it's just that only os x seems to care about this ...

The Linux side is easily understandable since we can read the implementations of pthread_attr_init() and pthread_attr_destroy(). E.g. with recent glibc the init is mostly zeroing, and the destroy has no error checks.

The OSX implementations presumably have (added?) some kind of flag to mark a struct as initialized.

mdavidsaver avatar Sep 03 '25 14:09 mdavidsaver

Fix applied to Base by https://github.com/epics-base/epics-base/commit/2e26ec09a6304c9a58e143c61fef22e259830bbe

mdavidsaver avatar Sep 03 '25 15:09 mdavidsaver

I don't think this fix has been merged back into main? That commit is only in a fork.

AlexanderWells-diamond avatar Oct 01 '25 07:10 AlexanderWells-diamond

@AlexanderWells-diamond - the necessary fix has been merged into the 7.0 branch of EPICS base; building pvxs and p4p on top of that should work to fix this issue.

simon-ess avatar Oct 01 '25 07:10 simon-ess

Apologies, I wasn't clear - the fix is in EPICS Base but the CI for p4p is still currently disabled.

AlexanderWells-diamond avatar Oct 01 '25 07:10 AlexanderWells-diamond

You have a fair point, let's leave this ticket open at least until this fix lands in epicscorelibs.

mdavidsaver avatar Oct 02 '25 01:10 mdavidsaver