OSX + Python 3.13 CI crashes
When trying to add CI for Python 3.13 on all platforms as part of PR #170, OSX hangs with the following error in the "Native PY Build" stage:
testValue (p4p.test.test_client_raw.TestRequest.testValue) ... ok
test_empty (p4p.test.test_gw.TestGC.test_empty) ... 2025-07-14T11:23:30.137659000 WARN pvxs.client.setup
ignoring EPICS_PVA_BROADCAST_PORT=0
ok
test_server (p4p.test.test_gw.TestGC.test_server) ... 2025-07-14T11:23:30.151199000 WARN pvxs.client.setup
ignoring EPICS_PVA_BROADCAST_PORT=0
ok
test_get (p4p.test.test_gw.TestHighLevel.test_get) ... pthread_attr_destroy ERROR Invalid argument
free_threadInfoThread non-EPICS_6223228928 (0x600001b543c0) can't proceed, suspending.
Dumping a stack trace of thread 'non-EPICS_6223228928':
[ 0x104cfec68]: /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/epicscorelibs/lib/libCom.7.0.7.99.1.dylib(epicsStackTrace+0x74)
[ 0x104d15ed0]: /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/epicscorelibs/lib/libCom.7.0.7.99.1.dylib(cantProceed+0x44)
[ 0x104d023b8]: /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/epicscorelibs/lib/libCom.7.0.7.99.1.dylib(free_threadInfo+0x14c)
[ 0x18ed68870]: /usr/lib/system/libsystem_pthread.dylib(_pthread_tsd_cleanup+0x1e8)
[ 0x18ed6b684]: /usr/lib/system/libsystem_pthread.dylib(_pthread_exit+0x54)
[ 0x18ed6afa0]: /usr/lib/system/libsystem_pthread.dylib(_pthread_start+0x94)
[ 0x18ed65d34]: /usr/lib/system/libsystem_pthread.dylib(thread_start+0x8)
Unfortunately there's no more immediately available information; the ci-core-dumper doesn't print any extra information. An example of a failed run is here.
I removed this platform from the matrix in order to test and roll out an alpha release for this project + the other related EPICS Python projects.
Well, this is strange. As far as I can tell, the only other mention of this error is #156. That this is coming from a non-EPICS thread seems significant.
IMO, this seems likely to be a Base bug. At a glance, it looks like epicsThreadOSD::attr is not explicitly initialized by createImplicit(), which leaves it as zeroed by calloc(). I would naively think that this was ok, but perhaps not.
@anjohnson @simon-ess Could one of you try to manually run nonEpicsThreadPriorityTest on OSX?
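For readers unfamiliar with why free_threadInfo shows up under _pthread_tsd_cleanup in the stack trace: that is where destructors registered via pthread_key_create() run when a thread exits. The following is a minimal sketch of the mechanism, assuming a plain POSIX environment. The names info_key, free_info, foreign_thread and run_tsd_demo are hypothetical and not taken from EPICS Base.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical names for illustration; not taken from EPICS Base. */
static pthread_key_t info_key;
static int cleanup_calls = 0;

/* Destructor registered with the key. The pthread library calls it at
 * thread exit, and only for threads holding a non-NULL value. */
static void free_info(void *raw)
{
    assert(raw != NULL);
    cleanup_calls++;
}

/* Plays the role of a thread that was not created through the library's
 * own thread API but still ends up with per-thread data under the key. */
static void *foreign_thread(void *arg)
{
    static int dummy;
    (void)arg;
    pthread_setspecific(info_key, &dummy);
    return NULL;
}

/* Returns how many times the destructor ran for one foreign thread,
 * or -1 if a pthread call failed. */
int run_tsd_demo(void)
{
    pthread_t tid;

    cleanup_calls = 0;
    if (pthread_key_create(&info_key, free_info) != 0) return -1;
    if (pthread_create(&tid, NULL, foreign_thread, NULL) != 0) return -1;
    if (pthread_join(tid, NULL) != 0) return -1;
    pthread_key_delete(info_key);
    return cleanup_calls;
}
```

Calling run_tsd_demo() from a small main() should show the destructor firing once, on the exiting thread itself, which is the same position free_threadInfo occupies in the trace.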
The Makefile is explicitly only building that test on Linux.
ifeq ($(OS_CLASS),Linux)
ifeq ($(USE_POSIX_THREAD_PRIORITY_SCHEDULING),YES)
TESTPROD_HOST += nonEpicsThreadPriorityTest
nonEpicsThreadPriorityTest_SRCS += nonEpicsThreadPriorityTest.cpp
nonEpicsThreadPriorityTest_SYS_LIBS += $(POSIX_LDLIBS:-l%=%)
testHarness_SRCS += nonEpicsThreadPriorityTest.cpp
epicsRunLibComTests_CFLAGS += -DHAVE_PTHREAD_PRIORITY_SCHEDULING
TESTS += nonEpicsThreadPriorityTest
endif
endif
Removing that build condition gives this on MacOS:
Kato$ ./nonEpicsThreadPriorityTest
1..2
ok 1 # SKIP nonEpicsThreadPriorityTest must be executed with privileges to use SCHED_FIFO
ok 2 - epicsEventWait returned 0
Results
=======
Tests: 2
Passed: 2 = 100.00%
Skipped: 1 = 50.00%
Apparently MacOS doesn't support priority scheduling; running epicsThreadPriorityTest with sudo:
Kato$ sudo $PWD/epicsThreadPriorityTest
Password:
1..7
ok 1 - task 0x6000032f40c0 epicsEventWait returned 0
ok 2 - epicsEventWaitWithTimeout returned 0
ok 3 - task 0x6000032f40c0 epicsEventWait returned 0
ok 4 - epicsEventWaitWithTimeout returned 0
# No strict priority scheduler
ok 5 - task 0x6000032f40c0 epicsEventWait returned 0
ok 6 - epicsEventWaitWithTimeout returned 0
ok 7 - epicsEventWait returned 0
Results
=======
Tests: 7
Passed: 7 = 100.00%
Well, so much for the easy route... This suggests to me that some cpython 3.13 regression is involved. Perhaps related to changes around the "free threading" feature?
Removing this platform from the matrix for now seems reasonable to me. Unless (until?) a related issue is found on Linux, I think that OSX support with py >= 3.13 will need to wait for some interested person to run the P4P tests locally and investigate this situation.
I have started looking at this issue, but I am struggling a little bit with it. The OS X issue happens during the gateway teardown, specifically in the call to self.tearDownGW() in tearDown() (in the class TestHighLevel). The behaviour of this line is what differs between Linux and OS X: https://github.com/epics-base/epics-base/blob/9ce921f4fdc954205fbafdd2173345c2b8b090a2/modules/libcom/src/osi/os/posix/osdThread.c#L234
static void free_threadInfo(void* raw)
{
    epicsThreadOSD *pthreadInfo = raw;
    int status;
    if(epicsAtomicDecrIntT(&pthreadInfo->refcnt) > 0) return;
    status = mutexLock(&listLock);
    checkStatusQuit(status,"pthread_mutex_lock","free_threadInfo");
    if(pthreadInfo->isOnThreadList) ellDelete(&pthreadList,&pthreadInfo->node);
    status = pthread_mutex_unlock(&listLock);
    checkStatusQuit(status,"pthread_mutex_unlock","free_threadInfo");
    epicsEventDestroy(pthreadInfo->suspendEvent);
    status = pthread_attr_destroy(&pthreadInfo->attr); // <--------------- here
    checkStatusQuit(status,"pthread_attr_destroy","free_threadInfo");
    free(pthreadInfo);
}
On os x, we get a status of 22 (EINVAL, matching the "Invalid argument" in the CI log) here. Unfortunately, there isn't much other information:
(lldb) bt
* thread #12, stop reason = breakpoint 3.1
* frame #0: 0x0000000105bd4ab4 libCom.3.24.0.dylib`cantProceed(msg="free_threadInfo") at cantProceed.c:57:5
frame #1: 0x0000000105be4d04 libCom.3.24.0.dylib`free_threadInfo(raw=0x000000014cf51d10) at osdThread.c:235:5
frame #2: 0x0000000183874870 libsystem_pthread.dylib`_pthread_tsd_cleanup + 488
frame #3: 0x0000000183877684 libsystem_pthread.dylib`_pthread_exit + 84
frame #4: 0x0000000183876fa0 libsystem_pthread.dylib`_pthread_start + 148
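As a side note on reading that status: pthread functions return the error number directly rather than setting errno, and on both Linux and macOS the value 22 is EINVAL. A rough sketch of turning such a status into a log line similar to the one in the CI output follows. format_status is a hypothetical helper for illustration only and is not the checkStatusQuit macro from Base.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, loosely modelled on the
 * "pthread_attr_destroy ERROR Invalid argument" line in the CI log.
 * It is not the checkStatusQuit macro from EPICS Base. */
int format_status(int status, const char *call, char *buf, size_t len)
{
    if (status == 0) {
        return snprintf(buf, len, "%s ok", call);
    }
    /* pthread functions return the error number directly instead of
     * setting errno, so it can be passed straight to strerror(). */
    return snprintf(buf, len, "%s ERROR %s", call, strerror(status));
}
```

Passing EINVAL and "pthread_attr_destroy" should yield the same text as the first line of the crash report, which at least confirms that the status seen in lldb and the message in the CI log are the same failure.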
The contents of the pthreadInfo->attr object aren't that illuminating. On darwin-aarch64:
(lldb) print pthreadInfo->attr
(pthread_attr_t) (__sig = 0, __opaque = "")
On linux:
(gdb) print pthreadInfo->attr
$1 = {__size = '\000' <repeats 55 times>, __align = 0}
Have you compared with an "epics" thread? Or with the pcommonAttr->attr global in osdThread.c?
I also wonder if the following change to Base would affect the situation:
diff --git a/modules/libcom/src/osi/os/posix/osdThread.c b/modules/libcom/src/osi/os/posix/osdThread.c
index db043dff6..9d46cd204 100644
--- a/modules/libcom/src/osi/os/posix/osdThread.c
+++ b/modules/libcom/src/osi/os/posix/osdThread.c
@@ -666,6 +666,8 @@ static epicsThreadOSD *createImplicit(void)
pthreadInfo->tid = tid;
pthreadInfo->osiPriority = 0;
pthreadInfo->isOkToBlock = 1;
+ status = pthread_attr_init(&pthreadInfo->attr);
+ checkStatusOnce(status,"pthread_attr_init");
#if defined(_POSIX_THREAD_PRIORITY_SCHEDULING) && _POSIX_THREAD_PRIORITY_SCHEDULING > 0
if(pthread_getschedparam(tid,&pthreadInfo->schedPolicy,&pthreadInfo->schedParam) == 0) {
@mdavidsaver - I can confirm that this change addresses this issue and stops the test from hanging on python 3.13.
So the cause is an uninitialized pthread_attr_t. It is not clear to me why this has not caused a failure until now. Presumably some behind-the-scenes change by Apple? It probably does not matter, as the fix is clear.
The surprising thing is that it appears uninitialised for both linux and os x; it's just that only os x seems to care about this, and only in python 3.13 and later does it hang. See for example this build log: https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=1320791&view=logs&j=c8b8640e-759f-5af3-65b5-809c078c71b1&t=afb740aa-a99a-548c-edee-7cc314e30345
You can see that we get the same exception on cleanup on many test cases, but they are silently ignored.
With this change the object is initialised properly for everyone and everything seems to be happy.
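The general pattern behind the fix can be sketched in isolation: pair every pthread_attr_destroy() with an earlier pthread_attr_init(), even when the containing struct came from calloc(). The thread_info type and the two functions below are hypothetical stand-ins, not the actual epicsThreadOSD code.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-thread record; the real epicsThreadOSD
 * has many more fields. */
typedef struct thread_info {
    pthread_attr_t attr;
    int attr_ready;   /* set once pthread_attr_init() has succeeded */
} thread_info;

/* Allocate a record for a thread we did not create ourselves.
 * calloc() zeroes the memory, but a zeroed pthread_attr_t is not
 * guaranteed to be a valid attribute object, so initialise it explicitly. */
thread_info *thread_info_create(void)
{
    thread_info *info = calloc(1, sizeof(*info));
    if (info == NULL) return NULL;
    if (pthread_attr_init(&info->attr) != 0) {
        free(info);
        return NULL;
    }
    info->attr_ready = 1;
    return info;
}

/* Release the record; returns the status of pthread_attr_destroy(). */
int thread_info_free(thread_info *info)
{
    int status = 0;
    if (info == NULL) return 0;
    if (info->attr_ready) status = pthread_attr_destroy(&info->attr);
    free(info);
    return status;
}
```

With the init in place, thread_info_free() is expected to return 0 on any conforming platform, rather than depending on how a particular libc treats zeroed memory.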
... uninitialised for both linux and os x; it's just that only os x seems to care about this ...
The Linux side is easily understandable since we can read the implementations of pthread_attr_init() and pthread_attr_destroy(). eg. with recent glibc the init is mostly zeroing, and the destroy has no error checks.
The OSX implementations presumably have (added?) some kind of flag to mark a struct as initialized.
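To make the difference between the two behaviours concrete, here is a toy model, not either platform's real code. It assumes a made-up attribute type with a signature field (compare the __sig = 0 seen in lldb above) and contrasts a destroy that validates the signature with one that checks nothing. All names and the magic value are invented for illustration.

```c
#include <errno.h>
#include <string.h>

/* Toy attribute type, purely illustrative: the field name and the magic
 * value are made up and are not any platform's real layout. */
#define TOY_ATTR_SIG 0x54414747L

typedef struct toy_attr {
    long sig;       /* equals TOY_ATTR_SIG only between init and destroy */
    int  detached;
} toy_attr;

int toy_attr_init(toy_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->sig = TOY_ATTR_SIG;
    return 0;
}

/* Strict destroy: refuses anything that was never initialised,
 * including memory that calloc() merely zeroed. */
int toy_attr_destroy_strict(toy_attr *attr)
{
    if (attr->sig != TOY_ATTR_SIG) return EINVAL;
    attr->sig = 0;
    return 0;
}

/* Lenient destroy: performs no validation, so a zeroed struct "works". */
int toy_attr_destroy_lenient(toy_attr *attr)
{
    (void)attr;
    return 0;
}
```

In this model a zeroed toy_attr passes the lenient destroy but gets EINVAL from the strict one, which mirrors the Linux versus OS X behaviour described in this thread.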
Fix applied to Base by https://github.com/epics-base/epics-base/commit/2e26ec09a6304c9a58e143c61fef22e259830bbe
I don't think this fix has been merged back into main? That commit is only in a fork.
@AlexanderWells-diamond - the necessary fix has been merged into the 7.0 branch of EPICS base; building pvxs and p4p on top of that should work to fix this issue.
Apologies, I wasn't clear - the fix is in EPICS Base but the CI for p4p is still currently disabled.
You have a fair point, let's leave this ticket open at least until this fix lands in epicscorelibs.