Release GVL around pw and gr function calls in Etc
These can block as they parse files.
I'm not sure if other function calls in Etc can block, so this doesn't add GVL releasing to them.
Releasing the GVL for these methods results in thread-safety issues, since thread-safety for the methods relied previously on the GVL. Add a mutex and surround GVL-releasing calls with a mutex.
This appears to fail on TruffleRuby due to a TruffleRuby bug.
The man page of macOS says:
All of these routines are thread-safe. The
getpwent(),getpwnam(), andgetpwuid()routines return a pointer to a result managed by the system library in a thread-specific data structure.
I'm not sure they are thread safe on all operating systems. OpenBSD doesn't appear to use a thread-safe data structure. The Linux man pages don't mention thread safety.
I'm not sure we have thread-safety tests/specs for getpwnam and such, but there is one for getgrgid, and it segfaulted consistently on Linux before I added the mutex synchronization:
https://github.com/jeremyevans/ruby/actions/runs/10000243457
MacOS getgrgid man page has similar thread-safety language. Would you like like mutex synchronization turned off for MacOS?
The TruffleRuby issue is https://github.com/oracle/truffleruby/issues/3624 it would be nice to wait for the fix before merging this, so at least it passes on truffleruby-head .
The CI matrix should also be made fail-fast: false.
As an aside, the signature of rb_mutex_synchronize in Ruby headers is quite wrong:
VALUE rb_mutex_synchronize(VALUE mutex, VALUE (*func)(VALUE arg), VALUE arg);
but it should be:
void* rb_mutex_synchronize(VALUE mutex, void* (*func)(void* arg), void* arg);
Could we fix that in Ruby headers?
I found the GVL releasing caused significant thread-safety issues on platforms where the underlying functions are not thread-safe. However, I was only able to trigger the issues by manually adding rb_thread_schedule calls after the mutexes were released. I pushed a commit to fix the thread-safety issues by holding the mutex while creating the related Ruby objects to be returned. I also added thread-safety test, and support for a -DTEST_THREAD_SAFETY cflag to deliberately trying to expose thread-safety issues using rb_thread_schedule. All added thread-safety tests pass with -DTEST_THREAD_SAFETY.
As an aside, the signature of
rb_mutex_synchronizein Ruby headers is quite wrong:VALUE rb_mutex_synchronize(VALUE mutex, VALUE (*func)(VALUE arg), VALUE arg);but it should be:
void* rb_mutex_synchronize(VALUE mutex, void* (*func)(void* arg), void* arg);Could we fix that in Ruby headers?
I think this is likely to introduce int to pointer conversion compiler warnings. Maybe we could add a new function for the void * version?
https://github.com/oracle/truffleruby/issues/3624 is fixed, so once https://github.com/ruby/truffleruby-dev-builder/actions/runs/10057958196 is done it should pass on truffleruby-head.
I just thought of an alternative to rb_mutex_synchronize: to use the rb_nativethread_lock* APIs of https://github.com/ruby/ruby/blob/57b11be15ae518288b719fb36068ceb23da6e050/include/ruby/thread_native.h#L88.
This feels a bit nicer than using a Ruby Mutex around a call releasing the GVL, and avoids running into https://github.com/oracle/truffleruby/issues/3624 for existing TruffleRuby releases.
It also removes the need for all these casts.
The fix in TruffleRuby is merged and should be available in the next dev build (I suppose tomorrow?) cc @eregon
I just thought of an alternative to
rb_mutex_synchronize: to use therb_nativethread_lock*APIs of https://github.com/ruby/ruby/blob/57b11be15ae518288b719fb36068ceb23da6e050/include/ruby/thread_native.h#L88. This feels a bit nicer than using a Ruby Mutex around a call releasing the GVL, and avoids running into oracle/truffleruby#3624 for existing TruffleRuby releases. It also removes the need for all these casts.
I tried this approach. You need to wrap rb_nativethread_lock_lock in rb_thread_call_without_gvl to avoid deadlock, and you still need some casts. I pushed a commit for it, but I'm not sure whether it is nicer. It does appear to pass on TruffleRuby, though.
After some more thought, I came to the determination that the approach of releasing the GVL and using a mutex is not safe. It works fine if all calls to get{pwuid,pwnam,grgid,grnam} originate in etc.c. However, Ruby core in process.c and file.c can call the functions, and concurrent calls could result in segfaults.
I've pushed a commit to drop the mutex usage, retain the GVL for get{pwuid,pwnam,grgid,grnam} calls, and add support for calling get{pwuid,pwnam,grgid,grnam}_r instead, using the implementations from process.c (specifically, the implementations in https://github.com/ruby/ruby/pull/11202, which adds GVL releasing to these function calls).