threadpoolctl
Set number of threads for arrow
Would setting the number of threads used by arrow be in-scope for this library?
(main docs on arrow thread pools)
arrow uses environment variables to set the number of threads used at import time, but then allows dynamically changing the number of threads via setter functions like set_cpu_count. Notably, there are two separate thread pools: one for compute and one for IO.
Is this functionality in scope for this library? If so, it would be great to see this feature.
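For context, here is a minimal sketch of limiting both pools through the Python bindings. It assumes pyarrow exposes io_thread_count and set_io_thread_count next to cpu_count and set_cpu_count. The helper name limit_arrow_pools is made up for illustration, and the module is passed in as an argument so the sketch does not import pyarrow itself.

```python
from contextlib import contextmanager


@contextmanager
def limit_arrow_pools(pa, cpu_threads=None, io_threads=None):
    """Temporarily resize arrow's compute and IO pools (hypothetical helper).

    `pa` is expected to be the imported pyarrow module.
    """
    # Remember the current sizes so they can be restored afterwards.
    old_cpu, old_io = pa.cpu_count(), pa.io_thread_count()
    try:
        if cpu_threads is not None:
            pa.set_cpu_count(cpu_threads)
        if io_threads is not None:
            pa.set_io_thread_count(io_threads)
        yield
    finally:
        # Restore both pools even if the body raised.
        pa.set_cpu_count(old_cpu)
        pa.set_io_thread_count(old_io)
```

It would be used as `with limit_arrow_pools(pa, cpu_threads=1, io_threads=2): ...` after `import pyarrow as pa`.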
Hi @ivirshup, I think that arrow doesn't implement its own threadpool but instead relies on OpenMP for that. So I think controlling the number of OpenMP threads should work:
from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()
with controller.limit(limits=1, user_api='openmp'):
    ...
Thanks for the response. I'm not sure what the specific implementation is, but that example doesn't seem to set the number of threads pyarrow sees. I'll demonstrate:
Using threadpoolctl after pyarrow import
import pyarrow as pa
print(pa.cpu_count())
from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()
with controller.limit(limits=1, user_api="openmp"):
    print(pa.cpu_count())
16
16
Using threadpoolctl during pyarrow import
from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()
with controller.limit(limits=1, user_api="openmp"):
    import pyarrow as pa
    print(pa.cpu_count())
16
Setting OMP_NUM_THREADS
import os
os.environ["OMP_NUM_THREADS"] = "1"
import pyarrow as pa
print(pa.cpu_count())
1
Right, I misinterpreted their doc. I looked into their source code and it appears that they implement their own threadpool, which can be configured by the OMP_NUM_THREADS env var even though it's usually used to control the OpenMP threadpool.
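That would be consistent with the three experiments above: a value read once from the environment at startup is not something an OpenMP limit applied later can reach. A rough sketch of that pattern follows. This is not arrow's actual code, and the function name default_pool_size is invented for illustration.

```python
import os


def default_pool_size(environ=os.environ):
    """Pick an initial pool size from OMP_NUM_THREADS, else the CPU count.

    Hypothetical illustration of "read the env var once at startup".
    """
    raw = environ.get("OMP_NUM_THREADS", "")
    try:
        requested = int(raw)
    except ValueError:
        # Unset or not a number: ignore it.
        requested = 0
    if requested > 0:
        return requested
    # os.cpu_count() may return None on exotic platforms.
    return os.cpu_count() or 1
```

Changing os.environ after such a function has run has no effect on the pool, which is why the variable has to be set before the import.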
I'm not sure yet if we want to explicitly support arrow. An alternative would be to allow custom controllers as requested here https://github.com/joblib/threadpoolctl/issues/137.
> An alternative would be to allow custom controllers as requested here #137.
I believe I prompted that 😆
@ivirshup #138 was merged in the master branch. Feel free to give it a shot to see if it's enough for arrow.
If filename-based dynlib matching is not enough, we could extend it to complement the filename match with a symbol name match as discussed in https://github.com/joblib/threadpoolctl/pull/138#discussion_r1259383124 but this is not yet implemented.
Great! Thanks @ogrisel and @jeremiedbb!
I'm a little unfamiliar with linking, as I've avoided learning much C++, but have given this a shot. It seems to work, but there's something a little strange going on. Here's what I've written:
import threadpoolctl, pyarrow as pa

class ArrowThreadPoolCtlController(threadpoolctl.LibController):
    user_api = "arrow"
    internal_api = "arrow"
    filename_prefixes = ("libarrow",)

    def get_num_threads(self):
        print(f"got {pa.cpu_count()} threads")
        return pa.cpu_count()

    def set_num_threads(self, num_threads):
        print(f"set to {num_threads} threads")
        pa.set_cpu_count(num_threads)

    def get_version(self):
        print("get_version called")
        return pa.__version__

    def set_additional_attributes(self):
        pass

threadpoolctl.register(ArrowThreadPoolCtlController)

with threadpoolctl.threadpool_limits(1):
    print(pa.cpu_count())
Here's the output:
get_version called
get_version called
get_version called
get_version called
got 16 threads
got 16 threads
got 16 threads
got 16 threads
set to 1 threads
set to 1 threads
set to 1 threads
set to 1 threads
1
set to 16 threads
set to 16 threads
set to 16 threads
set to 16 threads
This is from running it just once. This increases each time I register the class, so it could be nice if there was some level of uniqueness for controllers.
Maybe this has to do with the number of dynlibs that start with the prefix? This was run in a conda environment which has these dylibs:
./lib/python3.10/site-packages/pyarrow/libarrow_acero.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_dataset.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_python_flight.dylib
./lib/python3.10/site-packages/pyarrow/libparquet.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_python.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_substrait.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_flight.1200.dylib
Ah, I think I'm starting to see. I think I'm getting a dynlib for all matching files as the expectation is that I am setting the threads directly using the dynlib CDLL object.
I'm not sure I'm going to figure out how to do that. Maybe it could be done by calling the C++ methods for setting threads. I think just using "libarrow." as a signal, where I then use the Python interface and hope they are referring to the same dynlibs, should work for my cases.
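To make the prefix idea concrete, here is a small sketch that applies a plain prefix test to the file listing above. It is only an illustration, not threadpoolctl's actual matching code, and the helper name matching is made up.

```python
import os

# The dylibs from the conda environment listed above.
dylibs = [
    "./lib/python3.10/site-packages/pyarrow/libarrow_acero.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_dataset.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_python_flight.dylib",
    "./lib/python3.10/site-packages/pyarrow/libparquet.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_python.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_substrait.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_flight.1200.dylib",
]


def matching(paths, prefixes):
    """Return the basenames that start with any of the given prefixes."""
    names = (os.path.basename(p) for p in paths)
    return [n for n in names if n.startswith(tuple(prefixes))]


print(len(matching(dylibs, ("libarrow",))))  # 7
print(matching(dylibs, ("libarrow.",)))      # ['libarrow.1200.dylib']
```

The bare "libarrow" prefix matches seven of the eight files on disk. Since threadpoolctl only looks at libraries actually loaded in the process, four of them being loaded would explain the four calls, while "libarrow." narrows the match down to the core library.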
The purpose of threadpoolctl is to make it easy to control the threadpools of native libraries that don't usually have python bindings. When python bindings for the library exist, I'd advise using them directly. For your use case I'd simply do:
from contextlib import contextmanager

import pyarrow as pa

@contextmanager
def limit_arrow(num_threads):
    # Remember the current pool size so it can be restored on exit.
    old_num_threads = pa.cpu_count()
    try:
        pa.set_cpu_count(num_threads)
        yield
    finally:
        pa.set_cpu_count(old_num_threads)

with limit_arrow(1):
    ...
That being said, I think it would still be interesting to support arrow directly. For instance threadpoolctl provides a way to limit all supported libraries at once. Not having to write custom context managers for all libraries is nice.
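As a rough idea of what "all at once" means without threadpoolctl, per-library context managers can be stacked with contextlib.ExitStack. This is a sketch under assumptions: limit_pool and limit_all are hypothetical names, and each library is represented by a (getter, setter) pair such as (pa.cpu_count, pa.set_cpu_count).

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def limit_pool(get_count, set_count, num_threads):
    """Limit one pool, described by its getter and setter (hypothetical)."""
    old = get_count()
    try:
        set_count(num_threads)
        yield
    finally:
        set_count(old)


@contextmanager
def limit_all(pools, num_threads):
    """Limit every (getter, setter) pair in `pools` for the with-block."""
    with ExitStack() as stack:
        for get_count, set_count in pools:
            stack.enter_context(limit_pool(get_count, set_count, num_threads))
        yield
        # Leaving the ExitStack restores the pools in reverse order.
```

Every caller still has to know the getter and setter of each library, which is exactly the bookkeeping that threadpoolctl's discovery saves.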
I've tried to use the symbols from the shared object but there's a catch. arrow being a c++ library, symbol names are mangled :(
nm /home/jeremie/miniforge/envs/tmp2/lib/libarrow.so.1200.1.0 | grep "GetCpuThreadPoolCapacity"
00000000005ea990 T _ZN5arrow24GetCpuThreadPoolCapacityEv
nm --demangle /home/jeremie/miniforge/envs/tmp2/lib/libarrow.so.1200.1.0 | grep "GetCpuThreadPoolCapacity"
00000000005ea990 T arrow::GetCpuThreadPoolCapacity()
There are ways to demangle the name but it's gonna require some work to implement it in a robust and cross-platform way.
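For the simplest case the mangled name can be built by hand. The sketch below covers only a free function inside namespaces that takes no arguments, following the Itanium C++ ABI scheme used by GCC and Clang (MSVC uses a different scheme). The function name mangle_nullary is made up for illustration.

```python
def mangle_nullary(*qualified_name):
    """Itanium-ABI mangled name of a namespaced free function with no arguments.

    mangle_nullary("ns", "func") corresponds to `ns::func()`.
    """
    if len(qualified_name) < 2:
        raise ValueError("expected at least one namespace and a function name")
    # Each component is encoded as <length><identifier>.
    parts = "".join(f"{len(part)}{part}" for part in qualified_name)
    # _Z: mangled symbol, N...E: nested name, v: empty (void) parameter list.
    return f"_ZN{parts}Ev"


print(mangle_nullary("arrow", "GetCpuThreadPoolCapacity"))
# _ZN5arrow24GetCpuThreadPoolCapacityEv
```

This reproduces the symbol shown by nm above, but anything with arguments, templates, or qualifiers needs more of the ABI's encoding rules, which is where the robustness concern comes from.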
> Not having to write custom context managers for all libraries is nice.
Yeah, this is really what I like about this library!
> native libraries that don't usually have python bindings.
So my concern is the case where going through pyarrow wouldn't work: if I was calling some other program that calls out to pyarrow.compute. If I either don't have pyarrow in this environment, or this program is using a bundled/separate version of arrow, the pyarrow approach doesn't work.
Maybe arrow devs would have interest in supporting this?
ping @jorisvandenbossche, we'd like to have your opinion on that :)
We're interested in adding support for arrow in threadpoolctl but I'm facing some issues. The way threadpoolctl works is by searching for and loading the shared library and trying to call the symbols responsible for controlling the number of threads. In arrow these symbols seem to be GetCpuThreadPoolCapacity and SetCpuThreadPoolCapacity.
The issue is that since arrow is a c++ library, the names of the symbols are mangled, see https://github.com/joblib/threadpoolctl/issues/134#issuecomment-1632549685, making it hard to retrieve for threadpoolctl. I can see 3 alternatives:
- arrow exports these symbols as C functions: extern "C" int arrow::GetCpuThreadPoolCapacity(), but arrow people might not want to do that :smile: and even then it does not completely guarantee that the name of the symbol won't change at all (it could acquire a leading underscore, for instance).
- threadpoolctl implements a mechanism to try to demangle the names by looking at the list of all symbols in the DSO and trying to match the mangled names with the ones we're looking for. It will be very tricky to make it work consistently on all platforms.
- the latest version of threadpoolctl allows third-party developers to implement and register a custom controller for their library. You can see an attempt at writing such a controller for arrow, through pyarrow, here https://github.com/joblib/threadpoolctl/issues/134#issuecomment-1632205706. Here the controller does not rely on the C++ symbols but on their python bindings instead. We can't do that in threadpoolctl because we don't want to have a dependency on pyarrow. Do you think the devs of pyarrow would be willing to implement and register an official controller for arrow? An issue with that, mentioned here https://github.com/joblib/threadpoolctl/issues/134#issuecomment-1633881699, is that if another lib bundles arrow, it will need to register its own controller.
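To give an idea of what the second alternative involves, here is a deliberately narrow sketch of matching a qualified name against a list of mangled symbols. It only understands plain nested names in the Itanium scheme (no templates, qualifiers, or substitutions) and tolerates one extra leading underscore. The function names are invented, and obtaining the symbol list from the DSO is left out entirely.

```python
DIGITS = "0123456789"


def plain_nested_name(symbol):
    """Decode `_ZN<len><id>...E...` into a tuple of identifiers, else None."""
    if symbol.startswith("__Z"):
        symbol = symbol[1:]  # tolerate one extra leading underscore
    if not symbol.startswith("_ZN"):
        return None
    pos, parts = 3, []
    while pos < len(symbol) and symbol[pos] in DIGITS:
        end = pos
        while end < len(symbol) and symbol[end] in DIGITS:
            end += 1
        length = int(symbol[pos:end])
        name = symbol[end:end + length]
        if len(name) < length:
            return None  # truncated symbol
        parts.append(name)
        pos = end + length
    # A plain nested name must be closed by "E" right after the identifiers.
    if parts and pos < len(symbol) and symbol[pos] == "E":
        return tuple(parts)
    return None


def find_symbols(symbols, *qualified_name):
    """Return the mangled symbols whose nested name equals qualified_name."""
    return [s for s in symbols if plain_nested_name(s) == qualified_name]
```

For example, find_symbols(all_symbols, "arrow", "GetCpuThreadPoolCapacity") would pick out the symbol shown earlier in the thread. Overloads, other ABIs, and stripped binaries are not handled, which illustrates why a robust cross-platform version is a lot of work.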