
Set number of threads for arrow

Open ivirshup opened this issue 1 year ago • 11 comments

Would setting the number of threads used by arrow be in-scope for this library?

(main docs on arrow thread pools)

arrow uses environment variables to set the number of threads at import time, but it also allows changing that number dynamically via setter functions such as set_cpu_count. Notably, there are two separate thread pools: one for compute and one for IO.

Is this functionality in scope for this library? If so, it would be great to see this feature.

ivirshup avatar Mar 23 '23 14:03 ivirshup

Hi @ivirshup, I think that arrow doesn't implement its own threadpool but instead relies on OpenMP for that. So I think controlling the number of OpenMP threads should work:

from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()

with controller.limit(limits=1, user_api='openmp'):
    ...

jeremiedbb avatar Jun 30 '23 13:06 jeremiedbb

Thanks for the response. I'm not sure what the specific implementation is, but that example doesn't seem to set the number of threads pyarrow sees. I'll demonstrate:

Using threadpoolctl after pyarrow import

import pyarrow as pa
print(pa.cpu_count())

from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()

with controller.limit(limits=1, user_api="openmp"):
    print(pa.cpu_count())
16
16

Using threadpoolctl during pyarrow import

from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()

with controller.limit(limits=1, user_api="openmp"):
    import pyarrow as pa
    print(pa.cpu_count())
16

Setting OMP_NUM_THREADS

import os
os.environ["OMP_NUM_THREADS"] = "1"

import pyarrow as pa
print(pa.cpu_count())
1

ivirshup avatar Jun 30 '23 14:06 ivirshup

Right, I misinterpreted their doc. I looked into their source code and it appears that they implement their own threadpool, which can be configured by the OMP_NUM_THREADS env var even though that variable is usually used to control the OpenMP threadpool.

I'm not sure yet if we want to explicitly support arrow. An alternative would be to allow custom controllers as requested here https://github.com/joblib/threadpoolctl/issues/137.

jeremiedbb avatar Jun 30 '23 14:06 jeremiedbb

An alternative would be to allow custom controllers as requested here #137.

I believe I prompted that 😆

ivirshup avatar Jun 30 '23 15:06 ivirshup

@ivirshup #138 was merged in the master branch. Feel free to give it a shot to see if it's enough for arrow.

If filename-based dynlib matching is not enough, we could extend it to complement the filename match with a symbol name match as discussed in https://github.com/joblib/threadpoolctl/pull/138#discussion_r1259383124 but this is not yet implemented.

ogrisel avatar Jul 11 '23 17:07 ogrisel

Great! Thanks @ogrisel and @jeremiedbb!

I'm a little unfamiliar with linking, as I've avoided learning much C++, but have given this a shot. It seems to work, but there's something a little strange going on. Here's what I've written:

import threadpoolctl, pyarrow as pa

class ArrowThreadPoolCtlController(threadpoolctl.LibController):
    user_api = "arrow"
    internal_api = "arrow"

    filename_prefixes = ("libarrow",)

    def get_num_threads(self):
        print(f"got {pa.cpu_count()} threads")
        return pa.cpu_count()

    def set_num_threads(self, num_threads):
        print(f"set to {num_threads} threads")
        pa.set_cpu_count(num_threads)

    def get_version(self):
        print("get_version called")
        return pa.__version__

    def set_additional_attributes(self):
        pass

threadpoolctl.register(ArrowThreadPoolCtlController)

with threadpoolctl.threadpool_limits(1):
    print(pa.cpu_count())

Here's the output:

get_version called
get_version called
get_version called
get_version called
got 16 threads
got 16 threads
got 16 threads
got 16 threads
set to 1 threads
set to 1 threads
set to 1 threads
set to 1 threads
1
set to 16 threads
set to 16 threads
set to 16 threads
set to 16 threads

This is from running it just once. The number of repeated calls increases each time I register the class, so it would be nice if there were some level of uniqueness for controllers.
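
One way to get that uniqueness on the user side could look like the sketch below. It is not part of threadpoolctl: `register_once` and `_REGISTERED` are hypothetical names, and the `register` argument stands in for `threadpoolctl.register` as used above. It keys on the class's qualified name rather than the class object, on the assumption that re-running a cell creates a new class object with the same name.

```python
# Hypothetical helper, not a threadpoolctl API: remember which controller
# classes were already registered and skip repeats.
_REGISTERED = set()


def register_once(controller_cls, register):
    """Call register(controller_cls) unless a same-named class was registered.

    `register` is any callable taking the class, e.g. threadpoolctl.register.
    Returns True if the class was registered by this call, False if skipped.
    """
    key = (controller_cls.__module__, controller_cls.__qualname__)
    if key in _REGISTERED:
        return False
    register(controller_cls)
    _REGISTERED.add(key)
    return True
```

With the class above, this would be called as `register_once(ArrowThreadPoolCtlController, threadpoolctl.register)`. It only prevents repeated registration; it does not change how many loaded libraries match the prefix.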

Maybe this has to do with the number of dynlibs that start with the prefix? This was run in a conda environment which has these dylibs:

./lib/python3.10/site-packages/pyarrow/libarrow_acero.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_dataset.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_python_flight.dylib
./lib/python3.10/site-packages/pyarrow/libparquet.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_python.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_substrait.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_flight.1200.dylib

ivirshup avatar Jul 12 '23 09:07 ivirshup

Ah, I think I'm starting to see. I think I'm getting a dynlib for each matching file, as the expectation is that I set the threads directly using the dynlib's CDLL object.

I'm not sure I'm going to figure out how to do that. Maybe it could be done by calling the C++ methods for setting threads. For my cases, I think it should work to use just "libarrow." as the prefix, then go through the python interface and hope it refers to the same dynlib.
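
To illustrate the difference between the two prefixes, here is a small sketch that assumes matching is a plain "basename starts with prefix" test (my reading of filename_prefixes, not a confirmed description of threadpoolctl internals). `matching_libs` is a made-up helper, and the paths are the ones from the listing above.

```python
import posixpath

# File list copied from the conda environment listing above.
DYLIBS = [
    "./lib/python3.10/site-packages/pyarrow/libarrow_acero.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_dataset.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_python_flight.dylib",
    "./lib/python3.10/site-packages/pyarrow/libparquet.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_python.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_substrait.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_flight.1200.dylib",
]


def matching_libs(paths, prefixes):
    """Return the basenames in `paths` that start with any of `prefixes`."""
    names = (posixpath.basename(p) for p in paths)
    return [name for name in names if name.startswith(tuple(prefixes))]


broad = matching_libs(DYLIBS, ("libarrow",))    # every libarrow* file
narrow = matching_libs(DYLIBS, ("libarrow.",))  # only the core library
```

On this listing the broad prefix matches seven files and the narrow one matches only libarrow.1200.dylib. The output above shows four calls rather than seven, presumably because only libraries actually loaded in the process are considered.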

ivirshup avatar Jul 12 '23 11:07 ivirshup

The purpose of threadpoolctl is to make it easy to control the threadpools of native libraries that don't usually have python bindings. When python bindings for the library exist, I'd advise using them directly. For your use case I'd simply do:

from contextlib import contextmanager

import pyarrow as pa

@contextmanager
def limit_arrow(num_threads):
    old_num_threads = pa.cpu_count()
    try:
        pa.set_cpu_count(num_threads)
        yield
    finally:
        pa.set_cpu_count(old_num_threads)


with limit_arrow(1):
    ...
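
Since the issue mentions two separate pools (compute and IO), the same pattern could be extended to cover both. This is only a sketch: `limit_arrow_pools` is a made-up name, and the module is passed in as `pa` so the function can be tried without pyarrow installed. It assumes the object exposes `cpu_count`/`set_cpu_count` and `io_thread_count`/`set_io_thread_count`, as pyarrow does.

```python
from contextlib import contextmanager


@contextmanager
def limit_arrow_pools(pa, cpu_threads=None, io_threads=None):
    """Temporarily limit the compute and/or IO pool sizes of `pa`.

    `pa` is expected to be the pyarrow module (or any object with the same
    four functions). A limit left as None is not changed.
    """
    old_cpu = pa.cpu_count()
    old_io = pa.io_thread_count()
    try:
        if cpu_threads is not None:
            pa.set_cpu_count(cpu_threads)
        if io_threads is not None:
            pa.set_io_thread_count(io_threads)
        yield
    finally:
        # Restore both pools even if the body raised.
        pa.set_cpu_count(old_cpu)
        pa.set_io_thread_count(old_io)
```

Usage would be `with limit_arrow_pools(pa, cpu_threads=1, io_threads=1): ...` after `import pyarrow as pa`.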

jeremiedbb avatar Jul 12 '23 13:07 jeremiedbb

That being said, I think it would still be interesting to support arrow directly. For instance threadpoolctl provides a way to limit all supported libraries at once. Not having to write custom context managers for all libraries is nice.

I've tried to use the symbols from the shared object but there's a catch: arrow being a C++ library, its symbol names are mangled :(

nm /home/jeremie/miniforge/envs/tmp2/lib/libarrow.so.1200.1.0 | grep "GetCpuThreadPoolCapacity"
00000000005ea990 T _ZN5arrow24GetCpuThreadPoolCapacityEv

nm --demangle /home/jeremie/miniforge/envs/tmp2/lib/libarrow.so.1200.1.0 | grep "GetCpuThreadPoolCapacity"
00000000005ea990 T arrow::GetCpuThreadPoolCapacity()

There are ways to demangle the name but it's gonna require some work to implement it in a robust and cross-platform way.
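
For the simple case shown in the nm output, the mangled name can be decoded by hand: after the `_ZN` prefix comes a sequence of length-prefixed identifiers, closed by `E`. The sketch below handles only that form, which is nowhere near a general demangler; `demangle_nested_name` is a made-up helper. It ignores argument types, templates, and any other mangling scheme, which is part of why a robust cross-platform version is more work.

```python
def demangle_nested_name(symbol):
    """Decode `_ZN<len><id><len><id>...E...` into `id::id`.

    Returns None for anything that does not follow this simple pattern.
    """
    prefix = "_ZN"
    if not symbol.startswith(prefix):
        return None
    pos = len(prefix)
    parts = []
    while pos < len(symbol) and symbol[pos].isdigit():
        # Read the decimal length that precedes each identifier.
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        name = symbol[end:end + length]
        if len(name) != length:
            return None  # truncated symbol
        parts.append(name)
        pos = end + length
    # The nested name must be closed by "E"; what follows encodes the arguments.
    if not parts or pos >= len(symbol) or symbol[pos] != "E":
        return None
    return "::".join(parts)


decoded = demangle_nested_name("_ZN5arrow24GetCpuThreadPoolCapacityEv")
```

Here `decoded` is the qualified name without the argument list, which `nm --demangle` additionally prints as `()`.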

jeremiedbb avatar Jul 12 '23 13:07 jeremiedbb

Not having to write custom context managers for all libraries is nice.

Yeah, this is really what I like about this library!

native libraries that don't usually have python bindings.

So my concern is the case where going through pyarrow wouldn't work: when I'm calling some other program that calls out to pyarrow.compute. If I either don't have pyarrow in this environment, or that program is using a bundled/separate version of arrow, the pyarrow approach doesn't work.

Maybe arrow devs would have interest in supporting this?

ivirshup avatar Jul 13 '23 09:07 ivirshup

ping @jorisvandenbossche, we'd like to have your opinion on that :)

We're interested in adding support for arrow in threadpoolctl but I'm facing some issues. The way threadpoolctl works is by searching for and loading the shared library, then trying to call the symbols responsible for controlling the number of threads. In arrow these symbols seem to be GetCpuThreadPoolCapacity and SetCpuThreadPoolCapacity.

The issue is that since arrow is a C++ library, the names of the symbols are mangled, see https://github.com/joblib/threadpoolctl/issues/134#issuecomment-1632549685, making them hard for threadpoolctl to retrieve. I can see 3 alternatives:

  • arrow exports these symbols as C functions: extern "C" int arrow::GetCpuThreadPoolCapacity(), but arrow people might not want to do that :smile:, and even then it does not completely guarantee that the name of the symbol won't change at all (it could acquire a leading underscore, for instance).

  • threadpoolctl implements a mechanism to try to demangle the name by looking at the list of all symbols in the DSO and trying to match the mangled names with the ones we're looking for. It will be very tricky to make it work consistently on all platforms.

  • the latest version of threadpoolctl allows third-party developers to implement and register a custom controller for their library. You can see an attempt at writing such a controller for arrow, through pyarrow, here https://github.com/joblib/threadpoolctl/issues/134#issuecomment-1632205706. Here the controller does not rely on the C++ symbols but on their python bindings instead. We can't do that in threadpoolctl because we don't want to have a dependency on pyarrow. Do you think the devs of pyarrow would be willing to implement and register an official controller for arrow? An issue with that, mentioned here https://github.com/joblib/threadpoolctl/issues/134#issuecomment-1633881699, is that if another lib bundles arrow, it will need to register its own controller.
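
To make the second alternative a bit more concrete, here is a rough sketch of matching against a list of symbol names. It assumes the mangling style seen in the nm output above, where each identifier is preceded by its length, and it would not apply to compilers that mangle differently (MSVC, for example). `find_candidates` is a made-up helper; only the first sample symbol comes from this thread, the other two are made up for illustration.

```python
def find_candidates(symbols, namespace, function):
    """Return symbols containing the length-prefixed `namespace::function`.

    This is a substring test, so it tolerates a leading underscore or
    different argument encodings, but it can also yield false positives.
    """
    fragment = f"{len(namespace)}{namespace}{len(function)}{function}"
    return [s for s in symbols if fragment in s]


# Sample symbol list: the first entry is from the nm output in this thread,
# the other two are illustrative only.
SYMBOLS = [
    "_ZN5arrow24GetCpuThreadPoolCapacityEv",
    "_ZN5arrow24SetCpuThreadPoolCapacityEi",
    "_ZN7parquet10SomeHelperEv",
]

getters = find_candidates(SYMBOLS, "arrow", "GetCpuThreadPoolCapacity")
```

Obtaining the symbol list from a loaded library in a portable way is the part left out here, and it is probably the hard part.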

jeremiedbb avatar Jul 13 '23 16:07 jeremiedbb