ARROW-17327 [Python] Parquet should be listed in PyArrow's get_libraries() function
I tried to come up with a suitable test for this, but we only run into the issue in PyMongoArrow because we are explicitly including binary files by path instead of relying on ext.library_dirs as is done in test_cython.py.
https://issues.apache.org/jira/browse/ARROW-17327
:warning: Ticket has not been started in JIRA, please click 'Start Progress'.
I'm a bit surprised. How are you using PyArrow exactly? Are you trying to link to Parquet C++ APIs?
Oh, I see: libarrow_python links to libparquet for Parquet encryption support.
That said, I wonder why the transitive dependency isn't picked up automatically.
We are vendoring the specific arrow binary files by path, copying them into our build directory. We do this because we use them at runtime for the array builders.
We use the list returned by get_libraries() to select which files to vendor, using module.extra_link_args.append(path) instead of module.libraries.append('arrow').
Our logic is here.
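For illustration, a minimal sketch of that vendoring step (the vendor_libraries helper and the glob pattern are simplified stand-ins, not our actual code):

import shutil
from pathlib import Path

import pyarrow as pa

def vendor_libraries(target_dir):
    # Copy the shared libraries named by pa.get_libraries() out of
    # the pyarrow install so they can be loaded at runtime without
    # linking by name.
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for lib_dir in pa.get_library_dirs():
        for name in pa.get_libraries():
            # Matches e.g. libarrow.so.900, libarrow.dylib, arrow.dll
            for path in Path(lib_dir).glob(f"*{name}.*"):
                shutil.copy(path, target / path.name)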
Hmm, I'm not sure that's the intended use for get_libraries. Basically, get_libraries allows you to link your C++ or Cython code to the Arrow Python libraries, but it doesn't list transitive dependencies. It also doesn't capture additional Arrow libraries such as libarrow_flight, libgandiva...
@jorisvandenbossche @xhochy What do you think?
We are linking Cython code FWIW. I can try to refactor to use module.libraries instead of module.extra_link_args.
Yes, but the reason get_libraries() isn't enough is that you are also copying the DLL files instead of simply distributing the entire PyArrow package, right?
Yes, we're using get_libraries() as a proxy for the names of files to copy over, optionally with the version string in the file name.
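Concretely, something like this (a sketch assuming Arrow's convention that the shared-object version is major * 100 + minor, e.g. 9.0.0 -> .so.900 on Linux):

import pyarrow as pa

# Derive the versioned Linux file names from the plain library names.
major, minor = pa.__version__.split(".")[:2]
so_version = int(major) * 100 + int(minor)
for name in pa.get_libraries():
    print(f"lib{name}.so.{so_version}")  # e.g. libarrow.so.900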
I tried using the approach from the Cython example, but it is failing during auditwheel: "ValueError: Cannot repair wheel, because required library "libarrow_python.so.900" could not be located".
Okay, I was able to use the recommended approach along with wheel repair, with one interesting snag. Because I am setting DYLD_LIBRARY_PATH before calling delocate on macOS, the pyarrow._json module shadows the optional _json accelerator in the stdlib:
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/scanner.py", line 5, in <module>
from _json import make_scanner as c_make_scanner
File "pyarrow/_json.pyx", line 1, in init pyarrow._json
File "/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/cibw-run-d7ri7rrm/cp310-macosx_x86_64/build/venv/lib/python3.10/site-packages/pyarrow/__init__.py", line 65, in <module>
I was able to work around it by making sure that the appropriate lib-dynload folder was ahead of the pyarrow folder on DYLD_LIBRARY_PATH, but perhaps the module should be renamed in pyarrow? I think we should close this pull request in favor of one that renames pyarrow._json.
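Roughly, the workaround looked like this (the pyarrow path below is a placeholder, not the real CI path):

import os
import sysconfig

# Put the stdlib's lib-dynload directory (which contains the real
# _json accelerator) ahead of the pyarrow directory so the stdlib
# module wins the lookup.
dynload = os.path.join(sysconfig.get_path("stdlib"), "lib-dynload")
pyarrow_dir = "/path/to/site-packages/pyarrow"  # placeholder
os.environ["DYLD_LIBRARY_PATH"] = os.pathsep.join([dynload, pyarrow_dir])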
I'd also be happy to create an integration test where we build a wheel and repair it across the three platforms as part of that new issue.
Closing because it has been untouched for a while. If it's still relevant, feel free to reopen and move it forward 👍
Can you share how you solved it, @blink1073? I am getting this same error and am at the end of my rope here.
https://github.com/AequilibraE/aequilibrae/actions/runs/7985055367/job/21802866277
Hi @pedrocamargo, we ended up using this pattern:
import os

import numpy as np
import pyarrow as pa

# From https://arrow.apache.org/docs/python/integration/extending.html#example
# The NumPy C headers are currently required.
ext.include_dirs.append(np.get_include())
ext.include_dirs.append(pa.get_include())
ext.libraries.extend(pa.get_libraries())
ext.library_dirs.extend(pa.get_library_dirs())

# CREATE_LIBARROW_SYMLINKS is a flag we define elsewhere in our setup.
if os.name != "nt" and CREATE_LIBARROW_SYMLINKS:
    # On Linux and macOS, we must run pyarrow.create_library_symlinks()
    # as a user with write access to the directory where pyarrow is
    # installed.
    # See https://arrow.apache.org/docs/python/integration/extending.html#building-extensions-against-pypi-wheels.
    pa.create_library_symlinks()

if os.name == "posix":
    ext.extra_compile_args.append("-std=c++17")
elif os.name == "nt":
    ext.extra_compile_args.append("/std:c++17")
We also don't "repair" our wheel, other than to add appropriate tags on Linux.
That's helpful, thanks. I am admittedly out of my depth here, but the step we are failing at is exactly the repair stage. I'll try your route and see if the wheel still works as expected on other Linux distributions. Thanks!!