ARROW-17635: [Python][CI] Sync conda recipe with the arrow-cpp feedstock
Corresponds to status of feedstock as of https://github.com/conda-forge/arrow-cpp-feedstock/pull/848, minus obvious & intentional divergences in the setup here (with the exception of unpinning xsimd, which was pinned as of 9.0.0, but isn't anymore).
https://issues.apache.org/jira/browse/ARROW-17635
:warning: Ticket has not been started in JIRA, please click 'Start Progress'.
@github-actions crossbow submit -g conda
Revision: 4d94128d84ffd37c0ed561c71f6f1cee9f92ff7b
Submitted crossbow builds: ursacomputing/crossbow @ actions-07b5c91c58
Hmm, you might want to rebase and fix conflicts now that the C++17 PR was merged :-)
I think the config in our `tasks.yml` has to be updated to match the new file names, for example this line: https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L255. That's why the builds failed.
@h-vetinari I've never done it before, so could you share, or point me to where I can investigate, the process you followed to update these files? From what I understand this is automated; we should document how to do it for future updates. I am happy to document it if you point me to what needs to be done.
Thanks for the PR!
> I think the config in our `tasks.yml` has to be updated to match the new file names, for example this line: https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L255. That's why the builds failed.
That was very much correct thanks! :)
> @h-vetinari I've never done it before, could you share, or point me to where I can investigate, the process you followed to update these files? From what I understand this is automated, we should document how to do it for future updates. I am happy to document it if you point me on what needs to be done. Thanks for the PR!
Yes, of course. This comes more or less 1:1 from the so-called feedstock on the conda-forge side (though in this case it's still the state of a PR), where the recipe is maintained, and where all these `.ci_support/*.yaml` files get generated from the recipe & the conda-forge global pinning by our friendly automation bots.
Conda-forge is constantly updating builds against new versions (e.g. abseil/grpc/libprotobuf), and once arrow has been successfully built against those packages, the builds are published (by the CI on the feedstock) and the recipe is updated.
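For context, each generated `.ci_support/*.yaml` file records one concrete build configuration resolved from the recipe and the global pinning; the shape is roughly like this (values are illustrative, not taken from the actual feedstock):

```yaml
# Illustrative .ci_support variant file (example values, not verbatim):
c_compiler:
- gcc
c_compiler_version:
- '10'
cuda_compiler_version:
- None
libabseil:
- '20220623.0'
libprotobuf:
- '3.21'
```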
In this case, it's been a while since the last update, so our infrastructure has changed a bit. Also, the relatively scary-looking changes in `bld-arrow.bat` are mostly just simple alphabetizations (see the commit history on the feedstock).
I synced the recipe separately from the ci-files, taking care to undo some things that are intentionally different in arrow-CI:
```diff
diff --git a/dev/tasks/conda-recipes/arrow-cpp/bld-pyarrow.bat b/dev/tasks/conda-recipes/arrow-cpp/bld-pyarrow.bat
index e1073b563..bd20c79ef 100644
--- a/dev/tasks/conda-recipes/arrow-cpp/bld-pyarrow.bat
+++ b/dev/tasks/conda-recipes/arrow-cpp/bld-pyarrow.bat
@@ -1,16 +1,18 @@
+@echo on
 pushd "%SRC_DIR%"\python
 @rem the symlinks for cmake modules don't work here
-del cmake_modules\BuildUtils.cmake
-del cmake_modules\SetupCxxFlags.cmake
-del cmake_modules\CompilerInfo.cmake
-del cmake_modules\FindNumPy.cmake
-del cmake_modules\FindPythonLibsNew.cmake
-copy /Y "%SRC_DIR%\cpp\cmake_modules\BuildUtils.cmake" cmake_modules\
-copy /Y "%SRC_DIR%\cpp\cmake_modules\SetupCxxFlags.cmake" cmake_modules\
-copy /Y "%SRC_DIR%\cpp\cmake_modules\CompilerInfo.cmake" cmake_modules\
-copy /Y "%SRC_DIR%\cpp\cmake_modules\FindNumPy.cmake" cmake_modules\
-copy /Y "%SRC_DIR%\cpp\cmake_modules\FindPythonLibsNew.cmake" cmake_modules\
+@rem NOTE: In contrast to conda-forge, they work here as we clone from git.
+@rem del cmake_modules\BuildUtils.cmake
+@rem del cmake_modules\SetupCxxFlags.cmake
+@rem del cmake_modules\CompilerInfo.cmake
+@rem del cmake_modules\FindNumPy.cmake
+@rem del cmake_modules\FindPythonLibsNew.cmake
+@rem copy /Y "%SRC_DIR%\cpp\cmake_modules\BuildUtils.cmake" cmake_modules\
+@rem copy /Y "%SRC_DIR%\cpp\cmake_modules\SetupCxxFlags.cmake" cmake_modules\
+@rem copy /Y "%SRC_DIR%\cpp\cmake_modules\CompilerInfo.cmake" cmake_modules\
+@rem copy /Y "%SRC_DIR%\cpp\cmake_modules\FindNumPy.cmake" cmake_modules\
+@rem copy /Y "%SRC_DIR%\cpp\cmake_modules\FindPythonLibsNew.cmake" cmake_modules\
 SET ARROW_HOME=%LIBRARY_PREFIX%
 SET SETUPTOOLS_SCM_PRETEND_VERSION=%PKG_VERSION%
diff --git a/dev/tasks/conda-recipes/arrow-cpp/meta.yaml b/dev/tasks/conda-recipes/arrow-cpp/meta.yaml
index 107b8433a..fd0625296 100644
--- a/dev/tasks/conda-recipes/arrow-cpp/meta.yaml
+++ b/dev/tasks/conda-recipes/arrow-cpp/meta.yaml
@@ -1,6 +1,7 @@
-{% set version = "9.0.0" %}
+# NOTE: In contrast to the conda-forge recipe, ARROW_VERSION is a templated variable here.
+{% set version = ARROW_VERSION %}
 {% set cuda_enabled = cuda_compiler_version != "None" %}
-{% set build_ext_version = "3.0.0" %}
+{% set build_ext_version = ARROW_VERSION %}
 {% set build_ext = "cuda" if cuda_enabled else "cpu" %}
 {% set proc_build_number = "0" %}
 {% set llvm_version = "14" %}
@@ -10,15 +11,10 @@ package:
   version: {{ version }}

 source:
-  url: https://dist.apache.org/repos/dist/release/arrow/arrow-{{ version }}/apache-arrow-{{ version }}.tar.gz
-  sha256: a9a033f0a3490289998f458680d19579cf07911717ba65afde6cb80070f7a9b5
-  patches:
-    # backport of the following commit for compatibility with VS2019
-    # https://github.com/apache/arrow/commit/897c186f475f3dd82c1ab47e5cfb87cb0fed8440
-    - patches/0001-ARROW-17433-CI-C-Use-Visual-Studio-2019-on-AppVeyor-.patch
+  path: ../../../../

 build:
-  number: 6
+  number: 0
   # for cuda support, building with one version is enough to be compatible with
   # all later versions, since arrow is only using libcuda, and not libcudart.
   skip: true  # [(win or linux64) and cuda_compiler_version not in ("None", "10.2")]
@@ -105,9 +101,7 @@ outputs:
         - re2
         - snappy
         - thrift-cpp
-        # https://github.com/apache/arrow/blob/apache-arrow-9.0.0/cpp/cmake_modules/ThirdpartyToolchain.cmake#L2245
-        # currently pins to exact xsimd version
-        - xsimd 8.1.0
+        - xsimd
         - zlib
         - zstd
       run:
```
After fixing the paths to the CI-support files in `tasks.yml`, I decided to tackle a todo about the windows CUDA builds, which we also have on the feedstock. The one divergence is that I didn't enable the CUDA builds on aarch64, because these are currently run in emulation (and not cross-compiled), which takes ~6h instead of ~1h.
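For reference, the crossbow entries in `dev/tasks/tasks.yml` refer to the generated variant files by name, so feedstock-side renames have to be mirrored there. A rough sketch of such an entry (task name and paths are illustrative, not copied from the real file):

```yaml
# Illustrative crossbow task entry; the `config` value must match the basename
# of one of the generated .ci_support/*.yaml files.
conda-linux-x64-cpu-py310:
  ci: azure
  template: conda-recipes/azure.linux.yml
  params:
    config: linux_64_cuda_compiler_versionNone
```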
Hope that makes things clearer. 🙃
Though, to be frank, I really only know the conda-forge side of this, so for arrow-specific things, we should definitely ask @xhochy to double-check the work here.
@github-actions crossbow submit -g conda
Revision: fe737d3f03691a66eac223d6dcec8379d8faf863
Submitted crossbow builds: ursacomputing/crossbow @ actions-cde99ae1f3
Thanks for the detailed response!
> Yes of course. This comes more or less 1:1 from the so-called feedstock on the conda-forge side (though in this case it's still the state of a PR), where the recipe is maintained, and where all these `.ci_support/*.yaml` files get generated based on the recipe & the conda-forge global pinning by our friendly automation bots.
These `.ci_support/*.yaml` files are updated with `conda smithy rerender`? Is this done by the automation bots or on a periodic basis? I understand we should not do it in the arrow repo, as we should just pull the upstream feedstock ones.
> These `.ci_support/*.yaml` files are updated with `conda smithy rerender`? Is this done by the automation bots or on a periodic basis? I understand we should not do it in the arrow repo, as we should just pull the upstream feedstock ones.
It's possible to run this locally in principle, but it would be incomplete, because `conda smithy rerender` cares about conda-forge's CI setup (in `.azure-pipelines/*` etc.), and wouldn't correctly populate `dev/tasks/tasks.yml`. But at least the CI yamls could be generated, yes.
Another option might be to use the conda-forge feedstock as a submodule (in a fork controlled by arrow), where it would be possible to:
- carry a small patch along the lines of what I showed above in the details-tag
- rebase that patch semi-regularly & update the included submodule commit here to point to that
- point `dev/tasks/tasks.yml` to consume the files from within the submodule
This would have slightly higher impedance for development work that needs to touch the conda recipe (i.e. part of the work would be a commit to that forked feedstock and updating the PR to point to that commit), but with the benefit that we could sync the changes that accumulated throughout the development of a new arrow version faithfully to the feedstock, and easily sync back infra updates from conda-forge the other way.
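To make the submodule idea concrete, here is a rough sketch using throwaway local repos; all paths and names are hypothetical stand-ins for the real feedstock fork, not actual arrow infrastructure:

```shell
# Hypothetical sketch of the submodule workflow; local throwaway repos
# stand in for the forked feedstock and the arrow repo.
set -e
rm -rf /tmp/feedstock-demo && mkdir -p /tmp/feedstock-demo && cd /tmp/feedstock-demo

# Stand-in for the forked feedstock carrying the arrow-specific patch.
git init -q feedstock-fork
git -C feedstock-fork -c user.email=a@b -c user.name=demo \
    commit -q --allow-empty -m "recipe + arrow-specific patch"

# Stand-in for the arrow repo, consuming the fork as a submodule.
git init -q arrow && cd arrow
git -c protocol.file.allow=always \
    submodule add -q /tmp/feedstock-demo/feedstock-fork dev/tasks/feedstock
git -c user.email=a@b -c user.name=demo commit -qm "add feedstock submodule"

# Syncing later would just mean bumping the pinned commit, e.g.:
#   git -C dev/tasks/feedstock fetch && git -C dev/tasks/feedstock checkout <new-commit>
#   git add dev/tasks/feedstock && git commit -m "sync feedstock"
cat .gitmodules
```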
It seems `FindLLVMAlt.cmake` was recently refactored, and the recipe from conda-forge (adapted to 9.0.0) is not ready for that:
```
-- Could NOT find LLVM (missing: LLVM_DIR)
-- Could NOT find LLVM (missing: LLVM_DIR)
CMake Error at D:/bld/arrow-cpp-ext_1663151790677/_build_env/Library/share/cmake-3.24/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find LLVMAlt (missing: LLVM_PACKAGE_VERSION CLANG_EXECUTABLE
  LLVM_FOUND LLVM_LINK_EXECUTABLE)
```
Interestingly, this only fails during the `pyarrow` builds (for the CI jobs I looked at), after the `arrow-cpp` builds were already successful...
@h-vetinari Can we determine what LLVM is used for during the PyArrow build?
> @h-vetinari Can we determine what LLVM is used for during the PyArrow build?
Not sure I understand your question, but there's no LLVM other than the one we specify to be there. ;-)
The recipe as posted in this PR works for 9.0 (otherwise the feedstock would currently be broken), where we use LLVM 14 (though note that the llvm artefacts are a bit split up in conda-forge, i.e. we distinguish between llvm the library, clang, libcxx, openmp, etc.). I don't know what `LLVM_DIR` expects, but it would be quite inconvenient if this required the actual llvm-project sources.
Sorry. I guess my question was: what does PyArrow try to do with LLVM?
> Sorry. I guess my question was: what does PyArrow try to do with LLVM?
That's a good question. I'd have to go hunting through what gets triggered by `setup.py` - perhaps it's something not strictly necessary, perhaps it's a second step for the gandiva bitcode stuff, or something else entirely. I was hoping that people involved here would have more of an idea (as to what changed since 9.0).
Perhaps @kou (as the author of the changes in https://github.com/apache/arrow/pull/13892) knows.
We need LLVM headers to use Gandiva. (Some public Gandiva headers use LLVM headers.) If we build PyArrow with Gandiva support, we need LLVM as a build-time dependency.
> We need LLVM headers to use Gandiva. (Some public Gandiva headers use LLVM headers.) If we build PyArrow with Gandiva support, we need LLVM as a build-time dependency.
Yes, and LLVM + its headers are available at build time in the standard location under `$PREFIX`. CMake is easily able to find it based on that, but the custom (AFAICT) `FindLLVMAlt` has its own checks that fail to find it now (or need something more; is `LLVM_DIR` supposed to be `$PREFIX/lib` or `$PREFIX/include/llvm` or something else?)
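For what it's worth, LLVM installs a CMake package config (`LLVMConfig.cmake`), normally under `<prefix>/lib/cmake/llvm`; if `LLVM_DIR` is what the find module consults, that directory (rather than `$PREFIX/lib` or the headers) would be the usual value. A sketch, not taken from `FindLLVMAlt.cmake`:

```cmake
# Sketch assuming a conda-style prefix layout (an assumption, not Arrow's actual logic).
# LLVMConfig.cmake typically lives at <prefix>/lib/cmake/llvm, so one can point
# CMake at it explicitly, e.g.:  cmake -DLLVM_DIR="$PREFIX/lib/cmake/llvm" ..
find_package(LLVM 14 CONFIG)  # CONFIG mode consults LLVM_DIR / CMAKE_PREFIX_PATH
if(LLVM_FOUND)
  message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION} in ${LLVM_DIR}")
endif()
```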
> We need LLVM headers to use Gandiva. (Some public Gandiva headers use LLVM headers.)
Ouch. That's rather bad. I hadn't noticed that.
> > We need LLVM headers to use Gandiva. (Some public Gandiva headers use LLVM headers.)
>
> Ouch. That's rather bad. I hadn't noticed that.
Not sure it's bad (certainly not a problem from the POV of conda-forge), it's been running for a long time like that.
What changed recently is how those headers are detected.
> Not sure it's bad (certainly not a problem from the POV of conda-forge), it's been running for a long time like that.
Yeah, that was more of a general statement, not specifically about conda-forge. But exposing such a large dependency as LLVM in headers should never have been accepted IMHO (especially as there's probably no reason to do so).
> CMake is easily able to find it based on that, but the custom (AFAICT) `FindLLVMAlt` has its own checks that fail to find it now (or need something more; is `LLVM_DIR` supposed to be `$PREFIX/lib` or `$PREFIX/include/llvm` or something else?)
Could you share the URL for the error?
> But exposing such a large dependency as LLVM in headers should never have been accepted IMHO (especially as there's probably no reason to do so).
Let's discuss this on [email protected] with Gandiva maintainers.
> Could you share the URL for the error?
It's in the crossbow runs from this comment. The updates in this PR correspond to the state of the feedstock, with (essentially) the only change being that here the recipe builds from the master branch rather than from a tagged version, i.e. after importing the `recipe` folder into `dev/tasks/conda-recipes/arrow-cpp`, I did:
```diff
diff --git a/dev/tasks/conda-recipes/arrow-cpp/meta.yaml b/dev/tasks/conda-recipes/arrow-cpp/meta.yaml
index 107b8433a..fd0625296 100644
--- a/dev/tasks/conda-recipes/arrow-cpp/meta.yaml
+++ b/dev/tasks/conda-recipes/arrow-cpp/meta.yaml
@@ -1,6 +1,7 @@
-{% set version = "9.0.0" %}
+# NOTE: In contrast to the conda-forge recipe, ARROW_VERSION is a templated variable here.
+{% set version = ARROW_VERSION %}
 {% set cuda_enabled = cuda_compiler_version != "None" %}
-{% set build_ext_version = "3.0.0" %}
+{% set build_ext_version = ARROW_VERSION %}
 {% set build_ext = "cuda" if cuda_enabled else "cpu" %}
 {% set proc_build_number = "0" %}
 {% set llvm_version = "14" %}
```
(which restored the previous state)
Thanks. It seems that the LLVM-related packages were removed from pyarrow's host requirements: https://github.com/apache/arrow/pull/14102/files#diff-e1dc9dc44f82817c013424829588fc3fe48e63c78b42ce47e012016a243de09bL187-L189
Could you restore them?
> Could you restore them?
Yes! I did add the llvmdev headers & tools now, though I'm not sure about clangdev. Do you need `libclang` / `libclang-cpp` as well? Can we make this more specific than the whole compiler + all corresponding libraries & binaries?
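For reference, the change boils down to restoring entries along these lines in the pyarrow output's host requirements (a sketch following conda-forge naming conventions; the exact pins and surrounding entries are assumptions, not copied from the recipe):

```yaml
# Sketch of the pyarrow output's host requirements in meta.yaml
# (illustrative; exact pins and neighbouring entries are assumptions):
host:
  - {{ pin_subpackage('arrow-cpp', exact=True) }}
  - llvmdev {{ llvm_version }}    # LLVM headers & tools; Gandiva headers include LLVM headers
  - clangdev {{ llvm_version }}   # provides the clang executable checked by FindLLVMAlt.cmake
  - python
  - numpy
```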
@github-actions crossbow submit -g conda
I think that we don't need the `clang`-related packages, but I'm not sure. Let's try.
Revision: 30c971236b8d88519dc7101cb62470bf2d8dca50
Submitted crossbow builds: ursacomputing/crossbow @ actions-4afbcae72b
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=34819&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=3847
```
Could NOT find LLVMAlt (missing: CLANG_EXECUTABLE)
```
`clang` is needed for our `FindLLVMAlt.cmake`. (`LLVMConfig.cmake` doesn't require it.)

Could you add `clang` too for now?
> Could you add `clang` too for now?
Done
> `clang` is needed for our `FindLLVMAlt.cmake`. (`LLVMConfig.cmake` doesn't require it.)
That really sounds to me like `FindLLVMAlt` should learn to find llvm without clang, as these are pretty heavy dependencies we shouldn't be adding gratuitously to the build.