Use CMake for Build System
Proof of concept. I think we can greatly simplify our building system using this. @mroeschke @lithomas1 @Dr-Irv
This is in an intermediate state, but you can do:
python setup.py
cmake .
make -j$(nproc)
make clean will clean up build artifacts, which right now excludes the generated Cython files but should include them in the future.
Have the following TODOs:
[ ]: The CMAKE_SHARED_LIBRARY_SUFFIX is hard-coded to assume Py38 on linux
[ ]: Get rid of python setup.py and have CMake perform cythonization
[ ]: Update CI, documentation, etc...
I tried this on Windows. Doing python setup.py caused the cythonize step to be done, but it was done serially. Can't that be done in parallel as when we do python setup.py build_ext -j 8 ?
Right now, cmake is not working, but I'm not sure I have the latest build tools installed.
Yea it could be. I didn't put a lot of time into that yet but if you want to explore would be great. I ideally want CMake to build the Cython extensions. That way running make clean should clean up those generated files, while also would decouple a hard dependency on setuptools.
This is hard coded right now for Linux. For windows / your python version, you likely need to change like 8 of the top level CMake file to match the expected file ending for your platform:
set(CMAKE_SHARED_LIBRARY_SUFFIX .cpython-38-x86_64-linux-gnu.so)
This might be abstracted in some form via the CMake Python variables. If not something we can branch on in our file and upstream:
https://cmake.org/cmake/help/latest/module/FindPython.html#result-variables
Well, I'm not sure I have the time nor understanding to get this working on Windows!
Well, I'm not sure I have the time nor understanding to get this working on Windows!
Yah, i expect this would be easier to get working on Mac, but it still involves a learning curve. I'm fine embracing that for personal edification, curious how others feel
Well, I'm not sure I have the time nor understanding to get this working on Windows!
Ah sorry...didn't think about make not working on Windows. I think you can do
cmake .
cmake --build .
On all systems to get this to work for an "in source" build
So now that I'm on Visual Studio 2019 build tools, there is an issue with the cmake version when doing cmake . .
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
CMake 3.21 or higher is required. You are running version
3.15.19080502-MSVC_2
The version of cmake used here is the one from VS 2019.
Secondly, python setup.py clean doesn't work, and it should to make sure I'm doing a build from scratch.
I think we can drop the minimum CMake version to 3.14 which would help. I just pulled the 3.22 off their documentation, but the NumPy finder was added in 3.14
For cleaning, you could ideally do cmake --clean . That will automatically get rid of files generated via build. In its current form that excludes Cython generated files, but in a future iteration it should do all of it
I made the following changes to CMakeLists.txt, and got a lot further on Windows:
cmake_minimum_required(VERSION 3.15)
set(CMAKE_SHARED_LIBRARY_SUFFIX .cp38-win_amd64.pyd)
Then I discovered that you have to specify the architecture to cmake. So I did cmake -A x64 --clean . I then got more adventurous and did a parallel build via cmake --build . --config Release --parallel 8 , discovering the parameters via some google searching. The build succeeded, using all my cores, BUT the pyd files ended up in subdirectories called Release of pandas/_libs/ and pandas/_libs/tslibs/ etc.
If I don't include --config Release it looks for a debugging version of the python library and fails.
Don't know ANYTHING about cmake to figure out how to get the files out of those Release subdirectories into the right place.
Also not sure that the --clean is removing previous files correctly (e.g., .obj files).
Awesome! I think you can tweak those settings via some CMAKE variables:
https://stackoverflow.com/questions/45368513/how-to-prevent-cmake-from-creating-debug-release-folders-under-executable-output
FYI the idiomatic approach to CMake is to do an "out of source" build, where we would have a dedicated build sub folder and all required elements would be copied into that. In such a case, I don't think the Release / Debug sub folders generated by VS would matter much.
However, we today do in source builds of pandas. So want to make sure we emulate that as best as possible for initial lift and shift
However, we today do in source builds of pandas. So want to make sure we emulate that as best as possible for initial lift and shift
Once you figure that out for Linux, I hope it will be straightforward to do for Windows. I'll wait to hear about your progress.
Sounds good. Thanks for the feedback so far - very helpful
I'd be supportive of trying CMake. Generally hoping
- Not too much deviations across platforms Linux/MacOS/Windows
- Near identical setup & invocation for local development & CI system
This should be closer to cross platform. Some notes since @Dr-Irv last tried.
- The min required version is 3.17 (released March 2020). This is because the Python_SOABI variable was introduced then, which gives us the required platform identifier
- Python_SOABI seems to be missing the extension. I'm guessing this is pyd on Windows from what @Dr-Irv suggested before
Since the Cython files are generated during compilation, I think these should work in a cross-platform manner:
cmake .
cmake --build . --parallel
cmake --build . --target clean
- The min required version is 3.17 (released March 2020). This is because the Python_SOABI variable was introduced then, which gives us the required platform identifier
I think that may require an update to Visual Studio 2022 build tools on Windows, and right now the pandas development docs say to use VS 2019, which has cmake 3.15. So if we are to use this on Windows, we'd have to do one of the following:
- figure out how to update
cmakein a VS 2019 installation and document it accordingly - move to VS 2022 and see if pandas can be built using the newer version.
I'm not going to be experimenting with that!!!
IIUC scipy recentlyish changed to use meson. Is the ecosystem converging on any build patterns?
Good question to which I do not know the answer. My only reason for picking CMake was familiarity with it through some Arrow work
SciPy wrote a blog post about moving to Meson: https://labs.quansight.org/blog/2021/07/moving-scipy-to-meson/
It has some useful links and discussion (note that a year later and with various collaboration between SciPy <--> Meson, not everything described there is still the same, in particular we added native Cython support, fixed the prefix issue, and SciPy now uses https://pypi.org/project/meson-python/ as a native build backend to run).
In the comments, there's also a link discussing a proposal to improve the build backend that uses cmake (scikit-build).
Whichever build system you choose, you do NOT want to have to manually glue setuptools and something else together.
IIUC scipy recentlyish changed to use meson. Is the ecosystem converging on any build patterns?
I'm pretty sure numpy wants to use meson as well. (cc @rgommers).
I'm a little worried about CMake since I have no experience with it, and I am currently the only other person who maintains the wheel builds other than @simonjayhawkins.
Whatever we do, a clean out-of-the box build of pandas should just work, and there should be no extra dependencies not installed from PyPI(other than the C Compiler).
I'm pretty sure numpy wants to use meson as well. (cc @rgommers).
Yes indeed. NumPy and scikit-learn are planning to move to Meson, following SciPy, and I think other projects in the ecosystem are going to follow that lead. The SciPy 1.9.0 release (close to ready, RC1 hopefully this week) is the first release to default to Meson.
As for CMake, it is clearly a much more capable build system than distutils/setuptools, so it would be an improvement to move to it. For SciPy, CMake was the only other build system we seriously considered besides Meson. There's discussion on the choice in https://github.com/scipy/scipy/issues/13615. To summarize the main reasons:
- (a) Meson is written in Python so easier to contribute to,
- (b) Meson has much better documentation (although there's good third-party CMake docs/books), and
- (c) Meson is overall a more well-designed and easier to use build system. This is an opinion clearly and not everyone will agree, but I think it has been confirmed by SciPy devs who were only familiar with CMake beforehand, and found Meson more pleasant to touch. If CMake would remove everything that is legacy or "not best-practice", it would probably be similar to Meson in usability and user-friendliness. The trouble is that the CMake devs won't do that, and the sum total of CMake is a bit of a mess.
I'll also add, now that the SciPy migration is mostly done, that the Meson devs have been quite helpful and responsive to questions and issues (they implemented Cython support for example, and improved usability around virtualenvs).
A few thoughts on this PR and the proposed Pandas change:
- I see that the diff here deletes
setup.py. I recommend not doing that, but having two build systems in parallel for one release until you know everything works (because it won't - niche platforms and more niche use cases are hard to test for example). - What Pandas needs from a build system is not that much, so it's not that difficult to experiment. Either CMake or Meson should have all you need build-wise.
- The more important topic here is what you need Python packaging-wise. The
pyproject.toml(PEP 517) hook mechanism is immature, and it was a big job to build and debug meson-python. Neither CMake nor Meson know about the details of Python packaging orpyproject.toml, so you need that kind of a "shim package". For CMake, that is scikit-build. It still heavily depends onsetuptoolsand needs an overhaul. That's not as much of a problem for Pandas as it would have been for SciPy or NumPy, because latestsetuptoolsis incompatible withnumpy.distutils- which you don't need in Pandas. That said, I would investigate that, as @eli-schwartz also pointed out above. - From my perspective it would be preferable synergy-wise if all packages near the core of the PyData ecosystem made the same build system choice.
- That said, there is also something to say for having two viable build system choices for packages with compiled code. It's more effort overall, but in the long run it could be healthy.
- I'll note that large projects that use (or plan to use) CMake, like PyTorch or RAPIDS, do typically not distribute sdists on PyPI (or don't support PyPI at all in the RAPIDS case), so they don't have to worry about the
pyproject.toml/ PEP 517 support that's still lacking as of today.
- I'll also note that support-wise, there is a NASA grant across NumPy, Pandas, SciPy and scikit-learn which was recently awarded (we'll need to write a blog post about that next week to share the details). This grant covers 0.5 FTE for 3 years specifically for build & packaging work. For that work it would be great if these four packages used the same build system. That said, if Pandas is going to make a different choice, I'm sure we can find some CMake expertise to help support that choice if needed (significant work on overhauling scikit-build is out of scope though).
- In the end what is most important is that the maintainers who do the heavy lifting on build & packaging issues are happy with the final choice.
Thanks @eli-schwartz and @rgommers for the feedback- lots of great consideration points. I'll give meson a look along with this. Sounds like Meson is great for the packages you've pointed out and Cmake gets us closer to Arrow (@jorisvandenbossche may have insight for that).
Either choice should represent a nice upgrade over our current system
This is ready for review. In this process a few tests changed around that all seemed to exhibit flakey cross platform behavior. I'm guessing the build flags might be responsible for that but didn't want to go down a rabbit hole with that just yet.
This currently replaces python setup.py build_ext --inplace while still relying on setuptools for sdist and wheel creation. This creates the libraries in the source tree; in the future we may want to consider out of source builds for a more idiomatic approach
python setup.py clean --all now becomes cmake --build . --target clean cross platform, or can shorten that to make clean on Unix-like platforms
I installed cmake into my conda environment on Windows. removed all cmakefiles from previous trials. Then pulled down the PR, and got this:
cmake --build . --target clean
Error: could not load cache
@Dr-Irv did you run cmake . first? That should configure everything and create CMakeCache.txt
@Dr-Irv did you run
cmake .first? That should configure everything and create CMakeCache.txt
OK, now I did that, but had to run it twice. Here is the output:
CMake Error: CMAKE_GENERATOR was set but the specified generator doesn't exist. Using CMake default.
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19043.
-- The C compiler identification is MSVC 19.32.31329.0
-- The CXX compiler identification is MSVC 19.32.31329.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.32.31326/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.32.31326/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Python3: C:/Anaconda3/envs/pandas-dev/python.exe (found version "3.8.12") found components: Interpreter Development NumPy Development.Module Development.Embed
-- Configuring incomplete, errors occurred!
See also "C:/Code/pandas_dev/pandas/CMakeFiles/CMakeOutput.log".
There was nothing in the log that indicated what went wrong. Then I ran it again, and got:
-- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19043.
-- Configuring done
-- Generating done
-- Build files have been written to: C:/Code/pandas_dev/pandas
Tried the clean target and got:
cmake --build . --target clean
Microsoft (R) Build Engine version 17.2.1+52cd2da31 for .NET Framework
Copyright (C) Microsoft Corporation. All rights reserved.
Then left out the target, and got a bunch of errors related to not finding python38_d.lib.
Then tried:
cmake --build . --config Release
and it did a lot of building, but there were errors along the way, all similar to this (only showing one of them):
Generating writers.c
Traceback (most recent call last):
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\numpy\__init__.py", li
ne 144, in <module>
from . import core
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\numpy\core\__init__.py
", line 9, in <module>
from numpy.version import version as __version__
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\numpy\version.py", lin
e 3, in <module>
from ._version import get_versions
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\numpy\_version.py", li
ne 7, in <module>
import json
SystemError: initialization of json raised unreported exception
Traceback (most recent call last):
File "C:\Anaconda3\envs\pandas-dev\lib\runpy.py", line 194, in _run_module_
as_main
return _run_code(code, main_globals, None,
File "C:\Anaconda3\envs\pandas-dev\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Anaconda3\envs\pandas-dev\lib\site-packages\cython.py", line 17, i
n <module>
main(command_line = 1)
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Main.p
y", line 858, in main
result = compile(sources, options)
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Main.p
y", line 780, in compile
return compile_multiple(source, options)
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Main.p
y", line 750, in compile_multiple
context = options.create_context()
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Main.p
y", line 596, in create_context
return Context(self.include_path, self.compiler_directives,
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Main.p
y", line 80, in __init__
from . import Builtin, CythonScope
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Cython
Scope.py", line 5, in <module>
from .UtilityCode import CythonUtilityCode
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Utilit
yCode.py", line 3, in <module>
from .TreeFragment import parse_from_strings, StringParseContext
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\TreeFr
agment.py", line 21, in <module>
from . import Parsing
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Parsin
g.py", line 31, in <module>
from .ModuleNode import ModuleNode
File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Module
Node.py", line 13, in <module>
import json
ImportError: numpy.core.multiarray failed to import
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsof
t\VC\v170\Microsoft.CppCommon.targets(245,5): error MSB8066: Custom build for '
C:\Code\pandas_dev\pandas\CMakeFiles\30b853d5621d326296e9c444a7ad4744\writers.c
.rule;C:\Code\pandas_dev\pandas\pandas\_libs\CMakeLists.txt' exited with code 1
. [C:\Code\pandas_dev\pandas\pandas\_libs\writers.vcxproj]
CMake Error: CMAKE_GENERATOR was set but the specified generator doesn't exist. Using CMake default.
I'm guessing this might be the error?
Then left out the target, and got a bunch of errors related to not finding
python38_d.lib
IIUC VS builds things differently. On Unix platforms you specify release / debug at configure time. MSVC will generate release and debug simultaneously at build time if you don't specify --config Release
You might need to clean and try again for the remaining errors. My guess is some of the debug / release profiles generated by VS got mixed up
You might need to clean and try again for the remaining errors. My guess is some of the debug / release profiles generated by VS got mixed up
I did that before doing the test above. Manually removed all the extra files that had been created and started over.
Yea I think the different build types got mixed up when running the build without specifying --config Release though
Yea I think the different build types got mixed up when running the build without specifying
--config Releasethough
I'm still stuck with those errors about numpy.core.multiarray failed to import
Definitely on the tip of this branch right? There were some previous issues where we would build an internal json module and then import json From numpy or Cython would pick that up instead of the stdlib. Seems similar to the traceback you have. Worth double checking there is no json.dll in your pandas._libs directory