pandas icon indicating copy to clipboard operation
pandas copied to clipboard

Use CMake for Build System

Open WillAyd opened this issue 3 years ago • 57 comments

Proof of concept. I think we can greatly simplify our building system using this. @mroeschke @lithomas1 @Dr-Irv

This is in an intermediate state, but you can do:

python setup.py
cmake .
make -j$(nproc)

make clean will clean up build artifacts, which right now excludes the generated Cython files but should include them in the future.

Have the following TODOs: [ ]: The CMAKE_SHARED_LIBRARY_SUFFIX is hard-coded to assume Py38 on linux [ ]: Get rid of python setup.py and have CMake perform cythonization [ ]: Update CI, documentation, etc...

WillAyd avatar Jun 16 '22 06:06 WillAyd

I tried this on Windows. Doing python setup.py caused the cythonize step to be done, but it was done serially. Can't that be done in parallel as when we do python setup.py build_ext -j 8 ?

Right now, cmake is not working, but I'm not sure I have the latest build tools installed.

Dr-Irv avatar Jun 16 '22 15:06 Dr-Irv

Yea it could be. I didn't put a lot of time into that yet but if you want to explore would be great. I ideally want CMake to build the Cython extensions. That way running make clean should clean up those generated files, while also would decouple a hard dependency on setuptools.

This is hard coded right now for Linux. For windows / your python version, you likely need to change like 8 of the top level CMake file to match the expected file ending for your platform:

set(CMAKE_SHARED_LIBRARY_SUFFIX .cpython-38-x86_64-linux-gnu.so)

This might be abstracted in some form via the CMake Python variables. If not something we can branch on in our file and upstream:

https://cmake.org/cmake/help/latest/module/FindPython.html#result-variables

WillAyd avatar Jun 16 '22 16:06 WillAyd

Well, I'm not sure I have the time nor understanding to get this working on Windows!

Dr-Irv avatar Jun 16 '22 16:06 Dr-Irv

Well, I'm not sure I have the time nor understanding to get this working on Windows!

Yah, i expect this would be easier to get working on Mac, but it still involves a learning curve. I'm fine embracing that for personal edification, curious how others feel

jbrockmendel avatar Jun 16 '22 16:06 jbrockmendel

Well, I'm not sure I have the time nor understanding to get this working on Windows!

Ah sorry...didn't think about make not working on Windows. I think you can do

cmake .
cmake --build .

On all systems to get this to work for an "in source" build

WillAyd avatar Jun 16 '22 17:06 WillAyd

So now that I'm on Visual Studio 2019 build tools, there is an issue with the cmake version when doing cmake . .

CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
  CMake 3.21 or higher is required.  You are running version
  3.15.19080502-MSVC_2

The version of cmake used here is the one from VS 2019.

Secondly, python setup.py clean doesn't work, and it should to make sure I'm doing a build from scratch.

Dr-Irv avatar Jun 16 '22 17:06 Dr-Irv

I think we can drop the minimum CMake version to 3.14 which would help. I just pulled the 3.22 off their documentation, but the NumPy finder was added in 3.14

For cleaning, you could ideally do cmake --clean . That will automatically get rid of files generated via build. In its current form that excludes Cython generated files, but in a future iteration it should do all of it

WillAyd avatar Jun 16 '22 17:06 WillAyd

I made the following changes to CMakeLists.txt, and got a lot further on Windows:

cmake_minimum_required(VERSION 3.15)

set(CMAKE_SHARED_LIBRARY_SUFFIX .cp38-win_amd64.pyd)

Then I discovered that you have to specify the architecture to cmake. So I did cmake -A x64 --clean . I then got more adventurous and did a parallel build via cmake --build . --config Release --parallel 8 , discovering the parameters via some google searching. The build succeeded, using all my cores, BUT the pyd files ended up in subdirectories called Release of pandas/_libs/ and pandas/_libs/tslibs/ etc.

If I don't include --config Release it looks for a debugging version of the python library and fails.

Don't know ANYTHING about cmake to figure out how to get the files out of those Release subdirectories into the right place.

Also not sure that the --clean is removing previous files correctly (e.g., .obj files).

Dr-Irv avatar Jun 16 '22 18:06 Dr-Irv

Awesome! I think you can tweak those settings via some CMAKE variables:

https://stackoverflow.com/questions/45368513/how-to-prevent-cmake-from-creating-debug-release-folders-under-executable-output

WillAyd avatar Jun 16 '22 19:06 WillAyd

FYI the idiomatic approach to CMake is to do an "out of source" build, where we would have a dedicated build sub folder and all required elements would be copied into that. In such a case, I don't think the Release / Debug sub folders generated by VS would matter much.

However, we today do in source builds of pandas. So want to make sure we emulate that as best as possible for initial lift and shift

WillAyd avatar Jun 16 '22 19:06 WillAyd

However, we today do in source builds of pandas. So want to make sure we emulate that as best as possible for initial lift and shift

Once you figure that out for Linux, I hope it will be straightforward to do for Windows. I'll wait to hear about your progress.

Dr-Irv avatar Jun 16 '22 20:06 Dr-Irv

Sounds good. Thanks for the feedback so far - very helpful

WillAyd avatar Jun 16 '22 20:06 WillAyd

I'd be supportive of trying CMake. Generally hoping

  1. Not too much deviations across platforms Linux/MacOS/Windows
  2. Near identical setup & invocation for local development & CI system

mroeschke avatar Jun 16 '22 20:06 mroeschke

This should be closer to cross platform. Some notes since @Dr-Irv last tried.

  1. The min required version is 3.17 (released March 2020). This is because the Python_SOABI variable was introduced then, which gives us the required platform identifier
  2. Python_SOABI seems to be missing the extension. I'm guessing this is pyd on Windows from what @Dr-Irv suggested before

Since the Cython files are generated during compilation, I think these should work in a cross-platform manner:

cmake .
cmake --build . --parallel
cmake --build . --target clean

WillAyd avatar Jun 17 '22 00:06 WillAyd

  1. The min required version is 3.17 (released March 2020). This is because the Python_SOABI variable was introduced then, which gives us the required platform identifier

I think that may require an update to Visual Studio 2022 build tools on Windows, and right now the pandas development docs say to use VS 2019, which has cmake 3.15. So if we are to use this on Windows, we'd have to do one of the following:

  • figure out how to update cmake in a VS 2019 installation and document it accordingly
  • move to VS 2022 and see if pandas can be built using the newer version.

I'm not going to be experimenting with that!!!

Dr-Irv avatar Jun 17 '22 12:06 Dr-Irv

IIUC scipy recentlyish changed to use meson. Is the ecosystem converging on any build patterns?

jbrockmendel avatar Jun 17 '22 20:06 jbrockmendel

Good question to which I do not know the answer. My only reason for picking CMake was familiarity with it through some Arrow work

WillAyd avatar Jun 17 '22 20:06 WillAyd

SciPy wrote a blog post about moving to Meson: https://labs.quansight.org/blog/2021/07/moving-scipy-to-meson/

It has some useful links and discussion (note that a year later and with various collaboration between SciPy <--> Meson, not everything described there is still the same, in particular we added native Cython support, fixed the prefix issue, and SciPy now uses https://pypi.org/project/meson-python/ as a native build backend to run).

In the comments, there's also a link discussing a proposal to improve the build backend that uses cmake (scikit-build).

Whichever build system you choose, you do NOT want to have to manually glue setuptools and something else together.

eli-schwartz avatar Jun 17 '22 23:06 eli-schwartz

IIUC scipy recentlyish changed to use meson. Is the ecosystem converging on any build patterns?

I'm pretty sure numpy wants to use meson as well. (cc @rgommers).

I'm a little worried about CMake since I have no experience with it, and I am currently the only other person who maintains the wheel builds other than @simonjayhawkins.

Whatever we do, a clean out-of-the box build of pandas should just work, and there should be no extra dependencies not installed from PyPI(other than the C Compiler).

lithomas1 avatar Jun 17 '22 23:06 lithomas1

I'm pretty sure numpy wants to use meson as well. (cc @rgommers).

Yes indeed. NumPy and scikit-learn are planning to move to Meson, following SciPy, and I think other projects in the ecosystem are going to follow that lead. The SciPy 1.9.0 release (close to ready, RC1 hopefully this week) is the first release to default to Meson.

As for CMake, it is clearly a much more capable build system than distutils/setuptools, so it would be an improvement to move to it. For SciPy, CMake was the only other build system we seriously considered besides Meson. There's discussion on the choice in https://github.com/scipy/scipy/issues/13615. To summarize the main reasons:

  • (a) Meson is written in Python so easier to contribute to,
  • (b) Meson has much better documentation (although there's good third-party CMake docs/books), and
  • (c) Meson is overall a more well-designed and easier to use build system. This is an opinion clearly and not everyone will agree, but I think it has been confirmed by SciPy devs who were only familiar with CMake beforehand, and found Meson more pleasant to touch. If CMake would remove everything that is legacy or "not best-practice", it would probably be similar to Meson in usability and user-friendliness. The trouble is that the CMake devs won't do that, and the sum total of CMake is a bit of a mess.

I'll also add, now that the SciPy migration is mostly done, that the Meson devs have been quite helpful and responsive to questions and issues (they implemented Cython support for example, and improved usability around virtualenvs).

A few thoughts on this PR and the proposed Pandas change:

  • I see that the diff here deletes setup.py. I recommend not doing that, but having two build systems in parallel for one release until you know everything works (because it won't - niche platforms and more niche use cases are hard to test for example).
  • What Pandas needs from a build system is not that much, so it's not that difficult to experiment. Either CMake or Meson should have all you need build-wise.
  • The more important topic here is what you need Python packaging-wise. The pyproject.toml (PEP 517) hook mechanism is immature, and it was a big job to build and debug meson-python. Neither CMake nor Meson know about the details of Python packaging or pyproject.toml, so you need that kind of a "shim package". For CMake, that is scikit-build. It still heavily depends on setuptools and needs an overhaul. That's not as much of a problem for Pandas as it would have been for SciPy or NumPy, because latest setuptools is incompatible with numpy.distutils - which you don't need in Pandas. That said, I would investigate that, as @eli-schwartz also pointed out above.
  • From my perspective it would be preferable synergy-wise if all packages near the core of the PyData ecosystem made the same build system choice.
    • That said, there is also something to say for having two viable build system choices for packages with compiled code. It's more effort overall, but in the long run it could be healthy.
    • I'll note that large projects that use (or plan to use) CMake, like PyTorch or RAPIDS, do typically not distribute sdists on PyPI (or don't support PyPI at all in the RAPIDS case), so they don't have to worry about the pyproject.toml / PEP 517 support that's still lacking as of today.
  • I'll also note that support-wise, there is a NASA grant across NumPy, Pandas, SciPy and scikit-learn which was recently awarded (we'll need to write a blog post about that next week to share the details). This grant covers 0.5 FTE for 3 years specifically for build & packaging work. For that work it would be great if these four packages used the same build system. That said, if Pandas is going to make a different choice, I'm sure we can find some CMake expertise to help support that choice if needed (significant work on overhauling scikit-build is out of scope though).
  • In the end what is most important is that the maintainers who do the heavy lifting on build & packaging issues are happy with the final choice.

rgommers avatar Jun 18 '22 11:06 rgommers

Thanks @eli-schwartz and @rgommers for the feedback- lots of great consideration points. I'll give meson a look along with this. Sounds like Meson is great for the packages you've pointed out and Cmake gets us closer to Arrow (@jorisvandenbossche may have insight for that).

Either choice should represent a nice upgrade over our current system

WillAyd avatar Jun 18 '22 21:06 WillAyd

This is ready for review. In this process a few tests changed around that all seemed to exhibit flakey cross platform behavior. I'm guessing the build flags might be responsible for that but didn't want to go down a rabbit hole with that just yet.

This currently replaces python setup.py build_ext --inplace while still relying on setuptools for sdist and wheel creation. This creates the libraries in the source tree; in the future we may want to consider out of source builds for a more idiomatic approach

python setup.py clean --all now becomes cmake --build . --target clean cross platform, or can shorten that to make clean on Unix-like platforms

WillAyd avatar Jun 21 '22 03:06 WillAyd

I installed cmake into my conda environment on Windows. removed all cmakefiles from previous trials. Then pulled down the PR, and got this:

cmake --build . --target clean
Error: could not load cache

Dr-Irv avatar Jun 21 '22 14:06 Dr-Irv

@Dr-Irv did you run cmake . first? That should configure everything and create CMakeCache.txt

WillAyd avatar Jun 21 '22 15:06 WillAyd

@Dr-Irv did you run cmake . first? That should configure everything and create CMakeCache.txt

OK, now I did that, but had to run it twice. Here is the output:

CMake Error: CMAKE_GENERATOR was set but the specified generator doesn't exist. Using CMake default.
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19043.
-- The C compiler identification is MSVC 19.32.31329.0
-- The CXX compiler identification is MSVC 19.32.31329.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.32.31326/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.32.31326/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Python3: C:/Anaconda3/envs/pandas-dev/python.exe (found version "3.8.12") found components: Interpreter Development NumPy Development.Module Development.Embed
-- Configuring incomplete, errors occurred!
See also "C:/Code/pandas_dev/pandas/CMakeFiles/CMakeOutput.log".

There was nothing in the log that indicated what went wrong. Then I ran it again, and got:

-- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19043.
-- Configuring done
-- Generating done
-- Build files have been written to: C:/Code/pandas_dev/pandas

Tried the clean target and got:

cmake --build . --target clean
Microsoft (R) Build Engine version 17.2.1+52cd2da31 for .NET Framework
Copyright (C) Microsoft Corporation. All rights reserved.

Then left out the target, and got a bunch of errors related to not finding python38_d.lib. Then tried:

cmake --build . --config Release 

and it did a lot of building, but there were errors along the way, all similar to this (only showing one of them):

  Generating writers.c
  Traceback (most recent call last):
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\numpy\__init__.py", li
  ne 144, in <module>
      from . import core
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\numpy\core\__init__.py
  ", line 9, in <module>
      from numpy.version import version as __version__
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\numpy\version.py", lin
  e 3, in <module>
      from ._version import get_versions
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\numpy\_version.py", li
  ne 7, in <module>
      import json
  SystemError: initialization of json raised unreported exception
  Traceback (most recent call last):
    File "C:\Anaconda3\envs\pandas-dev\lib\runpy.py", line 194, in _run_module_
  as_main
      return _run_code(code, main_globals, None,
    File "C:\Anaconda3\envs\pandas-dev\lib\runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "C:\Anaconda3\envs\pandas-dev\lib\site-packages\cython.py", line 17, i
  n <module>
      main(command_line = 1)
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Main.p
  y", line 858, in main
      result = compile(sources, options)
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Main.p
  y", line 780, in compile
      return compile_multiple(source, options)
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Main.p
  y", line 750, in compile_multiple
      context = options.create_context()
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Main.p
  y", line 596, in create_context
      return Context(self.include_path, self.compiler_directives,
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Main.p
  y", line 80, in __init__
      from . import Builtin, CythonScope
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Cython
  Scope.py", line 5, in <module>
      from .UtilityCode import CythonUtilityCode
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Utilit
  yCode.py", line 3, in <module>
      from .TreeFragment import parse_from_strings, StringParseContext
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\TreeFr
  agment.py", line 21, in <module>
      from . import Parsing
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Parsin
  g.py", line 31, in <module>
      from .ModuleNode import ModuleNode
    File "C:\Anaconda3\envs\pandas-dev\Lib\site-packages\Cython\Compiler\Module
  Node.py", line 13, in <module>
      import json
  ImportError: numpy.core.multiarray failed to import
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsof
t\VC\v170\Microsoft.CppCommon.targets(245,5): error MSB8066: Custom build for '
C:\Code\pandas_dev\pandas\CMakeFiles\30b853d5621d326296e9c444a7ad4744\writers.c
.rule;C:\Code\pandas_dev\pandas\pandas\_libs\CMakeLists.txt' exited with code 1
. [C:\Code\pandas_dev\pandas\pandas\_libs\writers.vcxproj]

Dr-Irv avatar Jun 21 '22 16:06 Dr-Irv

CMake Error: CMAKE_GENERATOR was set but the specified generator doesn't exist. Using CMake default.

I'm guessing this might be the error?

Then left out the target, and got a bunch of errors related to not finding python38_d.lib

IIUC VS builds things differently. On Unix platforms you specify release / debug at configure time. MSVC will generate release and debug simultaneously at build time if you don't specify --config Release

You might need to clean and try again for the remaining errors. My guess is some of the debug / release profiles generated by VS got mixed up

WillAyd avatar Jun 21 '22 16:06 WillAyd

You might need to clean and try again for the remaining errors. My guess is some of the debug / release profiles generated by VS got mixed up

I did that before doing the test above. Manually removed all the extra files that had been created and started over.

Dr-Irv avatar Jun 21 '22 16:06 Dr-Irv

Yea I think the different build types got mixed up when running the build without specifying --config Release though

WillAyd avatar Jun 21 '22 16:06 WillAyd

Yea I think the different build types got mixed up when running the build without specifying --config Release though

I'm still stuck with those errors about numpy.core.multiarray failed to import

Dr-Irv avatar Jun 21 '22 17:06 Dr-Irv

Definitely on the tip of this branch right? There were some previous issues where we would build an internal json module and then import json From numpy or Cython would pick that up instead of the stdlib. Seems similar to the traceback you have. Worth double checking there is no json.dll in your pandas._libs directory

WillAyd avatar Jun 21 '22 17:06 WillAyd