NNPOps
NNPOps copied to clipboard
[WIP] Run CI on macOS
Fixes #62
@peastman thanks for working on this, ping me if you run into any issues with missing packages
The Linux CUDA builds are failing with an error that I don't think is related to any of my changes? It looks like cuaev is compiled with an incompatible ABI.
11: E ImportError: /usr/share/miniconda3/envs/nnpops/lib/python3.7/site-packages/torchani/cuaev.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c108ListType3getENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4Type24SingletonOrSharedTypePtrIS7_EE
The Mac Intel build is failing at the setup stage:
ResolvePackageNotFound:
- pytorch-cpu==1.11.0=1.11.0
- python=3.10[build=3.10.*]
It's not clear to me what's causing that. Checking the file list at https://anaconda.org/conda-forge/pytorch-cpu/files I see the file osx-64/pytorch-cpu-1.11.0-cpu_py310hde81af8_2.tar.bz2
which looks like the correct one?
To get builds running on ARM Mac, we need to get the M1 runner working again.
I have feeling the failures on Linux related to https://github.com/conda-forge/openmm-torch-feedstock/issues/20. So, I'll try Mamba for the CI.
The CI issue is fixed: https://github.com/openmm/NNPOps/pull/64
Nice! I don't have write access to this repo, but I'm guessing that once we merge the changes in from main things will pass.
Thanks! That fixed it. So now we just have the problem with it not finding the pytorch package. Any idea what's up with that?
Weird, it looks like it exists: https://anaconda.org/conda-forge/pytorch-cpu/1.11.0/download/osx-64/pytorch-cpu-1.11.0-cpu_py310hde81af8_2.tar.bz2 (looks like osx intel 1.11.0 with python 3.10) @peastman could you try bumping the python version from 3.10 to 3.9? My guess is that there is something missing for the python 3.10 build.
Looking for: ["cmake[version='>=3.20']", 'make', 'mdtraj', 'torchani==2.2.2', 'pytest', 'python=3.10[build=3.10.*]', 'pytorch-cpu==1.11.0=1.11.0', 'sysroot_linux-64==2.17']
Encountered problems while solving:
- nothing provides requested python 3.10* 3.10.*
- nothing provides requested pytorch-cpu ==1.11.0 1.11.0
Unless the issue is with this syntax, which looks suspect to me pytorch-cpu==1.11.0=1.11.0
since it looks like it is looking for version 1.11.0
and build version 1.11.0
which I know doesn't exist.
On an Intel Mac, I created an environment with Python 3.10. Then I typed the following command:
mamba install -c conda-forge cmake make mdtraj torchani=2.2.2 pytest pytorch-cpu=1.11.0 python=3.10
and it worked without problem. So why does it fail on CI???
Is there something going on with sed
shenanigans? pytorch-cpu==1.11.0=1.11.0
is still sus to me.
Perhaps echo/print the modified environment.yaml and see if you can use that to make the env?
I feel like something is off since it also can't find python 3.10
nothing provides requested python 3.10* 3.10.*
You're right. I had to rewrite the sed
command because it works differently on Mac than on Linux. It's leading to duplicated version numbers.
Progress! Here's the error it fails with now.
[ 69%] Linking CXX shared library libNNPOpsPyTorch.dylib
ld: library not found for -lmkl_intel_ilp64
It also reports what look like some genuine errors in the code:
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:101:29: warning: equality comparison result unused [-Wunused-comparison]
activation_ == ::CFConv::ShiftedSoftplus;
~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:101:29: note: use '=' to turn this equality comparison into an assignment
activation_ == ::CFConv::ShiftedSoftplus;
^~
=
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:103:29: warning: equality comparison result unused [-Wunused-comparison]
activation_ == ::CFConv::Tanh;
~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:103:29: note: use '=' to turn this equality comparison into an assignment
activation_ == ::CFConv::Tanh;
^~
=
Yes, the warnings have to be fixed.
I'm not a cmake expert, but I'm guessing we need to modify cmake to add the location of mkl_intel_ilp64 to LIBRARY_PATH
?
I don't have an x86 mac to do this locally. I did naively try adding mkl
to the dependencies (in case it wasn't getting pulled in on osx) on my fork that but that didn't fix the issue.
We don't directly use MKL anywhere. It's coming from PyTorch. The build scripts for PyTorch are designed to locate lots of different BLAS implementations and use whichever one is available. The conda packages for pytorch
list a BLAS implementation as a dependency. Which one depends on the platform, MKL on some and OpenBLAS on others.
What is interesting is I get a different linking error when building on conda-forge CI
ld: warning: dylib libmkl_intel_ilp64.dylib was built for newer macOS version (10.14) than being linked (10.9)
Which means on conda-forge at least I will need to bump the macOS SKD that is being used (at least I think)
That would make sense. 10.9 is a really old version, from 2013. It sounds like PyTorch requires a newer one.
The log shows the build is run on macOS 11.6.8. We don't specify a value for CMAKE_OSX_DEPLOYMENT_TARGET
, so I believe it should be building for that version.
CMAKE_OSX_DEPLOYMENT_TARGET
is set to 11.6 as we'd expect. CMake is successfully locating the library in /Users/runner/miniconda3/envs/nnpops/lib/libmkl_intel_ilp64.dylib
. It's in the environment's lib
directory, which should automatically be included in the link path. Strange.
I tried adding export DYLD_LIBRARY_PATH=/Users/runner/miniconda3/envs/nnpops/lib
to ensure it would find the library. I now get a different error.
dyld: Symbol not found: _catlas_dset
Referenced from: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLinearAlgebra.dylib
Expected in: /Users/runner/miniconda3/envs/nnpops/lib/libBLAS.dylib
in /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLinearAlgebra.dylib
The symbol _catlas_dset
comes from the libBLAS.dylib
provided by Apple as part of the Accelerate framework built into the OS. We seem to somehow be getting conflicts between two different BLAS implementations, one from Apple and one from Intel.
We seem to somehow be getting conflicts between two different BLAS implementations, one from Apple and one from Intel
It looks like there are some things we can try here: https://conda-forge.org/docs/maintainer/knowledge_base.html#blas