NNPOps icon indicating copy to clipboard operation
NNPOps copied to clipboard

[WIP] Run CI on macOS

Open peastman opened this issue 2 years ago • 21 comments

Fixes #62

peastman avatar Jul 28 '22 20:07 peastman

@peastman thanks for working on this, ping me if you run into any issues with missing packages

mikemhenry avatar Jul 28 '22 22:07 mikemhenry

The Linux CUDA builds are failing with an error that I don't think is related to any of my changes? It looks like cuaev is compiled with an incompatible ABI.

11: E   ImportError: /usr/share/miniconda3/envs/nnpops/lib/python3.7/site-packages/torchani/cuaev.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN3c108ListType3getENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_4Type24SingletonOrSharedTypePtrIS7_EE

The Mac Intel build is failing at the setup stage:

ResolvePackageNotFound: 
  - pytorch-cpu==1.11.0=1.11.0
  - python=3.10[build=3.10.*]

It's not clear to me what's causing that. Checking the file list at https://anaconda.org/conda-forge/pytorch-cpu/files I see the file osx-64/pytorch-cpu-1.11.0-cpu_py310hde81af8_2.tar.bz2 which looks like the correct one?

To get builds running on ARM Mac, we need to get the M1 runner working again.

peastman avatar Jul 28 '22 22:07 peastman

I have feeling the failures on Linux related to https://github.com/conda-forge/openmm-torch-feedstock/issues/20. So, I'll try Mamba for the CI.

raimis avatar Aug 01 '22 14:08 raimis

The CI issue is fixed: https://github.com/openmm/NNPOps/pull/64

raimis avatar Aug 01 '22 14:08 raimis

Nice! I don't have write access to this repo, but I'm guessing that once we merge the changes in from main things will pass.

mikemhenry avatar Aug 01 '22 17:08 mikemhenry

Thanks! That fixed it. So now we just have the problem with it not finding the pytorch package. Any idea what's up with that?

peastman avatar Aug 01 '22 18:08 peastman

Weird, it looks like it exists: https://anaconda.org/conda-forge/pytorch-cpu/1.11.0/download/osx-64/pytorch-cpu-1.11.0-cpu_py310hde81af8_2.tar.bz2 (looks like osx intel 1.11.0 with python 3.10) @peastman could you try bumping the python version from 3.10 to 3.9? My guess is that there is something missing for the python 3.10 build.

Looking for: ["cmake[version='>=3.20']", 'make', 'mdtraj', 'torchani==2.2.2', 'pytest', 'python=3.10[build=3.10.*]', 'pytorch-cpu==1.11.0=1.11.0', 'sysroot_linux-64==2.17']


Encountered problems while solving:
  - nothing provides requested python 3.10* 3.10.*
  - nothing provides requested pytorch-cpu ==1.11.0 1.11.0

Unless the issue is with this syntax, which looks suspect to me pytorch-cpu==1.11.0=1.11.0 since it looks like it is looking for version 1.11.0 and build version 1.11.0 which I know doesn't exist.

mikemhenry avatar Aug 01 '22 19:08 mikemhenry

On an Intel Mac, I created an environment with Python 3.10. Then I typed the following command:

mamba install -c conda-forge cmake make mdtraj torchani=2.2.2 pytest pytorch-cpu=1.11.0 python=3.10

and it worked without problem. So why does it fail on CI???

peastman avatar Aug 01 '22 20:08 peastman

Is there something going on with sed shenanigans? pytorch-cpu==1.11.0=1.11.0 is still sus to me. Perhaps echo/print the modified environment.yaml and see if you can use that to make the env?

mikemhenry avatar Aug 01 '22 20:08 mikemhenry

I feel like something is off since it also can't find python 3.10 nothing provides requested python 3.10* 3.10.*

mikemhenry avatar Aug 01 '22 20:08 mikemhenry

You're right. I had to rewrite the sed command because it works differently on Mac than on Linux. It's leading to duplicated version numbers.

peastman avatar Aug 01 '22 20:08 peastman

Progress! Here's the error it fails with now.

[ 69%] Linking CXX shared library libNNPOpsPyTorch.dylib
ld: library not found for -lmkl_intel_ilp64

It also reports what look like some genuine errors in the code:

/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:101:29: warning: equality comparison result unused [-Wunused-comparison]
                activation_ == ::CFConv::ShiftedSoftplus;
                ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:101:29: note: use '=' to turn this equality comparison into an assignment
                activation_ == ::CFConv::ShiftedSoftplus;
                            ^~
                            =
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:103:29: warning: equality comparison result unused [-Wunused-comparison]
                activation_ == ::CFConv::Tanh;
                ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
/Users/runner/work/NNPOps/NNPOps/src/pytorch/CFConv.cpp:103:29: note: use '=' to turn this equality comparison into an assignment
                activation_ == ::CFConv::Tanh;
                            ^~
                            =

peastman avatar Aug 01 '22 21:08 peastman

Yes, the warnings have to be fixed.

raimis avatar Aug 02 '22 14:08 raimis

I'm not a cmake expert, but I'm guessing we need to modify cmake to add the location of mkl_intel_ilp64 to LIBRARY_PATH? I don't have an x86 mac to do this locally. I did naively try adding mkl to the dependencies (in case it wasn't getting pulled in on osx) on my fork that but that didn't fix the issue.

mikemhenry avatar Aug 02 '22 17:08 mikemhenry

We don't directly use MKL anywhere. It's coming from PyTorch. The build scripts for PyTorch are designed to locate lots of different BLAS implementations and use whichever one is available. The conda packages for pytorch list a BLAS implementation as a dependency. Which one depends on the platform, MKL on some and OpenBLAS on others.

peastman avatar Aug 02 '22 17:08 peastman

What is interesting is I get a different linking error when building on conda-forge CI ld: warning: dylib libmkl_intel_ilp64.dylib was built for newer macOS version (10.14) than being linked (10.9) Which means on conda-forge at least I will need to bump the macOS SKD that is being used (at least I think)

mikemhenry avatar Aug 02 '22 18:08 mikemhenry

That would make sense. 10.9 is a really old version, from 2013. It sounds like PyTorch requires a newer one.

peastman avatar Aug 02 '22 18:08 peastman

The log shows the build is run on macOS 11.6.8. We don't specify a value for CMAKE_OSX_DEPLOYMENT_TARGET, so I believe it should be building for that version.

peastman avatar Aug 05 '22 21:08 peastman

CMAKE_OSX_DEPLOYMENT_TARGET is set to 11.6 as we'd expect. CMake is successfully locating the library in /Users/runner/miniconda3/envs/nnpops/lib/libmkl_intel_ilp64.dylib. It's in the environment's lib directory, which should automatically be included in the link path. Strange.

peastman avatar Aug 07 '22 20:08 peastman

I tried adding export DYLD_LIBRARY_PATH=/Users/runner/miniconda3/envs/nnpops/lib to ensure it would find the library. I now get a different error.

dyld: Symbol not found: _catlas_dset
  Referenced from: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLinearAlgebra.dylib
  Expected in: /Users/runner/miniconda3/envs/nnpops/lib/libBLAS.dylib
 in /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLinearAlgebra.dylib

The symbol _catlas_dset comes from the libBLAS.dylib provided by Apple as part of the Accelerate framework built into the OS. We seem to somehow be getting conflicts between two different BLAS implementations, one from Apple and one from Intel.

peastman avatar Aug 08 '22 00:08 peastman

We seem to somehow be getting conflicts between two different BLAS implementations, one from Apple and one from Intel

It looks like there are some things we can try here: https://conda-forge.org/docs/maintainer/knowledge_base.html#blas

mikemhenry avatar Aug 12 '22 17:08 mikemhenry