[TASK] Issue finding tensorflow during Install RAVEN libraries for Mac M2

Open yoshiurr-INL opened this issue 1 year ago • 32 comments


Under Discussion Topic

Machine Specification
Equipment: MacBook Pro
OS: Ventura 13.5
Processor: Apple M2 Max
[1 screenshot attached]

Summary of the topic to be discussed with the development team
When installing the RAVEN libraries with "--install", pip cannot find a tensorflow version that satisfies the requirement tensorflow==2.10.*
[3 screenshots attached]

When "--mamba" is used instead, the installation process does not start.
[1 screenshot attached]

Describe the solution you'd like to be implemented
Identify whether this issue is common for Mac systems. Identify whether this issue is common for M1 and M2 chips.

Describe alternatives you've considered
Maybe conda installing tensorflow?


For Change Control Board: Issue Review

This review should occur before any development is performed as a response to this issue.

  • [x] 1. Is it tagged with a type: defect or task?
  • [x] 2. Is it tagged with a priority: critical, normal or minor?
  • [x] 3. If it will impact requirements or requirements tests, is it tagged with requirements?
  • [x] 4. If it is a defect, can it cause wrong results for users? If so an email needs to be sent to the users.
  • [x] 5. Is a rationale provided? (Such as explaining why the improvement is needed or why current code is wrong.)

For Change Control Board: Issue Closure

This review should occur when the issue is imminently going to be closed.

  • [x] 1. If the issue is a defect, is the defect fixed?
  • [x] 2. If the issue is a defect, is the defect tested for in the regression test system? (If not explain why not.)
  • [x] 3. If the issue can impact users, has an email to the users group been written (the email should specify if the defect impacts stable or master)?
  • [x] 4. If the issue is a defect, does it impact the latest release branch? If yes, is there any issue tagged with release (create if needed)?
  • [x] 5. If the issue is being closed without a pull request, has an explanation of why it is being closed been provided?

yoshiurr-INL avatar Jul 27 '23 18:07 yoshiurr-INL

Hm, if you change the line in the dependencies.xml from: <tensorflow source="pip" os='mac,linux'>2.10</tensorflow> to <tensorflow os='mac,linux'>2.10</tensorflow> does it install?

(Note that we do not currently have automated testing on arm64)
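
For anyone scripting this change rather than editing by hand, here is a minimal sketch in Python. It assumes a trimmed-down, hypothetical stand-in for dependencies.xml (the real file has many more entries), and the helper name drop_pip_source is made up for illustration. It only removes the source="pip" attribute from the matching tensorflow entry, which is the edit suggested above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for RAVEN's dependencies.xml.
SAMPLE = """<dependencies>
  <main>
    <numpy>1.22</numpy>
    <tensorflow source="pip" os="mac,linux">2.10</tensorflow>
    <tensorflow source="pip" os="windows">2.10</tensorflow>
  </main>
</dependencies>"""


def drop_pip_source(xml_text, package, os_value):
    """Remove source="pip" from entries of `package` whose os attribute equals `os_value`."""
    root = ET.fromstring(xml_text)
    for node in root.iter(package):
        if node.get("os") == os_value and node.get("source") == "pip":
            del node.attrib["source"]
    return ET.tostring(root, encoding="unicode")


updated = drop_pip_source(SAMPLE, "tensorflow", "mac,linux")
print(updated)
```

The windows entry is left untouched on purpose, so only the mac,linux line changes.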

joshua-cogliati-inl avatar Jul 27 '23 18:07 joshua-cogliati-inl

@joshua-cogliati-inl Joshua, I found the identical issue on my M1 MacBook Pro 13 inch (OS: Ventura 13.5; Processor: Apple M1), just like Ramon experienced.

I tried to edit the dependencies.xml as you suggested, and the conda environment can be established by ./scripts/establish_conda_env.sh --install.

However, after ./build_raven and ./run_tests -j4, 23 tests are marked as "Diff" or "Failed". See the attached log file.

Haoyu log_run_test_j4_20230802.log

wanghy-anl avatar Aug 02 '23 16:08 wanghy-anl

Okay, so we can install it if we switch tensorflow back to conda-forge, but it fails some tests. I think the correct solution for this is probably to switch to a newer version of tensorflow.

joshua-cogliati-inl avatar Aug 02 '23 22:08 joshua-cogliati-inl

Thanks Joshua. Let me know if you have any candidate versions in mind. I can test on my M1 machine (it has been idle recently).

wanghy-anl avatar Aug 03 '23 15:08 wanghy-anl

Tensorflow 2.12 and 2.13 might be worth trying.

joshua-cogliati-inl avatar Aug 03 '23 17:08 joshua-cogliati-inl

I started testing tensorflow 2.12 in https://github.com/idaholab/raven/pull/2138 but we need a few updates for it.

joshua-cogliati-inl avatar Aug 03 '23 17:08 joshua-cogliati-inl

@joshua-cogliati-inl, here are the results:

Using 2.12 (I modified Line 49 of dependencies.xml to <tensorflow os='mac,linux'>2.12</tensorflow>): the conda environment can be established, but there are 14 Failed tests and 16 Diff tests; see the log below. log_run_test_j4_tensorflow_2_12_2023AUG03.log

Using 2.13 (only available through the pip channel; I modified Line 49 of dependencies.xml to <tensorflow source="pip" os='mac,linux'>2.13</tensorflow>): the conda environment can be established, but there are 673 Failed tests; see the log below. log_run_test_j4_tensorflow_2_13_2023AUG03.log

wanghy-anl avatar Aug 03 '23 20:08 wanghy-anl

Hm, for 2.13, something is being done incorrectly:

ImportError: Failed to import grpc on Apple Silicon. On Apple Silicon machines, try `pip uninstall grpcio; conda install grpcio`. Check out https://docs.ray.io/en/master/ray-overview/installation.html#m1-mac-apple-silicon-support for more details.
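
A quick way to reproduce this check outside the test harness is sketched below in Python. The helper names are hypothetical and not part of RAVEN; the sketch only reports whether grpc imports and whether the interpreter is running natively on Apple Silicon.

```python
import importlib
import platform


def is_apple_silicon():
    """True when the interpreter runs natively on an arm64 macOS machine."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def can_import(module_name):
    """Return (ok, message) instead of letting an ImportError propagate."""
    try:
        importlib.import_module(module_name)
        return True, ""
    except ImportError as err:
        return False, str(err)


ok, message = can_import("grpc")
if not ok:
    hint = " (Apple Silicon: consider a conda-provided grpcio)" if is_apple_silicon() else ""
    print(f"grpc import failed: {message}{hint}")
```

Note that an x86_64 Python running under Rosetta reports "x86_64" from platform.machine(), so is_apple_silicon() returns False in that case.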

joshua-cogliati-inl avatar Aug 03 '23 20:08 joshua-cogliati-inl

Is there anything we can do within raven's establish_conda_env.sh script?

wanghy-anl avatar Aug 03 '23 20:08 wanghy-anl

It might be worth adding 'grpcio' as a conda dependency and see if that solves it.

joshua-cogliati-inl avatar Aug 03 '23 20:08 joshua-cogliati-inl

Otherwise, yes, we might need to modify establish_conda_env.sh

joshua-cogliati-inl avatar Aug 03 '23 21:08 joshua-cogliati-inl

I added <grpcio/> to dependencies.xml, and the conda environment can be established, but there are still 14 failed and 16 diff tests. See the dependencies.xml and log attached. dependencies_and_log_2023AUG04.zip

wanghy-anl avatar Aug 04 '23 15:08 wanghy-anl

It looks like a bunch of the diff and failed are because of the tensorflow update. So that is probably the first thing that we need to fix.

joshua-cogliati-inl avatar Aug 07 '23 16:08 joshua-cogliati-inl

Joshua, let me know when you need to test the fix. I can do the test on M1 chip.

wanghy-anl avatar Aug 08 '23 15:08 wanghy-anl

For future reference, these are the changes made to dependencies.xml compared to current devel (scipy is actually updated by a devel change, so we probably do not need to downgrade scipy, also smt was added in devel as well):

--- dependencies.xml	2023-08-28 10:20:41.567497521 -0600
+++ /tmp/.fr-NTKHA2/dependencies.xml	2023-08-04 08:39:21.000000000 -0600
@@ -37,7 +37,7 @@
   <main>
     <h5py/>
     <numpy>1.22</numpy>
-    <scipy>1.9</scipy>
+    <scipy>1.7</scipy>
     <scikit-learn>1.0</scikit-learn>
     <pandas/>
     <!-- Note most versions of xarray work, but some (such as 0.20) don't -->
@@ -46,8 +46,9 @@
     <matplotlib>3.5</matplotlib>
     <statsmodels>0.13</statsmodels>
     <cloudpickle>2.2</cloudpickle>
-    <tensorflow source="pip" os='mac,linux'>2.10</tensorflow>
-    <tensorflow source="pip" os='windows'>2.10</tensorflow>
+    <tensorflow source="pip" os='mac,linux'>2.13</tensorflow>
+    <tensorflow source="pip" os='windows'>2.13</tensorflow>
+    <grpcio/>
     <!-- conda is really slow on windows if the version is not specified.-->
     <python skip_check='True' os='windows'>3.8</python>
     <python skip_check='True' os='mac,linux'>3</python>
@@ -70,7 +71,6 @@
     <!-- redis is needed by ray, but on windows, this seems to need to be explicitly stated -->
     <redis source="pip" os='windows'/>
     <imageio source="pip">2.22</imageio>
-    <smt/>
     <line_profiler optional='True'/>
     <!-- <ete3 optional='True'/> -->
     <pywavelets optional='True'>1.1</pywavelets>

joshua-cogliati-inl avatar Aug 28 '23 16:08 joshua-cogliati-inl

Joshua, is this dependencies.xml in any branch? I can give it a try if you can point me to the correct branch.

wanghy-anl avatar Aug 28 '23 17:08 wanghy-anl

I just used the dependencies.xml file you included in your zip file, and I also just updated the https://github.com/idaholab/raven/pull/2138 with 2.13 instead of 2.12

joshua-cogliati-inl avatar Aug 28 '23 17:08 joshua-cogliati-inl

Thanks, I will wait until #2138 gets merged and then test it on M1 chip.

wanghy-anl avatar Aug 28 '23 17:08 wanghy-anl

The dependencies.xml is on my joshua-cogliati-inl:tensorflow_212 branch, which #2138 uses. It would be useful to know if it fixes things on the M1 chip.

joshua-cogliati-inl avatar Sep 05 '23 18:09 joshua-cogliati-inl

Thanks Joshua, Let me give it a try on M1 chip tonight or tomorrow. I will attach the log file here.

wanghy-anl avatar Sep 05 '23 19:09 wanghy-anl

FYI: If anyone uses the diff for the dependencies.xml, do not remove smt since that will cause newer versions of RAVEN to fail.

joshua-cogliati-inl avatar Sep 06 '23 15:09 joshua-cogliati-inl

On further investigation, smt does not seem to be available for macos amd64: https://pypi.org/project/smt/#files. So we probably do need to change <smt/> to <smt optional='True'/> and put imports that use smt into try/except blocks.
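
A minimal sketch of what such a guarded import could look like in Python. The helper names optional_import and require_smt are hypothetical, not existing RAVEN functions; the idea is simply that the import failure is deferred until a feature that needs smt is actually used.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# None on platforms where smt is not installed or not available.
smt = optional_import("smt")


def require_smt():
    """Call at the point of use so only smt-dependent features fail."""
    if smt is None:
        raise RuntimeError("This feature needs the optional 'smt' package.")
    return smt
```

With this pattern, importing the module that contains the guard succeeds everywhere, and only the code paths that call require_smt() raise when the package is missing.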

joshua-cogliati-inl avatar Sep 06 '23 15:09 joshua-cogliati-inl

Josh, you were correct. I deleted <smt/> in the attached dependencies_a.xml and 694 tests failed on M1 chip. See attached Log_Sep05_2023_a.log. So I re-added <smt source='pip'/> in the attached dependencies_b.xml and it runs better. 19 tests failed. See attached Log_Sep05_2023_b.log. Sep_5_2022_Trials.zip

wanghy-anl avatar Sep 06 '23 16:09 wanghy-anl

Some errors I saw:

File ".../raven/ravenframework/Optimizers/acquisitionFunctions/AcquisitionFunction.py", line 138, in conductAcquisition
    res = sciopt.differential_evolution(optFunc, bounds=self._bounds, polish=self._polish, maxiter=self._maxiter, tol=self._tol,
TypeError: differential_evolution() got an unexpected keyword argument 'vectorized'

File ".../python3.10/site-packages/netCDF4/__init__.py", line 3, in <module>
    from ._netCDF4 import
ImportError: dlopen(.../python3.10/site-packages/netCDF4/_netCDF4.cpython-310-darwin.so, 0x0002): symbol not found in flat namespace '_nc_close'

libc++abi: terminating due to uncaught exception of type boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::overflow_error>>:
Error in function ibeta_derivative<e>(e,e,e): Overflow Error

Also, a bunch of diffs.

I think it is worth trying netcdf 1.6 to see if that fixes the netcdf errors. I think the floating point hardware must be a bit different and causing the overflow error and some of the diffs.
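
The first error says the installed differential_evolution does not accept 'vectorized', which suggests the installed SciPy predates that parameter (the tested dependencies.xml pinned scipy 1.7, while devel pinned 1.9). One way to detect this kind of mismatch up front is to inspect the function signature. The sketch below is in Python and uses hypothetical stand-in functions instead of SciPy itself, so it runs without SciPy installed.

```python
import inspect


def accepts_keyword(func, name):
    """True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call `func`, dropping keyword arguments its signature does not accept."""
    supported = {k: v for k, v in kwargs.items() if accepts_keyword(func, k)}
    return func(*args, **supported)


# Hypothetical stand-ins for an older and a newer optimizer signature.
def old_optimizer(func, bounds, tol=0.01):
    return ("old", tol)


def new_optimizer(func, bounds, tol=0.01, vectorized=False):
    return ("new", vectorized)


print(accepts_keyword(old_optimizer, "vectorized"))
print(accepts_keyword(new_optimizer, "vectorized"))
```

Dropping an unsupported keyword changes behavior silently, so pinning a library version that supports the argument is usually the safer fix; the check is mostly useful for producing a clear error message.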

joshua-cogliati-inl avatar Sep 07 '23 17:09 joshua-cogliati-inl

[like] Congjian Wang reacted to your message:


wangcj05 avatar Sep 07 '23 17:09 wangcj05

So apparently the remaining errors are:

FAILED:
Diff tests/framework/redundantInputs
Diff tests/framework/NDGridProbabilityWeightValue
Diff tests/framework/CodeInterfaceTests/CobraTF/test3
Diff tests/framework/pca_sparseGridCollocation/polyCorrelation
Diff tests/framework/PostProcessors/LimitSurface/testLimitSurfaceIntegralPPWithBoundingError
Diff tests/framework/Optimizers/GeneticAlgorithms/simionescuConstrainedInvLin
Diff tests/framework/Samplers/SparseGrid/normal
Failed tests/framework/Samplers/SparseGrid/betanorm
Failed tests/framework/Samplers/SparseGrid/beta
Diff tests/framework/Samplers/SparseGrid/triangular
Diff tests/framework/pca_adaptive_sgc/test_adaptive_sgc_poly_pca_analytic

PASSED: 778
SKIPPED: 93
FAILED: 11

I think a lot of those come from differences in how arm64 and amd64 handle floating point numbers. (From what I have seen online, I think the basic arithmetic operations (+, -, *, /) give the same results, but things like converting floating point to integer and back are different, as are functions like sin, which will eventually give different results.)
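
If the diffs really are last-digit floating point differences, a tolerance-based comparison would show it. The Python sketch below is only an illustration with hypothetical helper names and made-up tolerances; it is not RAVEN's actual differ.

```python
import math


def values_match(expected, actual, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two floats with a relative and an absolute tolerance."""
    return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)


def first_mismatch(expected_row, actual_row, **tolerances):
    """Index of the first pair outside tolerance, or None if all pairs match."""
    if len(expected_row) != len(actual_row):
        raise ValueError("rows have different lengths")
    for index, (expected, actual) in enumerate(zip(expected_row, actual_row)):
        if not values_match(expected, actual, **tolerances):
            return index
    return None


gold = [1.0, 2.0]
run = [1.0, 2.0000001]
print(first_mismatch(gold, run, rel_tol=1e-9))
print(first_mismatch(gold, run, rel_tol=1e-6))
```

Running a comparison like this on a gold file and an arm64 output with a loosened tolerance would help separate genuine regressions from platform-level rounding differences.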

joshua-cogliati-inl avatar Sep 11 '23 16:09 joshua-cogliati-inl

[like] Congjian Wang reacted to your message:


wangcj05 avatar Sep 11 '23 16:09 wangcj05

Just FYI: (on M2, I had to download and "pip install" smt directly from https://github.com/SMTorg/SMT)

alfoa avatar Sep 25 '23 20:09 alfoa

@alfoa Yes, we are discussing smt at: https://github.com/idaholab/raven/pull/2138#discussion_r1337680697

joshua-cogliati-inl avatar Sep 27 '23 15:09 joshua-cogliati-inl

This issue is partly addressed by PR #2138

wangcj05 avatar Sep 29 '23 16:09 wangcj05