easybuild-easyconfigs

{tools}[gfbf/2023a] jax v0.4.25 w/ CUDA 12.1.1

Open ThomasHoffmann77 opened this issue 1 year ago • 79 comments

(created using eb --new-pr) requires:

  • [x] #20707

Edit: requires a bug fix in the framework for "cp %s %(builddir)s/archives" to work as an extract command:

  • https://github.com/easybuilders/easybuild-framework/pull/4532
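
For readers unfamiliar with the mechanism, here is a minimal sketch of where such an extract command could sit in an easyconfig (easyconfigs use Python syntax). The archive name is a hypothetical placeholder and is not taken from this PR; `%s` stands for the source file and `%(builddir)s` is an EasyBuild template.

```python
# Easyconfig fragment, for illustration only.
# The filename below is a hypothetical placeholder, not from this PR.
sources = [
    {
        'filename': 'example-dependency.tar.gz',  # hypothetical archive name
        # Copy the archive into the build directory instead of unpacking it.
        'extract_cmd': 'cp %s %(builddir)s/archives',
    },
]
```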

ThomasHoffmann77 avatar Mar 14 '24 15:03 ThomasHoffmann77

Test report by @thomashoffmann77 FAILED Build succeeded for 1 out of 2 (2 easyconfigs in total) srv-mahamid-01.embl.de - Linux AlmaLinux 8.8, x86_64, AMD EPYC 7513 32-Core Processor, 2 x NVIDIA NVIDIA GeForce RTX 3090, 535.113.01, Python 3.6.8 See https://gist.github.com/ThomasHoffmann77/43d87811306655a013126860c0bb6777 for a full test report.

ThomasHoffmann77 avatar Mar 14 '24 15:03 ThomasHoffmann77

Test report by @thomashoffmann77 FAILED Build succeeded (with --ignore-test-failure) for 1 out of 2 (2 easyconfigs in total) srv-mahamid-01.embl.de - Linux AlmaLinux 8.8, x86_64, AMD EPYC 7513 32-Core Processor, 2 x NVIDIA NVIDIA GeForce RTX 3090, 535.113.01, Python 3.6.8 See https://gist.github.com/ThomasHoffmann77/c51c43986eae5a7afe56f715d7c5c38c for a full test report.

ThomasHoffmann77 avatar Mar 14 '24 16:03 ThomasHoffmann77

Test report by @thomashoffmann77 SUCCESS Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total) proline - Linux AlmaLinux 8.8, x86_64, 12th Gen Intel(R) Core(TM) i7-12700, 1 x NVIDIA NVIDIA RTX A4000, 535.113.01, Python 3.6.8 See https://gist.github.com/ThomasHoffmann77/b2b075b38d9d9d5c6fe4b0503dab7279 for a full test report.

ThomasHoffmann77 avatar Mar 14 '24 18:03 ThomasHoffmann77

Test report by @branfosj SUCCESS Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total) bear-pg0208u15a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-40GB, 535.154.05, Python 3.6.8 See https://gist.github.com/branfosj/83b07adf11f9a9eea619d5b7e45eddb5 for a full test report.

Same three failures as https://github.com/easybuilders/easybuild-easyconfigs/pull/19841#issuecomment-1950232656

branfosj avatar Mar 14 '24 21:03 branfosj

Test report by @branfosj SUCCESS Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total) bear-pg0208u31a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 4 x NVIDIA NVIDIA A100-SXM4-40GB, 535.154.05, Python 3.6.8 See https://gist.github.com/branfosj/bec290f9c00aa6309ee649e8ff185675 for a full test report.

Same three failures as https://github.com/easybuilders/easybuild-easyconfigs/pull/19841#issuecomment-1950232656

branfosj avatar Mar 15 '24 00:03 branfosj

Test report by @thomashoffmann77 SUCCESS Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total) proline - Linux AlmaLinux 8.8, x86_64, 12th Gen Intel(R) Core(TM) i7-12700, 1 x NVIDIA NVIDIA RTX A4000, 535.113.01, Python 3.6.8 See https://gist.github.com/ThomasHoffmann77/59e7a52712f520a524e93b5b5210551b for a full test report.

ThomasHoffmann77 avatar Mar 15 '24 13:03 ThomasHoffmann77

I don't have a build node set up to upload test reports. Did see this test error:

tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_1x4_float32_float64 PASSED [ 55%]
tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32 Fatal Python error: Aborted

verdurin avatar Mar 15 '24 15:03 verdurin

I see you're all building with --ignore-test-failure - is that expected with jax?

verdurin avatar Mar 15 '24 15:03 verdurin

> I don't have a build node set up to upload test reports. Did see this test error:
>
> tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_1x4_float32_float64 PASSED [ 55%]
> tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32 Fatal Python error: Aborted

#16:38 thoffman@srv-mahamid-01# NVIDIA_TF32_OVERRIDE=0 CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_ALLOCATOR=platform JAX_ENABLE_X64=true pytest -vv tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32
============================= test session starts ==============================
platform linux -- Python 3.11.3, pytest-7.4.2, pluggy-1.2.0 -- /g/easybuild/x86_64/Rocky/8/rome/software/Python/3.11.3-GCCcore-12.3.0/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/tmp/jax-jax-v0.4.25/.hypothesis/examples'))
rootdir: /tmp/jax-jax-v0.4.25
configfile: pyproject.toml
plugins: xdist-3.3.1, hypothesis-6.88.1
collected 1 item                                                               

tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32 PASSED [100%]
============================== 1 passed in 3.21s ===============================
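
The isolated run above can be scripted. The following is only a sketch, assuming `pytest` is installed for the current interpreter and that the script is started from the jax source directory; the helper names `isolated_test_command` and `run_isolated` are made up for illustration. The environment variables are copied from the command shown above.

```python
import os
import subprocess
import sys

# Environment settings copied from the command line shown above.
TEST_ENV = {
    "NVIDIA_TF32_OVERRIDE": "0",
    "CUDA_VISIBLE_DEVICES": "0",
    "XLA_PYTHON_CLIENT_ALLOCATOR": "platform",
    "JAX_ENABLE_X64": "true",
}


def isolated_test_command(test_id):
    """Return (argv, env) for running one pytest test in its own process."""
    argv = [sys.executable, "-m", "pytest", "-vv", test_id]
    # Start from the current environment and override the test settings.
    env = {**os.environ, **TEST_ENV}
    return argv, env


def run_isolated(test_id, cwd="."):
    """Run the test in a fresh interpreter and return the pytest exit code."""
    argv, env = isolated_test_command(test_id)
    return subprocess.run(argv, env=env, cwd=cwd).returncode
```

Running each suspect test in a fresh process keeps a crash in one test from taking down the rest of the session.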

ThomasHoffmann77 avatar Mar 15 '24 15:03 ThomasHoffmann77

Test report by @thomashoffmann77 SUCCESS Build succeeded for 2 out of 2 (2 easyconfigs in total) srv-mahamid-01.embl.de - Linux AlmaLinux 8.8, x86_64, AMD EPYC 7513 32-Core Processor, 2 x NVIDIA NVIDIA GeForce RTX 3090, 535.113.01, Python 3.6.8 See https://gist.github.com/ThomasHoffmann77/37e79d6b1006b4e8bee5438a97ef2ccd for a full test report.

ThomasHoffmann77 avatar Mar 19 '24 13:03 ThomasHoffmann77

Test report by @thomashoffmann77 SUCCESS Build succeeded for 2 out of 2 (2 easyconfigs in total) proline - Linux AlmaLinux 8.8, x86_64, 12th Gen Intel(R) Core(TM) i7-12700, 1 x NVIDIA NVIDIA RTX A4000, 535.113.01, Python 3.6.8 See https://gist.github.com/ThomasHoffmann77/61812c0e50d74c911c9d72e03155eac6 for a full test report.

ThomasHoffmann77 avatar Mar 19 '24 14:03 ThomasHoffmann77

Test report by @Flamefire FAILED Build succeeded for 6 out of 7 (2 easyconfigs in total) n1438 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13 See https://gist.github.com/Flamefire/ee41d9059916ce8b1f93b9267d0c847f for a full test report.

Flamefire avatar Mar 22 '24 13:03 Flamefire

Test report by @Flamefire FAILED Build succeeded for 1 out of 2 (2 easyconfigs in total) i8002 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.8.13 See https://gist.github.com/Flamefire/327109d42642f3d3ed5c28565c08f20b for a full test report.

Flamefire avatar Mar 22 '24 13:03 Flamefire

In both cases the failure is:

external/upb/upb/table.c: In function upb_inttable_pop:
external/upb/upb/table.c:588:10: error: val.val may be used uninitialized [-Werror=maybe-uninitialized]
  588 |   return val;
      |          ^~~
external/upb/upb/table.c:585:13: note: val.val was declared here
  585 |   upb_value val;
      |             ^~~

Due to -Werror added here

XLA comes with even more dependencies (workspace*.bzl). Can we add them as local repositories too? Maybe even auto-generate those lists via a Python script or so, similar to e.g. findPythonDeps, which outputs a list of Python packages for use in an EC. That script is bundled with EasyBuild, so it is readily available.

Flamefire avatar Mar 22 '24 14:03 Flamefire

Test report by @Flamefire FAILED Build succeeded for 1 out of 2 (2 easyconfigs in total) n1265 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13 See https://gist.github.com/Flamefire/8cbb16221ab8da073cee85c97c0dd911 for a full test report.

This is caused by a crash. It isn't really clear why it fails or in which test, since the crashing test file passes when I run it manually. Attaching GDB shows ~LogMessageFatal() as the cause. This needs more investigation into what the fatal error actually is, but it looks serious...

Flamefire avatar Mar 26 '24 12:03 Flamefire

> similar to e.g. findPythonDeps which outputs a list of Python packages for use in an EC. That script is bundled with EasyBuild so readily available

@Flamefire is there an example or documentation on how to use it?

ThomasHoffmann77 avatar Mar 27 '24 13:03 ThomasHoffmann77

> > similar to e.g. findPythonDeps which outputs a list of Python packages for use in an EC. That script is bundled with EasyBuild so readily available
>
> @Flamefire is there an example or documentation on how to use it?

Yes: findPythonDeps --help should explain all you need; if not, I'm happy to improve it (or you can ;-) ).
Off the top of my head: findPythonDeps --ec foo-1.23.eb foo==1.23

Flamefire avatar Mar 28 '24 13:03 Flamefire

Test report by @akesandgren SUCCESS Build succeeded for 2 out of 2 (2 easyconfigs in total) b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 545.29.06, Python 3.10.12 See https://gist.github.com/akesandgren/d08ad604ee84517b32551db64fb98aef for a full test report.

akesandgren avatar Mar 28 '24 13:03 akesandgren

Test report by @Flamefire FAILED Build succeeded for 83 out of 84 (2 easyconfigs in total) i7006 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7702 64-Core Processor (zen2), Python 3.8.13 See https://gist.github.com/Flamefire/cd74ee4cfc219de5e77ef36ac511001e for a full test report.

Flamefire avatar Mar 28 '24 21:03 Flamefire

> I don't have a build node set up to upload test reports. Did see this test error:
>
> tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32 Fatal Python error: Aborted

That is the same failure I see: https://github.com/easybuilders/easybuild-easyconfigs/pull/20119#issuecomment-2020270192

Flamefire avatar Apr 02 '24 07:04 Flamefire

Test report by @VRehnberg SUCCESS ml_dtypes Build succeeded for 1 out of 1 (1 easyconfigs in total) alvis1-05 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8 See https://gist.github.com/VRehnberg/51fb53ea9ab9613ab516f2582dd2cd0d for a full test report.

VRehnberg avatar Apr 08 '24 11:04 VRehnberg

Test report by @VRehnberg SUCCESS jax Build succeeded for 1 out of 1 (1 easyconfigs in total) alvis1-05 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8 See https://gist.github.com/VRehnberg/38ee48d43360ec267bd780e39120e84e for a full test report.

VRehnberg avatar Apr 08 '24 14:04 VRehnberg

@ThomasHoffmann77 and @Flamefire if you ignore the single failing test (pytest -k "not testUfuncInputTypes") does the rest work then? How widespread are the issues?

VRehnberg avatar Apr 08 '24 14:04 VRehnberg

> @ThomasHoffmann77 and @Flamefire if you ignore the single failing test (pytest -k "not testUfuncInputTypes") does the rest work then? How widespread are the issues?

@VRehnberg I cannot reproduce this test failure on my system. It might help to add lax_numpy_test.py::NumpyUfuncTests::testUfuncInputTypes763 specifically to the list of isolated tests.

ThomasHoffmann77 avatar Apr 08 '24 15:04 ThomasHoffmann77

> @ThomasHoffmann77 and @Flamefire if you ignore the single failing test (pytest -k "not testUfuncInputTypes") does the rest work then? How widespread are the issues?

Very widespread. I tried to --deselect each failing test file(!), but then it just fails later in the next one. The issue seems to be that too many threads are created, so the system runs out of resources. We have 208 cores (HT), so each ThreadPool it creates has 208 threads.
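
As an illustration of the deselection approach, this sketch assembles `--deselect` arguments for pytest from a list of node IDs. The helper name is made up, and the `tests/` path prefix in the usage example is an assumption; only the test name itself appears in this thread.

```python
def pytest_deselect_args(failing_ids):
    """Build pytest arguments that skip the given test files or node IDs."""
    args = []
    for node_id in failing_ids:
        # pytest accepts --deselect once per node ID (or node ID prefix).
        args += ["--deselect", node_id]
    return args


# Usage example; the "tests/" prefix is assumed, not confirmed in the thread.
example_args = pytest_deselect_args(
    ["tests/lax_numpy_test.py::NumpyUfuncTests::testUfuncInputTypes763"]
)
```

The resulting list can be appended to a `pytest` command line. As noted above, this did not help here, because the failure simply moved to the next test file.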

Flamefire avatar Apr 09 '24 07:04 Flamefire

OMP_NUM_THREADS=2 ?

akesandgren avatar Apr 09 '24 07:04 akesandgren

> OMP_NUM_THREADS=2 ?

That doesn't affect the thread pools created by jax/xla. I found PJRT_NPROC for that but setting PJRT_NPROC=32 in local_test_exports also failed. Currently experimenting with both...

@verdurin How many cores does nproc report on your system?
@ThomasHoffmann77 As it works for you, how many is it on yours?
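
For comparing machines, a small sketch that prints the CPU counts visible from Python. It assumes Linux for the affinity-based count, which roughly corresponds to what `nproc` reports; the function name `cpu_counts` is made up for illustration.

```python
import os


def cpu_counts():
    """Return (logical CPUs on the machine, CPUs usable by this process)."""
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # Respects CPU affinity limits, e.g. those set by a batch scheduler.
        usable = len(os.sched_getaffinity(0))
    else:
        # Platforms without sched_getaffinity: fall back to the total.
        usable = total
    return total, usable


if __name__ == "__main__":
    total, usable = cpu_counts()
    print(f"logical CPUs: {total}, usable by this process: {usable}")
```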

Flamefire avatar Apr 09 '24 10:04 Flamefire

> @verdurin How many cores does nproc report on your system?
> @ThomasHoffmann77 As it works for you, how many is it on yours?

@Flamefire
srv-mahamid-01.embl.de: 64
proline.embl.de: 20

ThomasHoffmann77 avatar Apr 09 '24 10:04 ThomasHoffmann77

Ok, maybe it isn't the number of threads after all. I tried with PJRT_NPROC=2 to create only a small number of threads, well below the counts on the working 64- and 20-core systems, but I still get the same issue. Running out of ideas... Still testing a few different combinations and versions.

Flamefire avatar Apr 11 '24 09:04 Flamefire

I found that there is a difference when running the tests on a machine with or without GPUs. I have a 96 core machine with GPUs and the build succeeds. The 208 core machine without GPUs fails.

  • @verdurin Does your test system have GPUs? If not, that might be relevant.
  • @ThomasHoffmann77 I assume your machine(s) have GPUs?
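
One quick way to answer the GPU question on a build node is sketched below. It assumes the NVIDIA driver's `nvidia-smi` tool is on `PATH` when GPUs are present and treats a missing or failing tool as "no GPUs"; the function name is made up for illustration.

```python
import shutil
import subprocess


def count_nvidia_gpus():
    """Count GPUs listed by `nvidia-smi -L`; return 0 if the tool is missing or fails."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return 0
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return 0
    if result.returncode != 0:
        return 0
    # Each physical GPU is listed on a line starting with "GPU <index>:".
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))


if __name__ == "__main__":
    print(f"NVIDIA GPUs detected: {count_nvidia_gpus()}")
```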

Flamefire avatar Apr 12 '24 08:04 Flamefire