easybuild-easyconfigs

{ai}[foss/2024a] PyTorch v2.7.1 w/ CUDA 12.6.0

Open Flamefire opened this issue 3 months ago • 18 comments

(created using eb --new-pr)

Requires:

  • [ ] https://github.com/easybuilders/easybuild-easyblocks/pull/3803
  • [x] https://github.com/easybuilders/easybuild-easyblocks/pull/3887
  • [ ] https://github.com/easybuilders/easybuild-easyconfigs/pull/23606
  • [x] https://github.com/easybuilders/easybuild-easyconfigs/pull/23120

I included the easyconfigs from the required easyconfig PRs here for convenience.

Flamefire avatar Sep 19 '25 11:09 Flamefire

Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use --review-pr (and --review-pr-filter / --review-pr-max) locally.

github-actions[bot] avatar Sep 19 '25 11:09 github-actions[bot]

Test report by @Flamefire FAILED Build succeeded for 6 out of 7 (7 easyconfigs in total) i8005 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21 See https://gist.github.com/Flamefire/e092ebca5d265d11b91fd67a83f3af73 for a full test report.

Flamefire avatar Sep 25 '25 08:09 Flamefire

Test report by @Flamefire FAILED Build succeeded for 6 out of 7 (7 easyconfigs in total) c32 - Linux AlmaLinux 9.4, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 560.35.03, Python 3.9.18 See https://gist.github.com/Flamefire/f693e3f4804b88c790344400452a4cec for a full test report.

Flamefire avatar Sep 28 '25 22:09 Flamefire

Test report by @Flamefire FAILED Build succeeded for 6 out of 7 (7 easyconfigs in total) c144 - Linux AlmaLinux 9.4, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 560.35.03, Python 3.9.18 See https://gist.github.com/Flamefire/26d65e755c8535ecce98bd3fa964d59b for a full test report.

Flamefire avatar Oct 08 '25 19:10 Flamefire

Test report by @Flamefire FAILED Build succeeded for 6 out of 7 (7 easyconfigs in total) i8032 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21 See https://gist.github.com/Flamefire/de74d19eb7a953944822b58781747c62 for a full test report.

Flamefire avatar Oct 16 '25 13:10 Flamefire

Test report by @Flamefire FAILED Build succeeded for 6 out of 7 (7 easyconfigs in total) c23 - Linux AlmaLinux 9.4, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 560.35.03, Python 3.9.18 See https://gist.github.com/Flamefire/ab3477dac032ccef529dab68f21959d7 for a full test report.

Flamefire avatar Oct 17 '25 12:10 Flamefire

Test report by @boegel Using easyblocks from PR(s) https://github.com/easybuilders/easybuild-easyblocks/pull/3803 FAILED Build succeeded for 6 out of 7 (7 easyconfigs in total) node4307.litleo.os - Linux RHEL 9.6, x86_64, AMD EPYC 9454P 48-Core Processor (zen4), 1 x NVIDIA NVIDIA H100 NVL, 580.95.05, Python 3.9.21 See https://gist.github.com/boegel/206999229fc7e00980ac347bc1e717fd for a full test report.

boegel avatar Oct 25 '25 10:10 boegel

@Flamefire

Checksum verification for /tmp/eb-eoudzstd/files_pr23923/p/PyTorch/PyTorch-2.7.0_do-not-checkout-nccl.patch using {'PyTorch-2.7.0_do-not-checkout-nccl.patch': 'ad085a15dd36768ad33a934f53dc595da745e01697b44d431f8b70ae9d0eb567'} failed
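For context, a check like this amounts to hashing the patch file and comparing the digest with the value recorded in the easyconfig. Below is a minimal Python sketch of that idea, not EasyBuild's actual implementation; `sha256_of` and `verify` are hypothetical helper names used only for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

If the patch file is edited without updating the recorded checksum (or the other way around), `verify` returns `False`, which corresponds to the failure reported above.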

boegel avatar Oct 25 '25 10:10 boegel

Test report by @boegel Using easyblocks from PR(s) https://github.com/easybuilders/easybuild-easyblocks/pull/3803 FAILED Build succeeded for 6 out of 7 (7 easyconfigs in total) node3308.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21 See https://gist.github.com/boegel/978c2c4a948d5383453e81a82532ff57 for a full test report.

boegel avatar Oct 25 '25 10:10 boegel

The checksum was seemingly changed by mistake. Fixed.

Flamefire avatar Oct 25 '25 10:10 Flamefire

Test report by @boegel Using easyblocks from PR(s) https://github.com/easybuilders/easybuild-easyblocks/pull/3803 FAILED Build succeeded for 6 out of 7 (7 easyconfigs in total) node4307.litleo.os - Linux RHEL 9.6, x86_64, AMD EPYC 9454P 48-Core Processor (zen4), 1 x NVIDIA NVIDIA H100 NVL, 580.95.05, Python 3.9.21 See https://gist.github.com/boegel/e2a8a47106ef7afcf783242dc72cbe5b for a full test report.

boegel avatar Oct 27 '25 03:10 boegel

Test report by @boegel Using easyblocks from PR(s) https://github.com/easybuilders/easybuild-easyblocks/pull/3803 FAILED Build succeeded for 6 out of 7 (7 easyconfigs in total) node3308.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21 See https://gist.github.com/boegel/1e769e4191aacbb019cb496146b909d6 for a full test report.

boegel avatar Oct 27 '25 05:10 boegel

The H100 failures are mostly from inductor/test_cutlass_backend (39 failed, 1 passed, 2 skipped, 0 errors).
I expect most of them to be caused by "BytesWarning". Fixing that needs a rebuild of nvidia-cutlass with the added patch, which I had only added to https://github.com/easybuilders/easybuild-easyconfigs/pull/23606 because I forgot that the EC is also included here.
I have now merged that change into this branch.

With the (now) default of 10 allowed failures, that should be enough for the build to pass.

As for the V100: I already saw more failures on A100, which suggests they don't test on "older" GPUs anymore. If you can attach the log of the test step, I'll take a look at the failures.

Flamefire avatar Oct 27 '25 08:10 Flamefire

Test report by @Flamefire ~~FAILED~~ Build succeeded for 6 out of 7 (7 easyconfigs in total) n1450.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21 See https://gist.github.com/Flamefire/bc6c0f8510f18f3f95f0a1eed3eb848d for a full test report.

SUCCESS on rerun, but uploading the test report failed due to an expired token:

== COMPLETED: Installation ended successfully (took 18 hours 38 mins 21 secs)
== Results of the build can be found in the log file(s) /software/PyTorch/2.7.1-foss-2024a-CUDA-12.6.0/easybuild/easybuild-PyTorch-2.7.1-20251107.054206.log

Flamefire avatar Oct 30 '25 08:10 Flamefire

Test report by @Flamefire SUCCESS Build succeeded for 7 out of 7 (7 easyconfigs in total) c92 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21 See https://gist.github.com/Flamefire/42fc4314e957ad4b757f8fbd40d064dd for a full test report.

Flamefire avatar Oct 30 '25 18:10 Flamefire

Test report by @Flamefire SUCCESS Build succeeded for 7 out of 7 (7 easyconfigs in total) i8018 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21 See https://gist.github.com/Flamefire/cb905230972fee8ccf548435f534eddc for a full test report.

Flamefire avatar Oct 31 '25 09:10 Flamefire

Test report by @boegel Using easyblocks from PR(s) https://github.com/easybuilders/easybuild-easyblocks/pull/3803 FAILED Build succeeded for 6 out of 7 (total: 46 hours 5 mins 4 secs) (7 easyconfigs in total) node3907.accelgor.os - Linux RHEL 9.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 580.95.05, Python 3.9.21 See https://gist.github.com/boegel/3fa8bd784b0d43d7693a47a1abe0a206 for a full test report.

boegel avatar Dec 09 '25 10:12 boegel

4 (of 8) failures are in test_cpu_select_algorithm and test_select_algorithm, which I assume have the same cause. However, the errors are not in the gist, so I can't tell.

Is it possibly this one?

OSError: [Errno 9] Bad file descriptor

If so, I have a patch for that.

In any case, I removed the explicit limit of 6 allowed failures, so the default of 10 now applies, which would make your run pass.
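To make the arithmetic explicit, here is a small Python sketch of the threshold check. It assumes the limit is inclusive (a run passes when the number of failures does not exceed it); `within_allowed_failures` is a hypothetical name, not the easyblock's actual code.

```python
def within_allowed_failures(num_failed, max_allowed):
    """Return True if the number of failures does not exceed the limit.

    Assumption: the limit is inclusive, i.e. exactly max_allowed
    failures still counts as a pass.
    """
    return num_failed <= max_allowed


failures_in_report = 8  # the "(of 8)" failures mentioned above

print(within_allowed_failures(failures_in_report, 6))   # False: old explicit limit
print(within_allowed_failures(failures_in_report, 10))  # True: default limit
```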

Flamefire avatar Dec 09 '25 11:12 Flamefire

@Flamefire Does this help?

FAILED [15.1010s] inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmDynamicShapesCPU::test_linear_with_embedding_dynamic_shapes_batch_size_384_in_features_196_out_features_384_bias_True_cpu_bfloat16 - AssertionError: Scalars are not equal!

Expected 0 but got 1.
Absolute difference: 1
Relative difference: inf
FAILED [8.5177s] inductor/test_select_algorithm.py::TestSelectAlgorithm::test_convolution2 - torch._inductor.exc.InductorError: AssertionError: Incorrect result from choice TritonTemplateCaller(/tmp/eb-znw1uvam/tmpetw7b_dd/w4/cw4ifwbddzp4qwlfg6va3rsd267c6zhsko7zqc27lpclukanhgo7.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, num_stages=5, num_warps=4)

Tensor-likes are not close!

Mismatched elements: 576 / 8704 (6.6%)
Greatest absolute difference: 1.9070932865142822 at index (211, 29) (up to 0.0001 allowed)
Greatest relative difference: 101.29678344726562 at index (86, 25) (up to 0.0001 allowed)
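For anyone reading these messages, the numbers come from an element-wise closeness check. The Python sketch below is a simplified, pure-Python illustration and not PyTorch's implementation. It assumes the usual criterion `|actual - expected| <= atol + rtol * |expected|`, with both tolerances set to 0.0001 as in the log; `compare` is a hypothetical helper name.

```python
def compare(actual, expected, rtol=1e-4, atol=1e-4):
    """Return (mismatched count, greatest abs diff, greatest rel diff)."""
    mismatched = 0
    max_abs = 0.0
    max_rel = 0.0
    for a, e in zip(actual, expected):
        abs_diff = abs(a - e)
        if e != 0:
            rel_diff = abs_diff / abs(e)
        elif abs_diff != 0:
            # A non-zero difference against an expected value of 0
            # gives an infinite relative difference.
            rel_diff = float("inf")
        else:
            rel_diff = 0.0
        max_abs = max(max_abs, abs_diff)
        max_rel = max(max_rel, rel_diff)
        # Assumed closeness criterion: absolute plus scaled relative tolerance.
        if abs_diff > atol + rtol * abs(e):
            mismatched += 1
    return mismatched, max_abs, max_rel


# Mirrors the scalar failure above: expected 0 but got 1.
print(compare([1], [0]))
```

With an expected value of 0 and an actual value of 1, the absolute difference is 1 and the relative difference is infinite, matching the first failure.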

boegel avatar Dec 18 '25 09:12 boegel

This PR is now blocked due to a merge conflict that got introduced via:

  • https://github.com/easybuilders/easybuild-easyconfigs/pull/24793

boegel avatar Dec 18 '25 09:12 boegel

Rebased

Flamefire avatar Dec 18 '25 09:12 Flamefire

FAILED [15.1010s] inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmDynamicShapesCPU::test_linear_with_embedding_dynamic_shapes_batch_size_384_in_features_196_out_features_384_bias_True_cpu_bfloat16 - AssertionError: Scalars are not equal!

That looks different from my error (the "OSError: [Errno 9] Bad file descriptor" mentioned above).

FAILED [8.5177s] inductor/test_select_algorithm.py::TestSelectAlgorithm::test_convolution2 - torch._inductor.exc.InductorError: AssertionError: Incorrect result from choice TritonTemplateCaller(/tmp/eb-znw1uvam/tmpetw7b_dd/w4/cw4ifwbddzp4qwlfg6va3rsd267c6zhsko7zqc27lpclukanhgo7.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, num_stages=5, num_warps=4)

Is that on H100? It looks related to an issue I fixed/skipped with PyTorch-2.9.0_skip-test_convolution1-on-H100.patch, so it can be ignored.

Flamefire avatar Dec 18 '25 10:12 Flamefire