easybuild-easyblocks

create $XDG_CACHE_HOME for PyTorch tests

Open Flamefire opened this issue 2 years ago • 5 comments

The path must exist or PyTorch will show errors/warnings like:

UserWarning: Specified kernel cache directory could not be created! This disables kernel caching.
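As a rough illustration of the idea (not the actual easyblock change), the directory can be created and exported before the tests start. This sketch uses only the Python 3 standard library; the helper name `ensure_xdg_cache_home` and the `xdg-cache` subdirectory are hypothetical, and a real easyblock would more likely use EasyBuild's own file and environment helpers.

```python
import os
import tempfile


def ensure_xdg_cache_home(base_dir):
    """Point $XDG_CACHE_HOME at a directory under base_dir and make sure it exists.

    Hypothetical helper for illustration; the directory name is an assumption.
    """
    cache_dir = os.path.join(base_dir, "xdg-cache")
    # exist_ok=True makes the call safe to repeat (requires Python 3).
    os.makedirs(cache_dir, exist_ok=True)
    # Child processes (e.g. the test runner) inherit this setting.
    os.environ["XDG_CACHE_HOME"] = cache_dir
    return cache_dir


if __name__ == "__main__":
    # Use a throwaway base directory for the demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        print(ensure_xdg_cache_home(tmp))
```

Calling the helper before the test step means the directory PyTorch looks for under `$XDG_CACHE_HOME` has an existing parent, which is the precondition described above.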

Flamefire, Oct 18 '22 09:10

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.10.0-fosscuda-2020b.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusml22 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/c761c3d11bf2a0140a56aa2e933ccefd for a full test report.

Flamefire, Oct 18 '22 13:10

Test report by @boegel

Overview of tested easyconfigs (in order)

  • FAIL (build issue) PyTorch-1.10.0-foss-2021a.eb (partial log available at https://gist.github.com/39d079c7228013daf8a29747912e9e26)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3907.accelgor.os - Linux RHEL 8.4, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/db3002446c8f4cdc98e99d5ba0d5a7e8 for a full test report.

boegel, Oct 19 '22 02:10

Test report by @boegel

Overview of tested easyconfigs (in order)

  • FAIL (build issue) PyTorch-1.9.0-foss-2020b.eb (partial log available at https://gist.github.com/495345009c7c03992e6dd6d100a96e37)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3539.doduo.os - Linux RHEL 8.4, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/9c5ce58f5e0f91440e4e211a565f5234 for a full test report.

boegel, Oct 19 '22 20:10

I'm not sure why the 2 ECs failed for you, but I'm quite certain it's not due to the change here, which should be correct by inspection (and I guess some document may tell us that $XDG_CACHE_HOME must exist, so this fixes a bug).

Especially as the test build on PPC passed, I'd assume this is ok. ;-)

Flamefire, Oct 20 '22 08:10

@Flamefire I agree with you, but I'm being cautious here: we're very close to the next EasyBuild release, and I don't want to merge a PR at the last minute that breaks the installation of PyTorch.

I wouldn't expect that making sure $XDG_CACHE_HOME exists causes trouble, but the behavior does seem slightly different when $XDG_CACHE_HOME exists (kernel caching is not disabled), so it doesn't seem impossible to me that this affects a handful of tests...

boegel, Oct 20 '22 10:10

My test build of PyTorch-1.10.0-fosscuda-2020b.eb hangs on "python -s -c from multiprocessing.resource_tracker import main;main(26)" (for 11h, then I killed it...)

(That was with this easyblock, but I do not believe it is related.) Will try again...

Same problem again without this change.

akesandgren, Oct 28 '22 19:10

Could this now be merged?

Flamefire, Nov 22 '22 13:11

Test report by @branfosj

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.10.0-foss-2021a.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0105u36b.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/f31aeda337f3cec583eed3ac1525dd8d for a full test report.

branfosj, Nov 22 '22 16:11

Test report by @branfosj

Overview of tested easyconfigs (in order)

  • SUCCESS PyTorch-1.9.0-fosscuda-2020b.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0212u17a.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), 1 x NVIDIA Tesla P100-PCIE-16GB, 470.57.02, Python 3.6.8
See https://gist.github.com/ed8b85d63e0b5a7b44b1075285fcf52b for a full test report.

branfosj, Nov 23 '22 16:11

Going in, thanks @Flamefire!

branfosj, Nov 23 '22 17:11