torch.cuda.is_available() aborts after loading the omnitrace module
Before loading omnitrace:
(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.hip
'5.6.31061-8c743ae5d'
>>> torch.cuda.is_available()
True
After loading omnitrace/1.10.4:
(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> module load omnitrace/1.10.4
Using ROCm installation: /opt/rocm-5.6.0
(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> module list
Currently Loaded Modules:
  1) craype-x86-trento                        7) cce/15.0.0             13) darshan-runtime/3.4.0
  2) libfabric/1.15.2.0                       8) craype/2.7.19          14) hsi/default
  3) craype-network-ofi                       9) cray-dsmml/0.2.2       15) DefApps/default
  4) perftools-base/22.12.0                  10) cray-mpich/8.1.23      16) tmux/3.2a
  5) xpmem/2.6.2-2.5_2.22__gd067c3f.shasta   11) cray-libsci/22.12.1.1  17) rocm/5.6.0
  6) cray-pmi/6.1.8                          12) PrgEnv-cray/8.3.3      18) omnitrace/1.10.4
(gpt-neox-rocm5.6.0) langx@frontier07915:/lustre/orion/csc549/scratch/langx> python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.hip
'5.6.31061-8c743ae5d'
>>> torch.cuda.is_available()
Aborted
PyTorch version: 2.1.2+rocm5.6.0, Omnitrace: 1.10.4
Is there something that needs to be checked first?
I have attached my rocminfo output. Note that the get_power_avg() functions are not supported on MI250X, so they are reported as an unsupported feature. However, outputting functions.json still hangs at the end.
rocminfo.log
Thanks in advance!
I have attached my rocminfo output, note that since on MI250X we don't support the get_power_avg() functions, it is reflected as an unsupported feature, however, outputting functions.json still hangs at the end. rocminfo.log
This was fixed in #331 and included in the v1.11.1 release.
However, I don’t think this is related to your problem at all. Could you run module show on that omnitrace module, and compare the environment before and after loading it? I suspect the module changes LD_LIBRARY_PATH and PYTHONPATH when it gets loaded.
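A minimal sketch of that before/after comparison, in case it helps. It assumes the environment was dumped to two text files with something like `env > env_before.txt` before `module load omnitrace/1.10.4` and `env > env_after.txt` afterwards. The file names and the helper names `parse_env` and `diff_env` are made up for illustration and are not part of omnitrace or the module system.

```python
# Hypothetical helper for comparing two `env` dumps; not part of omnitrace.

# Colon-separated variables, compared entry by entry rather than as one string.
PATH_LIKE = ("PATH", "LD_LIBRARY_PATH", "PYTHONPATH")


def parse_env(text):
    """Parse `env` output (one NAME=value per line) into a dict.

    Limitation: values that span several lines (e.g. exported shell
    functions) are not handled; lines without '=' are skipped.
    """
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(before, after):
    """Return only the variables that differ between the two dicts."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old == new:
            continue
        if name in PATH_LIKE:
            old_parts = old.split(":") if old else []
            new_parts = new.split(":") if new else []
            changes[name] = {
                "added": [p for p in new_parts if p not in old_parts],
                "removed": [p for p in old_parts if p not in new_parts],
            }
        else:
            changes[name] = {"before": old, "after": new}
    return changes
```

To use it, read both dumps with `open(...).read()`, pass each through `parse_env`, and print the result of `diff_env(before, after)`. Any new entries under LD_LIBRARY_PATH or PYTHONPATH are the first things to look at, since a library or Python package picked up from the module's paths instead of the conda environment could plausibly explain the abort.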
Hi @R0n12, do you still need assistance with this ticket? If not, please close the ticket. Thanks!
Hi @R0n12. Closing the ticket due to lack of activity. Please feel free to re-open it if you still need assistance. Thanks!