Enable ROCm RNN-T Loss
Added HIPIFY code and small changes for ROCm. Targeting RNN-T loss.
Hi @jpvillam-amd
Thanks for the contribution. Can you rebase on the latest main branch? We had a bunch of CI breakage but it's been fixed on main branch.
Hi @jpvillam-amd
Thanks for the contribution. Can you rebase on the latest main branch? We had a bunch of CI breakage but it's been fixed on main branch.
Sure thing
CircleCI is not running. Let me close and reopen to send the web hook again.
@mthrok has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@malfet @mthrok It's not clear to me at all if these CI failures need any action from our end, since the failures are CUDA related. Can you please let us know what the next step should be here? Do we need another rebase?
cc @mthrok can you review this pr?
@malfet @mthrok It's not clear to me at all if these CI failures need any action from our end, since the failures are CUDA related. Can you please let us know what the next step should be here? Do we need another rebase?
Hi @jithunnair-amd
torchaudio team does not have CI or expertise to validate the change, so I am thinking that we merge it given that it does not break existing CPU/CUDA. Could you address my comment? Thanks
@mthrok I'm taking a look at refactoring the CMake files to address the review comments
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/audio/2485
- :page_facing_up: Preview Python docs built from this PR
Note: Links to docs will display an error until the docs builds have been completed.
:heavy_exclamation_mark: 1 Active SEVs
There are 1 currently active SEVs. If your PR is affected, please view them below:
:x: 2 New Failures, 4 Unrelated Failures
As of commit 596f7e1a07a5003a4add67cccaf8646168bbd6b1 with merge base 1638efee8ca3ef4f628d6cb1d3ed3af747b20f8c ():
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@mthrok please re-review. This PR is now up to date with latest main and the CMakeLists.txt has been improved.
@mthrok @malfet Can you please review?
@mthrok has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@jithunnair-amd
Should we be concerned about ROCm build job failures??
libtorchaudio.so: undefined symbol: __kmpc_fork_call
https://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098244?pr=2485
https://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098349?pr=2485
@jithunnair-amd
Should we be concerned about ROCm build job failures??
libtorchaudio.so: undefined symbol: __kmpc_fork_callhttps://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098244?pr=2485 https://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098349?pr=2485
Yes, we need to fix those before we can land this PR.
@jithunnair-amd Should we be concerned about ROCm build job failures??
libtorchaudio.so: undefined symbol: __kmpc_fork_callhttps://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098244?pr=2485 https://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098349?pr=2485Yes, we need to fix those before we can land this PR.
@jeffdaily @jithunnair-amd Got it. Would you be able to look into them?
@jeffdaily @jithunnair-amd
Seems like there was ROCm CI migration. Could you rebase or merge upstream? If possible, I would like to include this PR for the upcoming release.
@jeffdaily @jithunnair-amd
Seems like there was ROCm CI migration. Could you rebase or merge upstream? If possible, I would like to include this PR for the upcoming release.
Done. @mthrok
Note that the CMakeLists.txt refactor was reverted. Let's first focus on landing this PR that adds RNN-T for ROCm, then a follow-up PR that modernizes the CMake files.
Please guard USE_ROCM with CMake version check (as TorchAudio should be builder on systems with CMake older than 3.21)
Also, can you please add an explanation why explicit
#ifdef __HIP_PLATFORM_AMD__must be explicitly added .cu files? I.e. why can't hipify module take care of those mechanistically changes?
Would you be okay with accepting this PR as-is and then a follow-up PR that revisits the hipify strategy here? It would be nice to get the feature in. The hipify and CMakeLists.txt refactor was causing more trouble than expected, so it was backed out of this PR.
Please guard USE_ROCM with CMake version check (as TorchAudio should be builder on systems with CMake older than 3.21) Also, can you please add an explanation why explicit
#ifdef __HIP_PLATFORM_AMD__must be explicitly added .cu files? I.e. why can't hipify module take care of those mechanistically changes?Would you be okay with accepting this PR as-is and then a follow-up PR that revisits the hipify strategy here? It would be nice to get the feature in. The hipify and CMakeLists.txt refactor was causing more trouble than expected, so it was backed out of this PR.
@jeffdaily Yes, build CI jobs were green, so I will merge this.
@mthrok has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mthrok merged this pull request in pytorch/audio@c5939616ddc17093747e896db6012b1f63792627.
@mthrok This PR introduces a dependency on libomp.so which is available with a standard ROCm installation in /opt/rocm/llvm/lib/libomp.so. However, since it's not packaged with the torchaudio wheel currently, and is also not packaged with the torch wheels, it results in an import-time failure for torchaudio:
>>> import torchaudio
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.8/dist-packages/torchaudio/__init__.py", line 1, in <module>
from . import ( # noqa: F401
File "/usr/local/lib/python3.8/dist-packages/torchaudio/_extension/__init__.py", line 45, in <module>
_load_lib("libtorchaudio")
File "/usr/local/lib/python3.8/dist-packages/torchaudio/_extension/utils.py", line 64, in _load_lib
torch.ops.load_library(path)
File "/usr/local/lib/python3.8/dist-packages/torch/_ops.py", line 839, in load_library
ctypes.CDLL(path)
File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libomp.so: cannot open shared object file: No such file or directory
I'm investigating how to package the libomp.so with the torchaudio wheel for ROCm. If there are any existing instances where you do the same for any build of torchaudio (ROCm or otherwise), please point me to it, so I can re-use the same flow.
cc @jeffdaily @atalman
@jeffdaily @jithunnair-amd how critical this PR is ? maybe we can just rever it ?
Hi @jeffdaily @jithunnair-amd - So this PR was reverted due to the packaging issue. I wanted to have this in the upcoming release and I do welcome the re-attempt. As to make the process faster for the future, I invited @jeffdaily for write permission. I hope we can continue and iterate on this feature. Regards