audio icon indicating copy to clipboard operation
audio copied to clipboard

Enable ROCm RNN-T Loss

Open jpvillam-amd opened this issue 3 years ago • 3 comments

Added HIPIFY code and small changes for ROCm. Targeting RNN-T loss.

jpvillam-amd avatar Jun 14 '22 19:06 jpvillam-amd

Hi @jpvillam-amd

Thanks for the contribution. Can you rebase on the latest main branch? We had a bunch of CI breakage but it's been fixed on main branch.

mthrok avatar Jun 15 '22 14:06 mthrok

Hi @jpvillam-amd

Thanks for the contribution. Can you rebase on the latest main branch? We had a bunch of CI breakage but it's been fixed on main branch.

Sure thing

jpvillam-amd avatar Jun 15 '22 15:06 jpvillam-amd

CircleCI is not running. Let me close and reopen to send the web hook again.

mthrok avatar Jun 15 '22 16:06 mthrok

@mthrok has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Feb 23 '23 15:02 facebook-github-bot

@malfet @mthrok It's not clear to me at all if these CI failures need any action from our end, since the failures are CUDA related. Can you please let us know what the next step should be here? Do we need another rebase?

jithunnair-amd avatar Mar 07 '23 17:03 jithunnair-amd

cc @mthrok can you review this pr?

nateanl avatar Mar 07 '23 20:03 nateanl

@malfet @mthrok It's not clear to me at all if these CI failures need any action from our end, since the failures are CUDA related. Can you please let us know what the next step should be here? Do we need another rebase?

Hi @jithunnair-amd

torchaudio team does not have CI or expertise to validate the change, so I am thinking that we merge it given that it does not break existing CPU/CUDA. Could you address my comment? Thanks

mthrok avatar Mar 09 '23 14:03 mthrok

@mthrok I'm taking a look at refactoring the CMake files to address the review comments

jithunnair-amd avatar Mar 16 '23 13:03 jithunnair-amd

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/audio/2485

Note: Links to docs will display an error until the docs builds have been completed.

:heavy_exclamation_mark: 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

:x: 2 New Failures, 4 Unrelated Failures

As of commit 596f7e1a07a5003a4add67cccaf8646168bbd6b1 with merge base 1638efee8ca3ef4f628d6cb1d3ed3af747b20f8c (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Jun 14 '23 15:06 pytorch-bot[bot]

@mthrok please re-review. This PR is now up to date with latest main and the CMakeLists.txt has been improved.

jeffdaily avatar Jun 14 '23 21:06 jeffdaily

@mthrok @malfet Can you please review?

jithunnair-amd avatar Jul 13 '23 06:07 jithunnair-amd

@mthrok has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Jul 13 '23 14:07 facebook-github-bot

@jithunnair-amd

Should we be concerned about ROCm build job failures??

libtorchaudio.so: undefined symbol: __kmpc_fork_call https://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098244?pr=2485 https://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098349?pr=2485

mthrok avatar Jul 13 '23 14:07 mthrok

@jithunnair-amd

Should we be concerned about ROCm build job failures??

libtorchaudio.so: undefined symbol: __kmpc_fork_call https://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098244?pr=2485 https://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098349?pr=2485

Yes, we need to fix those before we can land this PR.

jeffdaily avatar Jul 13 '23 18:07 jeffdaily

@jithunnair-amd Should we be concerned about ROCm build job failures?? libtorchaudio.so: undefined symbol: __kmpc_fork_call https://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098244?pr=2485 https://github.com/pytorch/audio/actions/runs/5539782179/jobs/10111098349?pr=2485

Yes, we need to fix those before we can land this PR.

@jeffdaily @jithunnair-amd Got it. Would you be able to look into them?

mthrok avatar Jul 14 '23 20:07 mthrok

@jeffdaily @jithunnair-amd

Seems like there was ROCm CI migration. Could you rebase or merge upstream? If possible, I would like to include this PR for the upcoming release.

mthrok avatar Aug 16 '23 16:08 mthrok

@jeffdaily @jithunnair-amd

Seems like there was ROCm CI migration. Could you rebase or merge upstream? If possible, I would like to include this PR for the upcoming release.

Done. @mthrok

jeffdaily avatar Aug 17 '23 17:08 jeffdaily

Note that the CMakeLists.txt refactor was reverted. Let's first focus on landing this PR that adds RNN-T for ROCm, then a follow-up PR that modernizes the CMake files.

jeffdaily avatar Aug 17 '23 17:08 jeffdaily

Please guard USE_ROCM with CMake version check (as TorchAudio should be builder on systems with CMake older than 3.21)

Also, can you please add an explanation why explicit #ifdef __HIP_PLATFORM_AMD__ must be explicitly added .cu files? I.e. why can't hipify module take care of those mechanistically changes?

Would you be okay with accepting this PR as-is and then a follow-up PR that revisits the hipify strategy here? It would be nice to get the feature in. The hipify and CMakeLists.txt refactor was causing more trouble than expected, so it was backed out of this PR.

jeffdaily avatar Aug 18 '23 20:08 jeffdaily

Please guard USE_ROCM with CMake version check (as TorchAudio should be builder on systems with CMake older than 3.21) Also, can you please add an explanation why explicit #ifdef __HIP_PLATFORM_AMD__ must be explicitly added .cu files? I.e. why can't hipify module take care of those mechanistically changes?

Would you be okay with accepting this PR as-is and then a follow-up PR that revisits the hipify strategy here? It would be nice to get the feature in. The hipify and CMakeLists.txt refactor was causing more trouble than expected, so it was backed out of this PR.

@jeffdaily Yes, build CI jobs were green, so I will merge this.

mthrok avatar Aug 18 '23 21:08 mthrok

@mthrok has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 18 '23 21:08 facebook-github-bot

@mthrok merged this pull request in pytorch/audio@c5939616ddc17093747e896db6012b1f63792627.

facebook-github-bot avatar Aug 19 '23 05:08 facebook-github-bot

@mthrok This PR introduces a dependency on libomp.so which is available with a standard ROCm installation in /opt/rocm/llvm/lib/libomp.so. However, since it's not packaged with the torchaudio wheel currently, and is also not packaged with the torch wheels, it results in an import-time failure for torchaudio:

>>> import torchaudio
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/torchaudio/__init__.py", line 1, in <module>
    from . import (  # noqa: F401
  File "/usr/local/lib/python3.8/dist-packages/torchaudio/_extension/__init__.py", line 45, in <module>
    _load_lib("libtorchaudio")
  File "/usr/local/lib/python3.8/dist-packages/torchaudio/_extension/utils.py", line 64, in _load_lib
    torch.ops.load_library(path)
  File "/usr/local/lib/python3.8/dist-packages/torch/_ops.py", line 839, in load_library
    ctypes.CDLL(path)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libomp.so: cannot open shared object file: No such file or directory

I'm investigating how to package the libomp.so with the torchaudio wheel for ROCm. If there are any existing instances where you do the same for any build of torchaudio (ROCm or otherwise), please point me to it, so I can re-use the same flow.

cc @jeffdaily @atalman

jithunnair-amd avatar Aug 24 '23 10:08 jithunnair-amd

@jeffdaily @jithunnair-amd how critical this PR is ? maybe we can just rever it ?

atalman avatar Aug 29 '23 00:08 atalman

Hi @jeffdaily @jithunnair-amd - So this PR was reverted due to the packaging issue. I wanted to have this in the upcoming release and I do welcome the re-attempt. As to make the process faster for the future, I invited @jeffdaily for write permission. I hope we can continue and iterate on this feature. Regards

mthrok avatar Sep 07 '23 15:09 mthrok