[Bug]: Realtime mode perf issues with ROCm
Project Version
3.5.0
Platform and OS Version
Linux, ROCm 6.4, gfx1100
Affected Devices
N/A
Existing Issues
No response
What happened?
Mostly creating this to document my findings and start a conversation among AMD users.
Even after tuning the MIOpen FindDB, real-time mode results in high GPU usage and spotty audio. The output is unstable and can't hold for even one second without breaking apart.
By comparison, deiteris/voice-changer uses <20% GPU while providing continuous output at a 128 ms buffer size and 1.6 s extra.
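For reference, the FindDB tuning I mean looks roughly like this; a minimal sketch assuming MIOpen's standard tuning environment variables, with the launch command as a placeholder for however you start real-time mode:

```bash
# Ask MIOpen to exhaustively search for the fastest conv kernels and
# cache the winners in the user FindDB, so later runs skip the search.
export MIOPEN_FIND_MODE=NORMAL
export MIOPEN_FIND_ENFORCE=SEARCH_DB_UPDATE
python app.py  # placeholder launch command; run a session to warm the cache
```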
Steps to reproduce
- Create venv
- Install ROCm versions of torch, torchvision, torchaudio
- Install the remaining requirements from requirements.txt
- Start real-time mode
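A sketch of that setup (assuming the rocm6.4 wheel index matches the installed ROCm; adjust for your version):

```bash
python -m venv .venv
source .venv/bin/activate
# ROCm builds of torch/torchvision/torchaudio from the official PyTorch index
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.4
# everything else from the project's requirements file
# (if requirements.txt pins torch itself, see the workaround discussed below)
pip install -r requirements.txt
```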
Expected behavior
I would expect it to be able to provide continuous output on a high-end card.
Attachments
No response
Screenshots or Videos
No response
Additional Information
No response
I upgraded my GPU from an unsupported 5500 XT to a 9060 XT, which ROCm supports out of the box, and I still have the same problem.
The real-time feature is currently unusable for me, even after tuning MIOpen. Regardless of the performance settings, the output voice always cuts out midway through. In contrast, the original RVC does not have this issue.
Tested again with 3.6.0 using --client audio, ROCm 7.1
- 128 ms chunk size and 1.6 s extra → unusable
- 256 ms chunk size and 1.6 s extra → unusable
- 384 ms chunk size and 1.6 s extra → unusable
- 512 ms chunk size and 1.6 s extra → works and sounds OK but uses 100% of the GPU
- 512 ms chunk size and 0.5 s extra (default setting) → works and sounds OK but uses 100% of the GPU
It reports a latency of around 300 ms, while the actual delay is above 1 s.
I don't have extensive experience with ROCm, but I believe this performance issue has multiple causes. For example, Applio currently lacks ONNX Runtime optimizations for acceleration, and real-time processing does not support FP16, which leads to increased GPU usage.
Regarding the reported latency being lower than the actual delay, you are correct: the displayed latency only reflects the real-time conversion pipeline and does not include the WebSocket transmission pipeline.
@mitsuami-megane are you testing with a TheRock build?
No, what I did is:
- Git cloned tag 3.6.0
- Created the venv
- Commented out `torch`, `torchaudio`, and `torchvision` from `requirements.txt`
- Installed the matching versions of `torch`, `torchaudio`, and `torchvision` using the official PyTorch ROCm index instead
- Installed the rest of the dependencies
This is what I normally do with other torch-based software, and it seems to work well in most cases.
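In shell terms, that amounts to roughly the following (a sketch; the versions installed should match whatever requirements.txt pins):

```bash
# comment out the pinned torch packages so pip doesn't pull non-ROCm builds
sed -i -E 's/^(torch|torchaudio|torchvision)/# \1/' requirements.txt
# install the matching versions from the official PyTorch ROCm index instead
pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/rocm6.4
# then the rest of the dependencies
pip install -r requirements.txt
```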
Where does one find this TheRock build?
This one?
`--index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/`
https://download.pytorch.org/whl/rocm6.4 is what I was using, because it has a PyTorch version matching the one pinned in requirements.txt.
With ROCm 7.1.1. As far as I'm aware, PyTorch targeting older ROCm works fine on newer ROCm installations.
I'm not sure how happy Applio would be on a completely different PyTorch version.
Well, you can try with the index I've provided and report any issues.
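For reference, trying that index would look something like this (a sketch; I believe TheRock publishes separate indexes per GPU family, with gfx120X-all targeting RDNA4 cards such as the 9060 XT, so pick the one matching your card):

```bash
# replace the existing ROCm 6.4 wheels with TheRock nightlies
pip uninstall -y torch torchaudio torchvision
# --pre in case the nightly wheels are tagged as pre-releases
pip install --pre torch torchaudio torchvision \
    --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/
```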