Work around FunASR kwargs state leaks
Long-Audio Slowdown in FunASR GPU Inference (root cause: `kwargs` state leaks)
What I Observed
- First pass on a 30 min+ recording finishes quickly, but running the same clip again takes almost twice as long (sometimes even longer).
- The GPU stays on `cuda:0` throughout (so it is not a device/backend issue); the slowdown persists until the process is restarted.
Root Cause
- FunASR's `AutoModel` keeps runtime configuration (`kwargs`, `vad_kwargs`, `punc_kwargs`, `spk_kwargs`, etc.) in mutable dictionaries.
- Long inferences mutate those dicts (e.g., `torch_threads` grows from the default 4 to the host's 72 threads on my server, slowing down inference). FunASR never resets them, so the next request inherits the "dirty" state and slows down; see the sketch below.
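To make the failure mode concrete, here is a minimal, self-contained sketch of the leak pattern. The dict and the `ncpu` mutation are illustrative stand-ins, not FunASR's actual code:

```python
import torch

# Illustrative stand-in for AutoModel's mutable runtime config.
kwargs = {"ncpu": 4, "batch_size": 60}

def generate(audio_path, **overrides):
    # The long-audio path tweaks the shared dict in place...
    kwargs.update(overrides)
    # ...and applies the (possibly drifted) value globally.
    torch.set_num_threads(kwargs["ncpu"])
    # ... actual inference would run here ...

generate("long.wav", ncpu=72)  # internal logic bumps the thread count
generate("long.wav")           # next call silently inherits ncpu=72
```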
Fix
- Snapshot every `*_kwargs` right after `AutoModel` builds its modules, and restore that baseline before each inference (including the VAD, punctuation, and diarization modules).
- Reapply the intended values such as `ncpu`, and only call `torch.set_num_threads()` when the value actually changes, preventing thread drift.
- Result: long recordings can be processed repeatedly without the default params getting contaminated. A sketch of the pattern follows this list.
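A minimal sketch of the snapshot/restore pattern described above. It assumes `AutoModel`-style attributes named `kwargs`, `vad_kwargs`, `punc_kwargs`, and `spk_kwargs`; the class name `KwargsGuard` and the wiring are illustrative, not the PR's exact code:

```python
import copy

import torch

class KwargsGuard:
    """Snapshot a model's *_kwargs dicts and restore them before each inference."""

    _KWARG_ATTRS = ("kwargs", "vad_kwargs", "punc_kwargs", "spk_kwargs")

    def __init__(self, model):
        self.model = model
        # Take the snapshot right after AutoModel builds its modules,
        # while the configuration is still clean.
        self._baseline = {
            name: copy.deepcopy(getattr(model, name))
            for name in self._KWARG_ATTRS
            if isinstance(getattr(model, name, None), dict)
        }

    def restore(self):
        # Rebuild each dict in place so internal references stay valid.
        for name, snapshot in self._baseline.items():
            live = getattr(self.model, name)
            live.clear()
            live.update(copy.deepcopy(snapshot))
        # Reapply the intended thread count, but only touch the global
        # setting when it has actually drifted.
        ncpu = int(self._baseline.get("kwargs", {}).get("ncpu", 4))
        if torch.get_num_threads() != ncpu:
            torch.set_num_threads(ncpu)
```

Calling `restore()` at the top of every inference wrapper keeps a 72-thread drift from one long recording from leaking into the next one.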
FunASR Long-Audio GPU Inference Slowdown (root cause: at initialization, `AutoModel` puts all runtime configuration into one shared, mutable `kwargs` dictionary; during multi-model inference the internal logic modifies that dictionary in place, e.g., adjusting `batch_size` and `ncpu`, and never restores the original values after inference)
Observed Behavior
- On recordings longer than 30 minutes, the first inference is fast, but a second inference on the same audio takes almost twice as long, or more.
- The GPU runs on `cuda:0` the whole time, so the inference device is not the problem, yet the degradation persists until the process is restarted.
Root Cause
- FunASR's `AutoModel` stores runtime configuration (`kwargs`, `vad_kwargs`, `punc_kwargs`, `spk_kwargs`, etc.) in mutable dictionaries.
- During long-audio inference these dictionaries get modified (e.g., `ncpu` defaults to 4, but concurrently running internal logic changes `torch_threads`, which ends up at 72 after inference). Because FunASR never restores the defaults, the next request inherits the contaminated state and slows down.
Solution
- Right after `AutoModel` finishes building all its modules, snapshot every `*_kwargs` and restore that baseline before each inference (covering the VAD, punctuation, and speaker-diarization modules).
- Rewrite the intended values such as `ncpu`, and call `torch.set_num_threads()` only when the thread setting has actually changed, preventing thread-count drift.
- Result: long audio can be processed repeatedly without contaminating the default params, and performance stays stable.
Summary of Changes
This pull request resolves a performance degradation issue in FunASR's AutoModel during long audio GPU inferencing. The problem stemmed from mutable runtime configuration dictionaries (kwargs) that were not reset between inference calls, leading to "state leaks" where parameters like the number of CPU threads (ncpu) would drift and negatively impact subsequent runs. The solution involves snapshotting the initial clean configuration of these dictionaries and restoring them before each inference, along with robust management of CPU thread settings, to ensure consistent and stable performance.
Highlights
- State Leak Prevention: Implemented a mechanism to snapshot and restore `kwargs` configurations for `AutoModel` and its submodules (VAD, punctuation, speaker diarization) before each inference, preventing runtime state modifications from affecting subsequent runs.
- CPU Thread Management: Introduced a helper function `_resolve_ncpu` and logic to ensure `ncpu` (number of CPU threads) is consistently applied and reset, only calling `torch.set_num_threads()` when necessary to prevent thread-count drift; a sketch of such a helper follows this list.
- Performance Stability: Addresses a reported issue where long-audio inference performance degraded significantly after the initial run due to `ncpu` state leaks, ensuring stable and consistent performance across multiple inferences.
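The highlights name a `_resolve_ncpu` helper; the sketch below only captures the behavior they describe (resolve the intended value, write it back, and skip redundant `torch.set_num_threads()` calls), not the PR's actual implementation:

```python
import torch

def _resolve_ncpu(kwargs, default=4):
    # Coerce whatever is stored in kwargs to a sane positive int,
    # falling back to the default on bad or missing values.
    try:
        ncpu = int(kwargs.get("ncpu", default))
    except (TypeError, ValueError):
        ncpu = default
    return max(1, ncpu)

def apply_ncpu(kwargs):
    ncpu = _resolve_ncpu(kwargs)
    kwargs["ncpu"] = ncpu  # write the resolved value back into the config
    # Only call into torch when the global setting has actually drifted.
    if torch.get_num_threads() != ncpu:
        torch.set_num_threads(ncpu)
```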
My test with my fork was successful: `pip install --no-cache-dir git+https://github.com/MotorBottle/FunASR.git@main`
Before processing a long audio:
After processing, an unexpected change has happened to the `torch_threads` param:
Re-running the processing, the arg gets reapplied from the stored value (avoiding the contamination):
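The screenshots are not reproduced here, but the same check can be scripted. A sketch using FunASR's public `AutoModel` API (the model names and the audio file are placeholders):

```python
import torch
from funasr import AutoModel

model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad",
                  punc_model="ct-punc", device="cuda:0")

print("threads before:", torch.get_num_threads())
model.generate(input="long_audio.wav")
# Pre-fix this drifted to the host's core count (72 on my server);
# post-fix it is restored to the stored baseline.
print("threads after first run:", torch.get_num_threads())
model.generate(input="long_audio.wav")
print("threads after re-run:", torch.get_num_threads())
```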