Work around FunASR kwargs state leaks
Long-Audio Slowdown in FunASR GPU Inference (root cause: `kwargs` state leaks)
What I Observed
- First pass on a 30 min+ recording finishes quickly, but running the same clip again takes almost twice as long (sometimes even longer).
- The GPU stays on `cuda:0` throughout (so it is not a device/backend issue); the slowdown persists until the process is restarted.
Root Cause
- FunASR's `AutoModel` keeps runtime configuration (`kwargs`, `vad_kwargs`, `punc_kwargs`, `spk_kwargs`, etc.) in mutable dictionaries.
- Long inferences mutate those dicts (e.g., `torch_threads` grows from the default 4 to the host's 72 threads on my server, slowing down inference). FunASR never resets them, so the next request inherits the "dirty" state and slows down; see the sketch below.
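To make the failure mode concrete, here is a minimal, self-contained sketch of the leak pattern. The dict and the `ncpu` mutation are illustrative stand-ins, not FunASR's actual code:

```python
import torch

# Illustrative stand-in for AutoModel's mutable runtime config.
kwargs = {"ncpu": 4, "batch_size": 60}

def generate(audio_path, **overrides):
    # The long-audio path tweaks the shared dict in place...
    kwargs.update(overrides)
    # ...and applies the (possibly drifted) value globally.
    torch.set_num_threads(kwargs["ncpu"])
    # ... actual inference would run here ...

generate("long.wav", ncpu=72)  # internal logic bumps the thread count
generate("long.wav")           # next call silently inherits ncpu=72
```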
Fix
- Snapshot every `*_kwargs` right after `AutoModel` builds its modules, and restore that baseline before each inference (including the VAD, punctuation, and diarization modules).
- Reapply the intended values such as `ncpu`, and only call `torch.set_num_threads()` when the value actually changes, preventing thread drift.
- Result: long recordings can be processed repeatedly without the default params getting contaminated. A sketch of the pattern follows this list.
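A minimal sketch of the snapshot/restore pattern described above. It assumes `AutoModel`-style attributes named `kwargs`, `vad_kwargs`, `punc_kwargs`, and `spk_kwargs`; the class name `KwargsGuard` and the wiring are illustrative, not the PR's exact code:

```python
import copy

import torch

class KwargsGuard:
    """Snapshot a model's *_kwargs dicts and restore them before each inference."""

    _KWARG_ATTRS = ("kwargs", "vad_kwargs", "punc_kwargs", "spk_kwargs")

    def __init__(self, model):
        self.model = model
        # Take the snapshot right after AutoModel builds its modules,
        # while the configuration is still clean.
        self._baseline = {
            name: copy.deepcopy(getattr(model, name))
            for name in self._KWARG_ATTRS
            if isinstance(getattr(model, name, None), dict)
        }

    def restore(self):
        # Rebuild each dict in place so internal references stay valid.
        for name, snapshot in self._baseline.items():
            live = getattr(self.model, name)
            live.clear()
            live.update(copy.deepcopy(snapshot))
        # Reapply the intended thread count, but only touch the global
        # setting when it has actually drifted.
        ncpu = int(self._baseline.get("kwargs", {}).get("ncpu", 4))
        if torch.get_num_threads() != ncpu:
            torch.set_num_threads(ncpu)
```

Calling `restore()` at the top of every inference wrapper keeps a 72-thread drift from one long recording from leaking into the next one.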
FunASR Long-Audio GPU Inference Slowdown (root cause: at initialization, `AutoModel` puts all runtime configuration into one shared, mutable `kwargs` dictionary; during multi-model inference the internal logic modifies that dictionary in place, e.g., adjusting `batch_size` and `ncpu`, and never restores the original values after inference)
Observed Behavior
- On recordings longer than 30 minutes, the first inference is fast, but a second inference on the same audio takes almost twice as long, or more.
- The GPU runs on `cuda:0` the whole time, so the inference device is not the problem, yet the degradation persists until the process is restarted.
Root Cause
- FunASR's `AutoModel` stores runtime configuration (`kwargs`, `vad_kwargs`, `punc_kwargs`, `spk_kwargs`, etc.) in mutable dictionaries.
- During long-audio inference these dictionaries get modified (e.g., `ncpu` defaults to 4, but concurrently running internal logic changes `torch_threads`, which ends up at 72 after inference). Because FunASR never restores the defaults, the next request inherits the contaminated state and slows down.
Solution
- Right after `AutoModel` finishes building all its modules, snapshot every `*_kwargs` and restore that baseline before each inference (covering the VAD, punctuation, and speaker-diarization modules).
- Rewrite the intended values such as `ncpu`, and call `torch.set_num_threads()` only when the thread setting has actually changed, preventing thread-count drift.
- Result: long audio can be processed repeatedly without contaminating the default params, and performance stays stable.
Summary of Changes
This pull request resolves a performance degradation issue in FunASR's AutoModel during long audio GPU inferencing. The problem stemmed from mutable runtime configuration dictionaries (kwargs) that were not reset between inference calls, leading to "state leaks" where parameters like the number of CPU threads (ncpu) would drift and negatively impact subsequent runs. The solution involves snapshotting the initial clean configuration of these dictionaries and restoring them before each inference, along with robust management of CPU thread settings, to ensure consistent and stable performance.
Highlights
- State Leak Prevention: Implemented a mechanism to snapshot and restore `kwargs` configurations for `AutoModel` and its submodules (VAD, punctuation, speaker diarization) before each inference, preventing runtime state modifications from affecting subsequent runs.
- CPU Thread Management: Introduced a helper function `_resolve_ncpu` and logic to ensure `ncpu` (number of CPU threads) is consistently applied and reset, only calling `torch.set_num_threads()` when necessary to prevent thread-count drift; a sketch of such a helper follows this list.
- Performance Stability: Addresses a reported issue where long-audio inference performance degraded significantly after the initial run due to `ncpu` state leaks, ensuring stable and consistent performance across multiple inferences.
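The highlights name a `_resolve_ncpu` helper; the sketch below only captures the behavior they describe (resolve the intended value, write it back, and skip redundant `torch.set_num_threads()` calls), not the PR's actual implementation:

```python
import torch

def _resolve_ncpu(kwargs, default=4):
    # Coerce whatever is stored in kwargs to a sane positive int,
    # falling back to the default on bad or missing values.
    try:
        ncpu = int(kwargs.get("ncpu", default))
    except (TypeError, ValueError):
        ncpu = default
    return max(1, ncpu)

def apply_ncpu(kwargs):
    ncpu = _resolve_ncpu(kwargs)
    kwargs["ncpu"] = ncpu  # write the resolved value back into the config
    # Only call into torch when the global setting has actually drifted.
    if torch.get_num_threads() != ncpu:
        torch.set_num_threads(ncpu)
```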
My test with my fork was successful: `pip install --no-cache-dir git+https://github.com/MotorBottle/FunASR.git@main`
Before processing a long audio:
After processing, an unexpected change has happened to the `torch_threads` param:
Re-running the processing, the arg gets reapplied from the stored value (avoiding the contamination):
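The screenshots are not reproduced here, but the same check can be scripted. A sketch using FunASR's public `AutoModel` API (the model names and the audio file are placeholders):

```python
import torch
from funasr import AutoModel

model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad",
                  punc_model="ct-punc", device="cuda:0")

print("threads before:", torch.get_num_threads())
model.generate(input="long_audio.wav")
# Pre-fix this drifted to the host's core count (72 on my server);
# post-fix it is restored to the stored baseline.
print("threads after first run:", torch.get_num_threads())
model.generate(input="long_audio.wav")
print("threads after re-run:", torch.get_num_threads())
```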