GPT-SoVITS
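Rework the ASR tools
Summary (TL;DR):
- Refactor the ASR tools: keep the existing functionality, add single-file recognition, support CPU, and provide more detailed output;
- Adjust when and in what order the ASR tools are loaded;
- Adjust the layout of the ASR section in the WebUI;
- Before the WebUI runs ASR, apply `os.path.normpath` to the input and output paths to strip redundant separators; fixes #481.
Details: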
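As an illustration of the path normalization in the last bullet, here is a minimal sketch. The helper name `normalize_asr_paths` and the example paths are hypothetical and are not taken from the PR; only `os.path.normpath` itself comes from the description.

```python
import os
from typing import Tuple


def normalize_asr_paths(input_path: str, output_dir: str) -> Tuple[str, str]:
    """Collapse doubled separators and drop trailing ones (hypothetical helper)."""
    return os.path.normpath(input_path), os.path.normpath(output_dir)


# Hypothetical paths with a trailing slash and a doubled separator.
inp, out = normalize_asr_paths("output//slicer_opt/", "output/asr_opt//")

# One place a trailing separator can hurt: basename() of such a path is empty,
# so any file name derived from the folder name would be empty too.
assert os.path.basename("output/slicer_opt/") == ""
assert os.path.basename(inp) == "slicer_opt"
```

Whether an empty folder name is the exact failure reported in #481 is not stated in this description; the sketch only shows why normalizing first is a cheap safeguard.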
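- `tools/asr/config.py`: adds the ASR base class `BaseASR`, which provides the basic structure of an ASR tool:
  - `check_local_model()`: takes a model name, searches `tools/asr/models` and the local cache `~/.cache`, and returns the matching path;
    - Note: the path must contain `model name + separator`, and the model weight file (e.g. `model.bin`) must sit directly in that directory. The check is fairly strict: a model folder with a suffix added to its name will not be recognized;
  - `load_model()`: left empty; loads the model;
  - `inference()`: left empty; runs inference on a single file and returns the text;
  - `inference_file_or_folder()`: runs inference on a file or folder, collects the resulting text, and saves the result. If the output file path already exists, a timestamp is appended automatically to tell the files apart.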
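The structure described above could be sketched roughly as follows. This is not the code from `tools/asr/config.py`: the constructor argument `model_dirs`, the helper `unique_output_path`, and the timestamp format are assumptions made for illustration; only the method names `check_local_model`, `load_model`, and `inference` come from the description.

```python
import os
import time
from typing import List, Optional


class BaseASR:
    """Hypothetical skeleton of the base class; the real one may differ."""

    def __init__(self, model_dirs: List[str]):
        # Directories to search, e.g. tools/asr/models and ~/.cache (assumed).
        self.model_dirs = model_dirs

    def check_local_model(self, model_name: str,
                          weight_file: str = "model.bin") -> Optional[str]:
        """Return a directory whose path contains `model_name + separator`
        and that directly holds `weight_file`, or None if there is none."""
        for root in self.model_dirs:
            for dirpath, _subdirs, files in os.walk(root):
                # Appending a separator makes "large-v3-bak" fail to match
                # "large-v3", which mirrors the strict check described above.
                if model_name + os.sep in dirpath + os.sep and weight_file in files:
                    return dirpath
        return None

    def load_model(self):
        raise NotImplementedError  # filled in by subclasses

    def inference(self, file_path: str) -> str:
        raise NotImplementedError  # filled in by subclasses

    @staticmethod
    def unique_output_path(path: str) -> str:
        """Append a timestamp to the file name if `path` already exists."""
        if not os.path.exists(path):
            return path
        stem, ext = os.path.splitext(path)
        return f"{stem}_{time.strftime('%Y%m%d_%H%M%S')}{ext}"
```

Matching on `model_name + os.sep` also accepts a prefixed folder such as `faster-whisper-large-v3`, while rejecting a folder with anything appended after the model name.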
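- `tools/asr/funasr_asr.py`: adds the `FunASR` class, which inherits from `BaseASR`:
  - `check_local_models()`: checks whether local models exist by calling the base class's `check_local_model()` for each of the three sub-models in turn;
  - `load_model()`: adds more detailed logging while the model loads;
  - `inference()`: recognizes a single file and returns its text;
  - Argument parsing: the input-folder argument `input_folder` is renamed to `input_file_or_folder`, so a single file can be recognized.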
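A file-or-folder argument like the one described above might be handled as in this sketch. The `-i` short flag, the help text, and the helper `collect_inputs` are assumptions; the PR only states that the argument is named `input_file_or_folder`.

```python
import argparse
import os
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="ASR on a file or a folder")
    # Flag spelling is assumed; only the destination name comes from the PR.
    parser.add_argument("-i", "--input_file_or_folder", type=str, required=True,
                        help="Path to one audio file or to a folder of audio files")
    return parser


def collect_inputs(input_file_or_folder: str) -> List[str]:
    """Return the list of files to transcribe for either kind of input."""
    if os.path.isfile(input_file_or_folder):
        return [input_file_or_folder]
    if os.path.isdir(input_file_or_folder):
        # Only direct children are considered; subfolders are skipped.
        return sorted(
            os.path.join(input_file_or_folder, name)
            for name in os.listdir(input_file_or_folder)
            if os.path.isfile(os.path.join(input_file_or_folder, name))
        )
    raise FileNotFoundError(f"Input path does not exist: {input_file_or_folder}")


# Typical use from a script entry point:
#   args = build_parser().parse_args()
#   files = collect_inputs(args.input_file_or_folder)
```

The sketch does not filter by extension; files that are not usable audio would be left to fail at transcription time.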
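- `tools/asr/fasterwhisper_asr.py`: adds the `FasterWhisperASR` class, which inherits from `BaseASR`:
  - The run device `device` and the compute precision `precision` are adapted for CPU;
  - `check_local_models()`: checks whether local models exist by calling the base class's `check_local_model()` for each model size;
    - If the model folder is under the project's `tools/asr/models`, the dropdown shows the `-local` suffix;
    - If the model folder is under the local cache `~/.cache`, the dropdown shows the `-cache` suffix;
    - If neither exists, there is no suffix and the model is downloaded automatically;
  - `load_model()`: adds more detailed logging while the model loads;
  - `inference()`: recognizes a single file. As before, when the detected language is Chinese it calls a `FunASR` instance; the attribute `self.zh_model` keeps that instance from being loaded early or more than once;
  - Removes the restriction to the wav format;
  - Argument parsing: the input-folder argument `input_folder` is renamed to `input_file_or_folder`, so a single file can be recognized.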
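The dropdown suffixes and the CPU fallback described above could look roughly like this. Both function names and their parameters are hypothetical. The `float32` on CPU and `float16` on GPU pairing follows the device tests listed in this PR, but how the real code detects CUDA is not shown there, so the sketch takes it as a plain boolean.

```python
import os
from typing import Optional, Tuple


def label_model_size(size: str, local_path: Optional[str],
                     project_models_dir: str) -> str:
    """Return the dropdown label for one model size (hypothetical helper)."""
    if local_path is None:
        return size  # not found locally: no suffix, downloaded on first use
    project_root = os.path.abspath(project_models_dir)
    inside_project = os.path.abspath(local_path).startswith(project_root + os.sep)
    return f"{size}-local" if inside_project else f"{size}-cache"


def pick_device_and_precision(cuda_available: bool,
                              requested_precision: str = "float16") -> Tuple[str, str]:
    """Fall back to CPU with float32 when no GPU is available (assumed policy)."""
    if cuda_available:
        return "cuda", requested_precision
    return "cpu", "float32"
```

Passing `cuda_available` in as an argument keeps the sketch independent of any particular deep-learning framework.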
- `tools/my_utils.py`: adds the logging class `Tools_Logger`, which handles log output for the tools:
  - Instantiates `ASR_Logger` and sets up the output messages for `BaseASR`, `FunASR`, and `FasterWhisper`.
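The log lines quoted in the tests of this PR follow a `name - LEVEL - message` pattern. A logger producing that shape can be sketched with the standard `logging` module. The factory function `make_tool_logger` is an assumption; the PR describes a `Tools_Logger` class whose actual interface is not shown here.

```python
import logging


def make_tool_logger(name: str) -> logging.Logger:
    """Build a logger that prints 'NAME - LEVEL - message' (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on re-import
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(name)s - %(levelname)s - %(message)s"))
        logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from printing it again
    return logger


ASR_Logger = make_tool_logger("ASR")
ASR_Logger.error("Input path does not exist.")
```

With this format the last call writes `ASR - ERROR - Input path does not exist.` to standard error.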
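- `webui.py`:
  - Adjusts the layout of the ASR section and widens the options area, so the start button is relatively smaller;
  - The input path accepts a single file;
  - Adds an info attribute to the ASR model selectors as a brief hint;
  - Fixes the slow UI startup that was caused by loading the ASR model early, while the configuration was being read;
  - The configuration is refreshed only after an ASR model is selected (the first refresh for FasterWhisper takes about 6 seconds).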
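Deferring a slow model load until it is first needed, as the startup fix above describes, is commonly done with a small lazy wrapper. The class below is a generic sketch, not code from `webui.py`; `LazyModel` and the stand-in loader are hypothetical.

```python
from typing import Callable, Optional


class LazyModel:
    """Call `loader` on first use only, then reuse the result (hypothetical)."""

    def __init__(self, loader: Callable[[], object]):
        self._loader = loader
        self._model: Optional[object] = None

    def get(self) -> object:
        if self._model is None:
            # The expensive load happens here, not when the UI is built.
            self._model = self._loader()
        return self._model


def slow_loader() -> str:
    # Stand-in for a real model load; returns a placeholder object.
    return "asr-model"


asr_model = LazyModel(slow_loader)  # cheap: nothing is loaded yet
```

The same pattern fits the `self.zh_model` attribute mentioned for `FasterWhisperASR`, where the Chinese model should be created once and only when Chinese audio is actually detected.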
- `docs/cn/README.md`: updates the instructions for running ASR from the command line.

Tests:
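- [x] Files (varying the input):
  - [x] Input file/folder path does not exist: `ASR - ERROR - Input path does not exist.`
  - [x] Input file is not usable audio: `ASR - ERROR - Transcription of file {absolute input file path} failed.` → `ASR - ERROR - No transcription result, nothing saved.`
  - [x] Input folder contains no usable audio: `ASR - ERROR - No transcription result, nothing saved.`
  - [x] Input folder contains mixed file types: only successful transcriptions are kept → `ASR - INFO - Task finished -> annotation file path: {absolute output file path}`
- [x] Models (richer printed messages):
  - [x] No local model: `ASR - WARNING - Downloading model: downloading the {model name} model from {URL}`;
  - [x] Local model present: `ASR - INFO - Loading model: loading the {model name} model from {local path}`;
  - [x] Download/load failed: `ASR - ERROR Model loading or download failed; you can download it yourself from {URL} and place it under tools/asr/models/` → `ASR - ERROR - Model does not exist`
- [x] Devices (CPU support):
  - [x] FunASR + CPU: `ASR - INFO - Device: CPU, precision: --`
  - [x] FunASR + GPU: `ASR - INFO - Device: {GPU name}, precision: --`
  - [x] FasterWhisper + CPU: `ASR - INFO - Device: CPU, precision: float32`
  - [x] FasterWhisper + GPU: `ASR - INFO - Device: {GPU name}, precision: float16`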
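TODO:
- [ ] Text is currently written to the .list file as `file path|input folder name|language|text`. The second field should be the `speaker`; does it need to be redefined?
- [ ] Consider improving the log output and documentation of the other tools;
- [ ] Consider exposing the compute precision in the WebUI;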
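For reference, the four-field `.list` line discussed in the first TODO item can be written and read back as in the sketch below. Both helper names and the example values are hypothetical. Splitting with a maximum of three splits is a design choice of this sketch, so that a `|` inside the transcribed text does not break parsing.

```python
from typing import Tuple


def format_list_line(audio_path: str, speaker: str, language: str, text: str) -> str:
    """Join the four fields with '|' (second field is the speaker slot)."""
    return "|".join([audio_path, speaker, language, text])


def parse_list_line(line: str) -> Tuple[str, str, str, str]:
    """Split on the first three '|' only, so the text may itself contain '|'."""
    audio_path, speaker, language, text = line.rstrip("\n").split("|", 3)
    return audio_path, speaker, language, text


# Hypothetical example: the folder name "slice" currently fills the speaker slot.
line = format_list_line("slice/0001.wav", "slice", "EN", "Hello there.")
```

If the second field were redefined as a real speaker name, only the value passed as `speaker` would change; the line layout would stay the same.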
Feedback welcome. Did I change too much? 😂
With the Faster Whisper ASR large model, the generated list file contains many repeated fields, for example:
G:\Datasets\GuYunAGI\VerinAudio\slice\12.flac_0011459520_0011566720.wav|slice|EN|And the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is, and the time the outside world is
If it's convenient, could you pack the audio files into a zip and upload them here?
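The file is 211 MB, over the 25 MB limit, so it looks like I can't upload it. If possible, I'd like to send it over QQ: 3542761846.
Found the cause: it is model hallucination. When whisper runs into a long silence, it keeps repeating the previous phrase or short sentence.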
segments, info = model.transcribe( audio=file, beam_size=5, vad_filter=True, vad_parameters=dict(min_silence_duration_ms=700), condition_on_previous_text=False, suppress_tokens=[], language=language)
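In this call I added two parameters, condition_on_previous_text=False and suppress_tokens=[], to try to suppress the hallucination. A hallucination-suppression option could be added to the WebUI.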
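One way such a WebUI option could feed into the call above is sketched here. The function `build_transcribe_kwargs` and the flag `suppress_hallucination` are hypothetical; the keyword names and values are the ones used in the `model.transcribe(...)` call quoted in this comment.

```python
from typing import Any, Dict


def build_transcribe_kwargs(language: str,
                            suppress_hallucination: bool) -> Dict[str, Any]:
    """Assemble keyword arguments for model.transcribe (hypothetical helper)."""
    kwargs: Dict[str, Any] = dict(
        beam_size=5,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=700),
        language=language,
    )
    if suppress_hallucination:
        # The two extra parameters proposed in the comment above.
        kwargs.update(condition_on_previous_text=False, suppress_tokens=[])
    return kwargs


# Intended use, assuming `model` is an already loaded faster-whisper model:
#   segments, info = model.transcribe(audio=file, **build_transcribe_kwargs(language, True))
```

Keeping the option off by default would leave the existing behavior unchanged for users who do not see the repetition problem.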
Don't you slice the audio before running ASR? Just the clip that shows the problem should be enough.
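Does "long silence" mean the audio contains long silent stretches? With suitable further slicing, this problem shouldn't occur, should it?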
Some of my sliced clips are under three seconds, yet the problem still occurs.
By the way,
segments, info = model.transcribe( audio=file, beam_size=5, vad_filter=True, vad_parameters=dict(min_silence_duration_ms=700), condition_on_previous_text=False, suppress_tokens=[], language=language)
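does not suppress the hallucination very well either, so some post-processing options are needed.
Here are a few of the audio clips that show the problem: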
Uploading slice.zip…
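As a rough idea of what such a post-processing option might do, the sketch below collapses consecutive identical comma-separated phrases, which is the pattern in the example list line quoted earlier. It is a heuristic written for illustration, not something proposed in this PR. The function name is hypothetical, it would also collapse legitimate repetitions such as "no, no", and it rewrites every comma as ", ".

```python
import re


def collapse_repeated_phrases(text: str) -> str:
    """Keep one copy of each run of identical comma-separated phrases."""
    # Split on ASCII and full-width commas, dropping empty pieces.
    phrases = [p.strip() for p in re.split(r"[,，]", text) if p.strip()]
    kept = []
    for phrase in phrases:
        # Compare case-insensitively so "And the..." matches "and the...".
        if kept and phrase.casefold() == kept[-1].casefold():
            continue
        kept.append(phrase)
    return ", ".join(kept)


sample = ("And the time the outside world is, and the time the outside world is, "
          "and the time the outside world is")
cleaned = collapse_repeated_phrases(sample)
```

Here `cleaned` is the single phrase `And the time the outside world is`; repeats that are not split by commas would need a different heuristic.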
The file didn't come through; you need to wait until "Uploading" finishes.
File conflicts; closing the PR.