
Noise at the start and end of each chunk when running HuBERT-encoded data through ONNX inference

EbanShen opened this issue · 3 comments

I am testing real-time voice conversion with ONNX using onnx_inference_demo.py. To keep latency low, I split the input audio file into chunks of 6400 PCM samples and feed them into the demo in a loop. After inference, however, there is noise at the beginning and end of each output chunk. [screenshot of the noisy output omitted]

I suspect that HuBERT encodes the head and tail of each chunk into noise because those frames have no reference context. The encoder model I use is vec-768-layer-12.onnx, and the voice model is a dynamic ONNX model exported with RVC's bundled export_onnx.py (following the fix in issue #1830).

Is there a way to eliminate this noise?

EbanShen avatar Aug 23 '24 08:08 EbanShen

Do the samples fed into the model in one call overlap with those of the previous call (similar to STFT overlap)?
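Building on the overlap idea above: a common mitigation is to process overlapping chunks and crossfade the overlap region, so the unreliable edges of each encoded chunk are blended away rather than played back directly. A minimal NumPy sketch, where `infer` is a stand-in for the actual ONNX voice-conversion call and the chunk/overlap sizes are illustrative, not RVC's actual values:

```python
import numpy as np

def stream_with_crossfade(samples, chunk=6400, overlap=1280, infer=lambda x: x):
    """Feed overlapping chunks to `infer` and crossfade each chunk's head
    with the previous chunk's tail, hiding chunk-boundary artifacts.

    `infer` stands in for the ONNX inference call; the identity default
    keeps this sketch self-contained.
    """
    hop = chunk - overlap
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    out = np.zeros(len(samples))
    prev_tail = None
    for start in range(0, len(samples) - chunk + 1, hop):
        # np.array(...) copies, so blending never mutates the input buffer
        y = np.array(infer(samples[start:start + chunk]), dtype=float)
        if prev_tail is not None:
            # blend this chunk's (unreliable) head with the previous tail
            y[:overlap] = y[:overlap] * fade_in + prev_tail * fade_out
        out[start:start + hop] = y[:hop]
        prev_tail = y[hop:]
    if prev_tail is not None:
        out[start + hop:start + chunk] = prev_tail
    return out
```

The trade-off is latency: each chunk can only be emitted once the next chunk has been processed, so the overlap length adds directly to the output delay.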

UkiTenzai avatar Sep 09 '24 14:09 UkiTenzai

> The encoder model I use is vec-768-layer-12.onnx

How were you able to get it working? In #2298 we have a problem of ONNX not being able to reshape input tensor.

samolego avatar Sep 12 '24 06:09 samolego

> > The encoder model I use is vec-768-layer-12.onnx
>
> How were you able to get it working? In #2298 we have a problem of ONNX not being able to reshape input tensor.

The reshape problem means the model you generated is static. Are you using the ONNX model exported by RVC's bundled export_onnx.py? Guided by the warning messages printed during conversion, you can locate RVC_PATH/infer/lib/infer_pack/attentions.py, remove the code that forcibly casts values to int (it only performs a format conversion), and export again to obtain a correct dynamic ONNX model. Then pick vec-256-* or vec-768-* according to your model version (v1 or v2), and the reshape problem goes away.
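For context on why removing the int casts matters: hard-coded Python `int(...)` casts bake concrete sizes into the traced graph, whereas a dynamic export marks those axes as symbolic via `dynamic_axes`. A rough sketch of what such an export call looks like; the input names and axis positions below are illustrative assumptions, not RVC's exact export code:

```python
# Illustrative input names and axis indices; RVC's export_onnx.py defines
# its own. Axis 1 is assumed here to be the time (frame) dimension.
dynamic_axes = {
    "phone":  {0: "batch", 1: "n_frames"},
    "pitch":  {0: "batch", 1: "n_frames"},
    "pitchf": {0: "batch", 1: "n_frames"},
    "audio":  {0: "batch", 2: "n_samples"},  # model output
}

# The export call would then look roughly like (commented out because it
# needs the actual model and example inputs):
# torch.onnx.export(
#     model, example_inputs, "voice_model.onnx",
#     input_names=["phone", "pitch", "pitchf"],
#     output_names=["audio"],
#     dynamic_axes=dynamic_axes,
# )
```

Any axis not listed in `dynamic_axes` is frozen to the size of the example input used during export, which is exactly the static-shape behavior causing the reshape error.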

EbanShen avatar Sep 12 '24 13:09 EbanShen

> > > The encoder model I use is vec-768-layer-12.onnx
> >
> > How were you able to get it working? In #2298 we have a problem of ONNX not being able to reshape input tensor.
>
> The reshape problem means the model you generated is static. Are you using the ONNX model exported by RVC's bundled export_onnx.py? Guided by the warning messages printed during conversion, you can locate RVC_PATH/infer/lib/infer_pack/attentions.py, remove the code that forcibly casts values to int (it only performs a format conversion), and export again to obtain a correct dynamic ONNX model. Then pick vec-256-* or vec-768-* according to your model version (v1 or v2), and the reshape problem goes away.

Hello. I modified the int64 casting code in /infer/lib/infer_pack/attentions.py, and the export now supports dynamic input. However, the output of the converted ONNX model run through ONNX Runtime differs greatly from loading the .pt model directly, calling eval(), and then forward(). I also replaced the random values in rtrvc.py and model_onnx.py with zeros, but the results still diverge. The input is 20 ms of 16 kHz audio after feature extraction; the feature dimensions follow the cache portion after HuBERT feature extraction, with a time length of 35 frames, while phone, pitch, and pitchf/nsff0 are 7 frames, and the output is 7 frames as well. I changed this to be consistent with the infer() logic in models.py, which looks like: [screenshot of the models.py code omitted]

After the modification, the model's inference input dimensions are: [screenshot of the input shapes omitted]
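When chasing this kind of PyTorch-vs-ONNX mismatch, it helps to quantify the error numerically rather than eyeballing waveforms, since exported graphs often match only up to float tolerance. A generic NumPy sketch; the tolerances are assumptions, and you would substitute the actual forward() and session.run() outputs for the arguments:

```python
import numpy as np

def compare_outputs(ref, test, rtol=1e-3, atol=1e-4):
    """Report max absolute and relative error between a reference output
    (e.g. the .pt model's forward() result) and a test output
    (e.g. ONNX Runtime's session.run() result)."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    abs_err = np.abs(ref - test)
    # small epsilon guards against division by zero in silent regions
    rel_err = abs_err / (np.abs(ref) + 1e-12)
    return {
        "max_abs_err": float(abs_err.max()),
        "max_rel_err": float(rel_err.max()),
        "close": bool(np.allclose(ref, test, rtol=rtol, atol=atol)),
    }
```

If the error is large and structured (e.g. the whole waveform differs, not just the last few frames), the usual suspects are mismatched input shapes or padding between the two pipelines rather than numerical drift from the export itself.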

zhixingheyixsh avatar Jan 13 '25 08:01 zhixingheyixsh