ChatTTS and CosyVoice2-0.5B inference both fail with the same error: Couldn't allocate AVFormatContext

System Info

Ubuntu 22
NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0
Conda 25.7.0
Python 3.11

Running Xinference with Docker?

[ ] docker
[x] pip install
[ ] installation from source

Version info

Version: 1.11.0.post1

The command used to start Xinference

HF_ENDPOINT=https://hf-mirror.com xinference-local --host 0.0.0.0

Reproduction

In a clean conda environment, install only the audio extra with pip install xinference[audio], then start Xinference.
In the web UI, launch ChatTTS or CosyVoice2-0.5B, then click launch UI and run a simple TTS test.
Both models report the same error: Couldn't allocate AVFormatContext. The destination file is <_io.BytesIO object at 0x......>, check the desired extension? Invalid argument
I found https://github.com/xorbitsai/inference/issues/2739, which suggests conda install -c conda-forge "ffmpeg<7". I ran that, but the error is unchanged.
So far F5-TTS is the only model I have been able to run.

Expected behavior

ChatTTS or CosyVoice2-0.5B should run normally.

Oct 23 '25 10:10 qiulang

I ran cosyvoice and had no problems.

Your error looks like it is caused by ffmpeg.

Oct 24 '25 03:10 qinxuye

I followed your suggestion and ran conda install -c conda-forge "ffmpeg<7". Which version would you recommend?

Oct 24 '25 12:10 qiulang

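@qinxuye I rented a clean machine, reinstalled everything, and tested again. It still does not work: cosyvoice raises the error below. So what makes it work on your side? I have now tried on two brand-new machines and hit the same error on both.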

  File "/home/vllm/miniconda3/envs/vllm_env/lib/python3.11/site-packages/torchcodec/_core/ops.py", line 69, in load_torchcodec_shared_libraries
    raise RuntimeError(
      ^^^^^^^^^^^^^^^^^
RuntimeError: [address=0.0.0.0:36463, pid=8918] Could not load libtorchcodec. Likely causes:
          1. FFmpeg is not properly installed in your environment. We support
             versions 4, 5, 6 and 7.
          2. The PyTorch version (2.9.0+cu128) is not compatible with
             this version of TorchCodec. Refer to the version compatibility
             table:
             https://github.com/pytorch/torchcodec?tab=readme-ov-file#installing-torchcodec.
          3. Another runtime dependency; see exceptions below.
        The following exceptions were raised as we tried to load libtorchcodec:

[start of libtorchcodec loading traceback]

On this new machine I followed the instructions at https://github.com/meta-pytorch/torchcodec?tab=readme-ov-file#installing-torchcodec and ran conda install "ffmpeg<8":

(vllm_env) vllm@iZ2zedd5pe69teumom3tcoZ:~/.xinference/logs/local_1761371197520$ ffmpeg -version
ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 11.2.0 (Anaconda gcc)
configuration: --prefix=/home/vllm/miniconda3/envs/vllm_env --cc=/home/task_175975072617190/conda-bld/ffmpeg_1759751615537/_build_env/bin/x86_64-conda-linux-gnu-cc --ar=/home/task_175975072617190/conda-bld/ffmpeg_1759751615537/_build_env/bin/x86_64-conda-linux-gnu-ar --nm=/home/task_175975072617190/conda-bld/ffmpeg_1759751615537/_build_env/bin/x86_64-conda-linux-gnu-nm --ranlib=/home/task_175975072617190/conda-bld/ffmpeg_1759751615537/_build_env/bin/x86_64-conda-linux-gnu-ranlib --strip=/home/task_175975072617190/conda-bld/ffmpeg_1759751615537/_build_env/bin/x86_64-conda-linux-gnu-strip --disable-doc --enable-swresample --enable-swscale --enable-openssl --enable-libxml2 --enable-libtheora --enable-demuxer=dash --enable-postproc --enable-hardcoded-tables --enable-libfreetype --enable-libharfbuzz --enable-libfontconfig --enable-libdav1d --enable-zlib --enable-libaom --enable-pic --enable-shared --disable-static --disable-gpl --enable-version3 --disable-sdl2 --enable-libopenh264 --enable-libopus --enable-libmp3lame --enable-libopenjpeg --enable-libvorbis --enable-pthreads --enable-libtesseract --enable-libvpx
libavutil      58. 29.100 / 58. 29.100
libavcodec     60. 31.102 / 60. 31.102
libavformat    60. 16.100 / 60. 16.100
libavdevice    60.  3.100 / 60.  3.100
libavfilter     9. 12.100 /  9. 12.100
libswscale      7.  5.100 /  7.  5.100
libswresample   4. 12.100 /  4. 12.100
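
The TorchCodec message above lists FFmpeg major versions 4, 5, 6 and 7 as supported. A minimal sketch of checking the local FFmpeg against that list is shown here; the function names are hypothetical, and the regular expression only assumes the banner format seen in the `ffmpeg -version` output above.

```python
import re
import shutil
import subprocess
from typing import Optional

# Major versions named in the TorchCodec error message quoted in this thread.
SUPPORTED_FFMPEG_MAJORS = {4, 5, 6, 7}


def parse_ffmpeg_major(version_output: str) -> Optional[int]:
    """Return the major version from `ffmpeg -version` output, or None if not found."""
    match = re.search(r"ffmpeg version n?(\d+)\.", version_output)
    return int(match.group(1)) if match else None


def installed_ffmpeg_major() -> Optional[int]:
    """Run `ffmpeg -version` if the binary is on PATH; return None otherwise."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-version"], capture_output=True, text=True, check=False
    )
    return parse_ffmpeg_major(result.stdout)


if __name__ == "__main__":
    major = installed_ffmpeg_major()
    if major is None:
        print("ffmpeg not found on PATH, or its version could not be parsed")
    elif major in SUPPORTED_FFMPEG_MAJORS:
        print(f"ffmpeg major version {major} is in the supported list")
    else:
        print(f"ffmpeg major version {major} is outside the supported list")
```

Note that a supported FFmpeg is necessary but, as this thread shows, not sufficient: version 6.1.1 passes this check and the AVFormatContext error still occurred.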

In the end it still fails with RuntimeError: Couldn't allocate AVFormatContext. The destination file is <_io.BytesIO object at 0x7f3028567f10>, check the desired extension? Invalid argument

Oct 25 '25 06:10 qiulang

I finally found the cause! With a plain pip install xinference[audio], torch and torchcodec are both installed at their latest versions:

torch                      2.9.0
torch-complex              0.4.4
torchaudio                 2.9.0
torchcodec                 0.8.0

The current xinference release does not seem to work with these versions. I tried modifying ..model/audio/utils.py without success, so I simply downgraded:

uv pip install torch==2.8.0 torchaudio==2.8.0 torchvision==0.23.0 torchcodec==0.7.0

and then it worked.

Oct 25 '25 13:10 qiulang

OK, we will look into what is going wrong with the latest versions.

Oct 26 '25 11:10 qinxuye

One correction: torchvision==0.23.0 is for video and is not needed here.

I think the problem is torchaudio.save. From the release notes at https://github.com/pytorch/audio/releases/tag/v2.9.0:

torchaudio.load() and torchaudio.save() still exist, but their underlying implementation now relies on TorchCodec.

That is why I tried to modify ./xinference/model/audio/utils.py:

def audio_to_bytes(response_format: str, sample_rate: int, tensor: "torch.Tensor"):
    import torchaudio

    response_pcm = response_format.lower() == "pcm"
    with io.BytesIO() as out:
        if response_pcm:
            logger.info(f"PCM output, num_channels: 1, sample_rate: {sample_rate}")
            torchaudio.save(out, tensor, sample_rate, format="wav", encoding="PCM_S")
            # http://soundfile.sapp.org/doc/WaveFormat
            return _extract_pcm_from_wav_bytes(out.getvalue())
        else:
            torchaudio.save(out, tensor, sample_rate, format=response_format)
            return out.getvalue()
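
The PCM branch above first writes a WAV file into memory and then strips the container with `_extract_pcm_from_wav_bytes`, whose body is not shown in this thread. As a rough illustration of that idea only, and not xinference's actual implementation, the standard-library `wave` module can pull the raw sample bytes out of an in-memory WAV. Both helper names below are hypothetical.

```python
import io
import wave


def extract_pcm_frames(wav_bytes: bytes) -> bytes:
    """Return the raw PCM sample bytes stored in an in-memory WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.readframes(wav.getnframes())


def make_wav_bytes(
    pcm: bytes, sample_rate: int, channels: int = 1, sample_width: int = 2
) -> bytes:
    """Wrap raw PCM bytes in a WAV container (16-bit mono by default)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    # wave does not close a file object it did not open, so buf is still readable.
    return buf.getvalue()


if __name__ == "__main__":
    pcm = b"\x01\x00" * 16000  # one second of 16-bit mono samples at 16 kHz
    wav_bytes = make_wav_bytes(pcm, sample_rate=16000)
    print(len(wav_bytes), len(extract_pcm_frames(wav_bytes)))
```

The `wave` module only reads uncompressed PCM WAV data, so this sketch would not apply to other containers such as MP3 or FLAC.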

Oct 26 '25 13:10 qiulang

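One more suggestion: this part of the dependency list is written too loosely. An earlier problem of mine came from not using pynini 2.1.6, which left Python 3.12 unable to work. This time the problem comes from not pinning the torchaudio and torch versions.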

audio =
    funasr==1.2.7
    omegaconf~=2.3.0
    nemo_text_processing<1.1.0; sys_platform == 'linux'  # 1.1.0 requires pynini==2.1.6.post1
    WeTextProcessing<1.0.4; sys_platform == 'linux'  # 1.0.4 requires pynini==2.1.6
    librosa
    xxhash
    torchaudio
    ChatTTS>=0.2.1
    tiktoken  # For CosyVoice, openai-whisper
    torch>=2.0.0  # For CosyVoice, matcha

Until torchaudio 2.9 is supported, it would be best to change it like this:

    torch>=2.0.0,<2.9.0  # Pin to <2.9.0 to avoid BytesIO issues with torchcodec
    torchaudio>=2.0.0,<2.9.0  # Pin to <2.9.0 to maintain compatibility
    torchcodec>=0.6.0,<0.8.0  # Compatible with torch 2.8
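
To see whether an existing environment already falls outside these proposed pins, a small check along the following lines may help. It is only a sketch: the upper bounds are copied from the three lines above, the helper names are hypothetical, and the parsing assumes plain `major.minor...` version strings, optionally with a local tag such as `+cu128`.

```python
from importlib import metadata
from typing import Tuple

# First excluded (major, minor) release for each package, taken from the proposed pins.
UPPER_BOUNDS = {
    "torch": (2, 9),
    "torchaudio": (2, 9),
    "torchcodec": (0, 8),
}


def major_minor(version: str) -> Tuple[int, int]:
    """Parse '2.8.0+cu128' into (2, 8); the local build tag is ignored."""
    public = version.split("+", 1)[0]
    major, minor = public.split(".")[:2]
    return int(major), int(minor)


def is_below_bound(version: str, bound: Tuple[int, int]) -> bool:
    """True when the version is strictly below the excluded (major, minor) release."""
    return major_minor(version) < bound


def report() -> None:
    """Print the status of each pinned package in the current environment."""
    for name, bound in UPPER_BOUNDS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed")
            continue
        status = "ok" if is_below_bound(installed, bound) else "outside the proposed pin"
        print(f"{name} {installed}: {status}")


if __name__ == "__main__":
    report()
```

With the versions reported earlier in this thread, torch 2.9.0 and torchcodec 0.8.0 would be flagged, while torch 2.8.0 and torchcodec 0.7.0 would pass.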

Oct 26 '25 14:10 qiulang

@qinxuye With torch==2.9.0 and torchaudio==2.9.0, I tried the following change and it seems to work:

def audio_to_bytes(response_format: str, sample_rate: int, tensor: "torch.Tensor"):
    import soundfile as sf
    
    response_pcm = response_format.lower() == "pcm"
    
    # Convert tensor to numpy and transpose to [time, channel] for soundfile
    audio_np = tensor.cpu().numpy().T if tensor.ndim == 2 else tensor.cpu().numpy()
    
    with io.BytesIO() as out:
        if response_pcm:
            logger.info(f"PCM output, num_channels: 1, sample_rate: {sample_rate}")
            sf.write(out, audio_np, sample_rate, format="WAV", subtype="PCM_16")
            return _extract_pcm_from_wav_bytes(out.getvalue())
        else:
            sf.write(out, audio_np, sample_rate, format=response_format.upper())
            return out.getvalue()

In short, torchaudio.save() is replaced with sf.write().

I wrote a test script and arrived at this change from the warnings it prints:

(test_torch29) vllm@VM-3-219-ubuntu:~/f5-tts-test$ python test_audio_save.py
Testing torchaudio.save() with BytesIO...
torch version: 2.8.0+cu128
torchaudio version: 2.8.0+cu128
/home/vllm/miniconda3/envs/test_torch29/lib/python3.11/site-packages/torchaudio/_backend/utils.py:337: 
UserWarning: In 2.9, this function's implementation will be changed to use torchaudio.save_with_torchcodec` under the hood. 
Some parameters like format, encoding, bits_per_sample, buffer_size, and ``backend`` will be ignored. We recommend that you port your code to rely directly on TorchCodec's encoder instead: 
https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.encoders.AudioEncoder
  warnings.warn(
/home/vllm/miniconda3/envs/test_torch29/lib/python3.11/site-packages/torchaudio/_backend/ffmpeg.py:247: UserWarning: torio.io._streaming_media_encoder.StreamingMediaEncoder has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see https://github.com/pytorch/audio/issues/3902 for more information. It will be removed from the 2.9 release.
  s = torchaudio.io.StreamWriter(uri, format=muxer, buffer_size=buffer_size)
✓ torchaudio.save with BytesIO: SUCCESS
✓ soundfile workaround: SUCCESS (32044 bytes)

Oct 27 '25 02:10 qiulang

Here is my test script. I ran it under both 2.9 and 2.8 and compared the results.

# test_audio_save.py
import io
import torch
import torchaudio

# Create sample audio tensor
sample_rate = 16000
duration = 1  # 1 second
tensor = torch.randn(1, sample_rate * duration)  # [channel, time]

print("Testing torchaudio.save() with BytesIO...")
print(f"torch version: {torch.__version__}")
print(f"torchaudio version: {torchaudio.__version__}")

# Test 1: Try torchaudio.save with BytesIO (should fail in 2.9)
try:
    with io.BytesIO() as out:
        torchaudio.save(out, tensor, sample_rate, format="wav")
        print("✓ torchaudio.save with BytesIO: SUCCESS")
except Exception as e:
    print(f"✗ torchaudio.save with BytesIO: FAILED")
    print(f"  Error: {e}")

# Test 2: Try soundfile workaround
try:
    import soundfile as sf
    import numpy as np
    
    audio_np = tensor.cpu().numpy()
    if audio_np.ndim == 2:
        audio_np = audio_np.T
    
    with io.BytesIO() as out:
        sf.write(out, audio_np, sample_rate, format="WAV")
        result = out.getvalue()
        print(f"✓ soundfile workaround: SUCCESS ({len(result)} bytes)")
except Exception as e:
    print(f"✗ soundfile workaround: FAILED")
    print(f"  Error: {e}")

Oct 27 '25 02:10 qiulang

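@qinxuye Supporting torchaudio 2.9 turns out to be harder than I expected:

FishSpeech-1.5 fails with module 'torchaudio' has no attribute 'list_audio_backends'. I compared the torchaudio 2.8 and 2.9 source and found that list_audio_backends was removed in 2.9.
CosyVoice2-0.5B needs the wetext package installed, and ChatTTS needs transformers downgraded from the latest release to transformers==4.53.2 before it will launch. After that, both TTS models fail with Couldn't allocate AVFormatContext when generating speech. That can be fixed with the change I described earlier: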

def audio_to_bytes(response_format: str, sample_rate: int, tensor: "torch.Tensor"):
    import soundfile as sf
    
    response_pcm = response_format.lower() == "pcm"
    
    # Convert tensor to numpy and transpose to [time, channel] for soundfile
    audio_np = tensor.cpu().numpy().T if tensor.ndim == 2 else tensor.cpu().numpy()
    
    with io.BytesIO() as out:
        if response_pcm:
            logger.info(f"PCM output, num_channels: 1, sample_rate: {sample_rate}")
            sf.write(out, audio_np, sample_rate, format="WAV", subtype="PCM_16")
            return _extract_pcm_from_wav_bytes(out.getvalue())
        else:
            sf.write(out, audio_np, sample_rate, format=response_format.upper())
            return out.getvalue()

index-tts fails with cannot import name 'SequenceSummary' from 'transformers.modeling_utils'.
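
For the FishSpeech-1.5 failure listed above, one defensive pattern is to look up list_audio_backends with getattr instead of calling it directly. The sketch below is an assumption-laden illustration, not a tested fix: the helper name is hypothetical, the backend names in the stand-in are only example values, and whether FishSpeech can proceed with an empty backend list has not been verified. Stand-in objects are used so the example runs without torchaudio installed.

```python
from types import SimpleNamespace
from typing import Any, List


def list_backends_or_empty(torchaudio_module: Any) -> List[str]:
    """Call list_audio_backends() if the module still provides it, else return []."""
    fn = getattr(torchaudio_module, "list_audio_backends", None)
    if callable(fn):
        return list(fn())
    return []


# Stand-ins for the two API shapes: one that still has the function, one without it.
old_api = SimpleNamespace(list_audio_backends=lambda: ["ffmpeg", "soundfile"])
new_api = SimpleNamespace()

print(list_backends_or_empty(old_api))
print(list_backends_or_empty(new_api))
```

In real code the argument would be the imported torchaudio module itself.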

So I did not submit a PR with the code changes. For now I only submitted https://github.com/xorbitsai/inference/pull/4178/ to pin the 2.8 versions.

Oct 27 '25 13:10 qiulang

This issue is stale because it has been open for 7 days with no activity.

Nov 03 '25 19:11 github-actions[bot]

This issue was closed because it has been inactive for 5 days since being marked as stale.

Nov 16 '25 19:11 github-actions[bot]

@qiulang May I ask why the voice cannot be changed, and why selecting English output still produces only Chinese speech?

Nov 18 '25 04:11 chenyucong1