fish-speech.tools.api_server --compile Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run.
Self Checks
- [x] This template is only for bug reports. For questions, please visit Discussions.
- [x] I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit issues in English; otherwise they will be closed. Thanks! :)
- [x] Please do not modify this template and fill in all required fields.
Cloud or Self Hosted
Self Hosted (Source)
Environment Details
Windows 10, Python 3.11, torch==2.6.0+cu126, latest Triton for Windows
Steps to Reproduce
I run the command:
python -m fish-speech.tools.api_server --listen 0.0.0.0:8080 --llama-checkpoint-path "checkpoints/fish-speech-1.5" --decoder-checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" --decoder-config-name firefly_gan_vq --compile
✔️ Expected Behavior
I expect the Fish Speech server to start and compile with Torch so that inference is fast (I need real-time TTS).
❌ Actual Behavior
INFO: Started server process [29352]
INFO: Waiting for application startup.
2025-05-07 13:21:20.841 | INFO | fish_speech.models.text2semantic.inference:load_model:683 - Restored model from checkpoint
2025-05-07 13:21:20.841 | INFO | fish_speech.models.text2semantic.inference:load_model:689 - Using DualARTransformer
2025-05-07 13:21:20.842 | INFO | fish_speech.models.text2semantic.inference:load_model:697 - Compiling function...
2025-05-07 13:21:20.907 | INFO | tools.server.model_manager:load_llama_model:99 - LLAMA model loaded.
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\vector_quantize_pytorch.py:445: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\vector_quantize_pytorch.py:630: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\finite_scalar_quantization.py:147: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\lookup_free_quantization.py:209: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
2025-05-07 13:21:23.808 | INFO | fish_speech.models.vqgan.inference:load_model:46 - Loaded model: <All keys matched successfully>
2025-05-07 13:21:23.809 | INFO | tools.server.model_manager:load_decoder_model:107 - Decoder model loaded.
2025-05-07 13:21:23.824 | INFO | fish_speech.models.text2semantic.inference:generate_long:790 - Encoded text: Hello world.
2025-05-07 13:21:23.826 | INFO | fish_speech.models.text2semantic.inference:generate_long:808 - Generating sentence 1/1 of sample 1/1
0%| | 0/1023 [00:00<?, ?it/s]D:\Python\Python311\Lib\contextlib.py:105: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
0%| | 1/1023 [03:45<64:03:51, 225.67s/it]D:\Python\Python311\Lib\contextlib.py:105: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
0%| | 1/1023 [03:45<64:04:06, 225.68s/it]
ERROR: Traceback (most recent call last):
File "D:\Python\Python311\Lib\site-packages\kui\asgi\lifespan.py", line 36, in call
await result
File "D:\2025\Call Center Agent X\fish-speech\tools\api_server.py", line 83, in initialize_app
app.state.model_manager = ModelManager(
^^^^^^^^^^^^^
File "D:\2025\Call Center Agent X\fish-speech\tools\server\model_manager.py", line 65, in init
self.warm_up(self.tts_inference_engine)
File "D:\2025\Call Center Agent X\fish-speech\tools\server\model_manager.py", line 121, in warm_up
list(inference(request, tts_inference_engine))
File "D:\2025\Call Center Agent X\fish-speech\tools\server\inference.py", line 25, in inference_wrapper
raise HTTPException(
baize.exceptions.HTTPException: (500, 'Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: File "D:\\2025\\Call Center Agent X\\fish-speech\\fish_speech\\models\\text2semantic\\inference.py", line 307, in decode_one_token_ar\n codebooks = torch.stack(codebooks, dim=0). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.')
ERROR: Application startup failed. Exiting.
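For what it's worth, the error message itself points at two possible fixes: clone the tensor that comes out of the compiled function, or call torch.compiler.cudagraph_mark_step_begin() before each invocation. A minimal sketch of the idea (the function and variable names below are illustrative, not fish-speech's actual code in decode_one_token_ar):

```python
import torch

# Sketch only: any tensor returned by a CUDA-graph-compiled function is
# backed by a static buffer that the next graph replay will overwrite.
# Either mark a new step before each call, or clone the output before
# the next invocation can clobber it.

def safe_decode_loop(compiled_decode_fn, tokens, steps):
    outputs = []
    for _ in range(steps):
        # Option 1: tell the CUDA graph runner a new iteration begins.
        torch.compiler.cudagraph_mark_step_begin()
        next_token = compiled_decode_fn(tokens)
        # Option 2: clone outside of torch.compile() so this result
        # survives the next replay overwriting its buffer.
        outputs.append(next_token.clone())
        tokens = next_token
    return outputs
```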
Excuse me, has this problem been resolved?
I have this issue too. However, starting it with python tools/api_server.py --listen 0.0.0.0:7860 --compile does not raise the error at startup. The first time I synthesize audio, the same error still appears, but after that synthesis keeps working, unaffected, and it really is much faster.
Correction: I meant python tools/run_webui.py --compile.
This issue is stale because it has been open for 30 days with no activity.
Hello @corporate9601 @libo5410391,
Has anyone managed to solve the issue? Your help will be much appreciated, thanks.
I made a temporary workaround: I commented out the warm-up execution and then sent two requests. After that, the server worked great.
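In case it helps anyone reproduce this: comment out the self.warm_up(self.tts_inference_engine) call in tools/server/model_manager.py, start the server, and fire two throwaway requests before real traffic. A rough client-side sketch (the /v1/tts endpoint and JSON payload below are assumptions; check the API of your fish-speech version):

```python
import requests

BASE_URL = "http://127.0.0.1:8080"  # matches --listen 0.0.0.0:8080 above

def warm_up(n=2):
    """Send n throwaway TTS requests so the compiled model warms up."""
    for i in range(n):
        try:
            r = requests.post(
                f"{BASE_URL}/v1/tts",      # assumed endpoint
                json={"text": "warm up"},  # assumed payload shape
                timeout=600,               # the first compiled call is very slow
            )
            print(f"warm-up {i + 1}: HTTP {r.status_code}")
        except requests.RequestException as exc:
            # The first request(s) may fail with the CUDAGraphs error
            # above; subsequent ones then succeed.
            print(f"warm-up {i + 1} failed: {exc}")

if __name__ == "__main__":
    warm_up()
```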
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hi @mostafaramadann, can you share what you did? I got rid of the warmup by commenting out the self.warm_up call in tools/server/model_manager.py, but when I send a request I get this error.
Oh, I see what you mean. You ran two requests, both of which fail with an error, and then the third and subsequent ones work. There must be a better way.
To actually solve this: your torch version is too recent. You need torch<2.5.1 due to changes in the torch.compile function. Unfortunately, older torch does not work with newer CUDA versions :( so I will live with the crash for now.
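If you want to try the downgrade, something like the line below should pin torch below 2.5.1 (the 2.4.1/cu124 pairing is an assumption on my side; pick the wheel index that matches your CUDA install):

pip install "torch==2.4.1" --index-url https://download.pytorch.org/whl/cu124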
> Oh, I see what you mean. You ran two requests, both of which fail with an error, and then the third and subsequent ones work. There must be a better way.
exactly!