fish-speech.tools.api_server --compile Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run.
Self Checks
- [x] This template is only for bug reports. For questions, please visit Discussions.
- [x] I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit issues in English; otherwise they will be closed. Thanks! :)
- [x] Please do not modify this template and fill in all required fields.
Cloud or Self Hosted
Self Hosted (Source)
Environment Details
Windows 10, Python 3.11, torch==2.6.0+cu126, latest Triton for Windows
Steps to Reproduce
I run the command:
python -m fish-speech.tools.api_server --listen 0.0.0.0:8080 --llama-checkpoint-path "checkpoints/fish-speech-1.5" --decoder-checkpoint-path "checkpoints/fish-speech-1.5/firefly-gan-vq-fsq-8x1024-21hz-generator.pth" --decoder-config-name firefly_gan_vq --compile
✔️ Expected Behavior
I expect the Fish Speech server to start and compile with Torch so that inference is fast (I need real-time TTS).
❌ Actual Behavior
INFO: Started server process [29352]
INFO: Waiting for application startup.
2025-05-07 13:21:20.841 | INFO | fish_speech.models.text2semantic.inference:load_model:683 - Restored model from checkpoint
2025-05-07 13:21:20.841 | INFO | fish_speech.models.text2semantic.inference:load_model:689 - Using DualARTransformer
2025-05-07 13:21:20.842 | INFO | fish_speech.models.text2semantic.inference:load_model:697 - Compiling function...
2025-05-07 13:21:20.907 | INFO | tools.server.model_manager:load_llama_model:99 - LLAMA model loaded.
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\vector_quantize_pytorch.py:445: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\vector_quantize_pytorch.py:630: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\finite_scalar_quantization.py:147: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
D:\Python\Python311\Lib\site-packages\vector_quantize_pytorch\lookup_free_quantization.py:209: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
@autocast(enabled = False)
2025-05-07 13:21:23.808 | INFO | fish_speech.models.vqgan.inference:load_model:46 - Loaded model: <All keys matched successfully>
2025-05-07 13:21:23.809 | INFO | tools.server.model_manager:load_decoder_model:107 - Decoder model loaded.
2025-05-07 13:21:23.824 | INFO | fish_speech.models.text2semantic.inference:generate_long:790 - Encoded text: Hello world.
2025-05-07 13:21:23.826 | INFO | fish_speech.models.text2semantic.inference:generate_long:808 - Generating sentence 1/1 of sample 1/1
0%| | 0/1023 [00:00<?, ?it/s]D:\Python\Python311\Lib\contextlib.py:105: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
0%| | 1/1023 [03:45<64:03:51, 225.67s/it]D:\Python\Python311\Lib\contextlib.py:105: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
0%| | 1/1023 [03:45<64:04:06, 225.68s/it]
ERROR: Traceback (most recent call last):
File "D:\Python\Python311\Lib\site-packages\kui\asgi\lifespan.py", line 36, in call
await result
File "D:\2025\Call Center Agent X\fish-speech\tools\api_server.py", line 83, in initialize_app
app.state.model_manager = ModelManager(
^^^^^^^^^^^^^
File "D:\2025\Call Center Agent X\fish-speech\tools\server\model_manager.py", line 65, in init
self.warm_up(self.tts_inference_engine)
File "D:\2025\Call Center Agent X\fish-speech\tools\server\model_manager.py", line 121, in warm_up
list(inference(request, tts_inference_engine))
File "D:\2025\Call Center Agent X\fish-speech\tools\server\inference.py", line 25, in inference_wrapper
raise HTTPException(
baize.exceptions.HTTPException: (500, 'Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: File "D:\\2025\\Call Center Agent X\\fish-speech\\fish_speech\\models\\text2semantic\\inference.py", line 307, in decode_one_token_ar\n codebooks = torch.stack(codebooks, dim=0). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.')
ERROR: Application startup failed. Exiting.
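For what it's worth, the error message itself points at two possible fixes: clone the tensor that comes out of the compiled function, or call torch.compiler.cudagraph_mark_step_begin() before each invocation. A minimal sketch of the idea (the function and variable names below are illustrative, not fish-speech's actual code in decode_one_token_ar):

```python
import torch

# Sketch only: any tensor returned by a CUDA-graph-compiled function is
# backed by a static buffer that the next graph replay will overwrite.
# Either mark a new step before each call, or clone the output before
# the next invocation can clobber it.

def safe_decode_loop(compiled_decode_fn, tokens, steps):
    outputs = []
    for _ in range(steps):
        # Option 1: tell the CUDA graph runner a new iteration begins.
        torch.compiler.cudagraph_mark_step_begin()
        next_token = compiled_decode_fn(tokens)
        # Option 2: clone outside of torch.compile() so this result
        # survives the next replay overwriting its buffer.
        outputs.append(next_token.clone())
        tokens = next_token
    return outputs
```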
Excuse me, has this problem been resolved?
I have this issue too. However, starting it with python tools/api_server.py --listen 0.0.0.0:7860 --compile does not raise the error at startup. The first time I synthesize audio, the same error still appears, but after that synthesis keeps working, unaffected, and it really is much faster.
Correction: I meant python tools/run_webui.py --compile.
This issue is stale because it has been open for 30 days with no activity.
Hello @corporate9601 @libo5410391,
Has anyone managed to solve the issue? Your help will be much appreciated, thanks.
I made a temporary workaround: I commented out the warm-up execution and then sent two requests. After that, the server worked great.
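In case it helps anyone reproduce this: comment out the self.warm_up(self.tts_inference_engine) call in tools/server/model_manager.py, start the server, and fire two throwaway requests before real traffic. A rough client-side sketch (the /v1/tts endpoint and JSON payload below are assumptions; check the API of your fish-speech version):

```python
import requests

BASE_URL = "http://127.0.0.1:8080"  # matches --listen 0.0.0.0:8080 above

def warm_up(n=2):
    """Send n throwaway TTS requests so the compiled model warms up."""
    for i in range(n):
        try:
            r = requests.post(
                f"{BASE_URL}/v1/tts",      # assumed endpoint
                json={"text": "warm up"},  # assumed payload shape
                timeout=600,               # the first compiled call is very slow
            )
            print(f"warm-up {i + 1}: HTTP {r.status_code}")
        except requests.RequestException as exc:
            # The first request(s) may fail with the CUDAGraphs error
            # above; subsequent ones then succeed.
            print(f"warm-up {i + 1} failed: {exc}")

if __name__ == "__main__":
    warm_up()
```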
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hi @mostafaramadann, can you share what you did? I got rid of the warmup by commenting out the self.warm_up call in tools/server/model_manager.py, but when I send a request I get this error.
Oh, I see what you mean. You ran two requests, both of which fail with an error, and then the third and subsequent ones work. There must be a better way.
To actually solve this: your torch version is too recent. You need torch<2.5.1 due to changes in the torch.compile function. Unfortunately, older torch does not work with newer CUDA versions :( so I will live with the crash for now.
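If you want to try the downgrade, something like the line below should pin torch below 2.5.1 (the 2.4.1/cu124 pairing is an assumption on my side; pick the wheel index that matches your CUDA install):

pip install "torch==2.4.1" --index-url https://download.pytorch.org/whl/cu124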
> Oh, I see what you mean. You ran two requests, both of which fail with an error, and then the third and subsequent ones work. There must be a better way.
exactly!