WebUI generates the audio but cannot play it.
Self Checks
- [X] This template is only for bug reports. For questions, please visit Discussions.
- [X] I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [X] Please do not modify this template and fill in all required fields.
Cloud or Self Hosted
Self Hosted (Source)
Environment Details
CentOS 7, Python 3.11, torch==2.4.1+cu124, torchvision==0.19.1+cu124, torchaudio==2.4.1+cu124, gradio==5.7.0
Steps to Reproduce
1. Run the WebUI:
python tools/webui.py \
--llama-checkpoint-path checkpoints/fish-speech-1.4 \
--decoder-checkpoint-path checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth
2. Synthesize audio from text and a reference audio clip in the WebUI:
Set the "Input Text" and "Reference Audio" fields, then click "Generate".
✔️ Expected Behavior
The audio should be synthesized in the WebUI and be playable directly on the page.
❌ Actual Behavior
In the WebUI I can synthesize the audio, but it cannot be played in the browser; I can only listen to it after downloading the audio file to my local machine.
I have read issue "webui is failed to generate audio, no error reports on the backend log" #610, but it contains no answer.
GPU type: NVIDIA A10
The WebUI log:
2024-11-28 01:44:26.544 | INFO | tools.api:encode_reference:167 - Loaded audio with 5.42 seconds
/myplc/.aigc/miniconda3/envs/fish-speech/lib/python3.11/site-packages/vector_quantize_pytorch/residual_fsq.py:170: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with autocast(enabled = False):
2024-11-28 01:44:26.802 | INFO | tools.api:encode_reference:175 - Encoded prompt: torch.Size([8, 117])
2024-11-28 01:44:26.805 | INFO | tools.llama.generate:generate_long:759 - Encoded text: 这学校真不是一般的寒酸,统共只有一幢楼房,两层高,楼下是教室,楼上是办公室。
2024-11-28 01:44:26.806 | INFO | tools.llama.generate:generate_long:759 - Encoded text: 六间教室,一年级和二年级八个班的学生只能轮番上课,读到三年级就直接送到工厂里去实习,找不到实习单位就在家睡觉,搞得像山区小学一样。
2024-11-28 01:44:26.807 | INFO | tools.llama.generate:generate_long:759 - Encoded text: 该校没有操场,体育老师倒有三个。
2024-11-28 01:44:26.807 | INFO | tools.llama.generate:generate_long:777 - Generating sentence 1/3 of sample 1/1
0%| | 0/3914 [00:00<?, ?it/s]/myplc/.aigc/miniconda3/envs/fish-speech/lib/python3.11/contextlib.py:105: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
self.gen = func(*args, **kwds)
5%|██████▎ | 177/3914 [00:08<03:05, 20.16it/s]
2024-11-28 01:44:35.657 | INFO | tools.llama.generate:generate_long:832 - Generated 179 tokens in 8.85 seconds, 20.23 tokens/sec
2024-11-28 01:44:35.658 | INFO | tools.llama.generate:generate_long:835 - Bandwidth achieved: 10.00 GB/s
2024-11-28 01:44:35.658 | INFO | tools.llama.generate:generate_long:840 - GPU Memory used: 1.38 GB
2024-11-28 01:44:35.658 | INFO | tools.llama.generate:generate_long:777 - Generating sentence 2/3 of sample 1/1
2024-11-28 01:44:35.661 | INFO | tools.api:decode_vq_tokens:189 - VQ features: torch.Size([8, 178])
8%|██████████▉ | 287/3685 [00:14<02:53, 19.59it/s]
2024-11-28 01:44:50.460 | INFO | tools.llama.generate:generate_long:832 - Generated 289 tokens in 14.80 seconds, 19.53 tokens/sec
2024-11-28 01:44:50.460 | INFO | tools.llama.generate:generate_long:835 - Bandwidth achieved: 9.66 GB/s
2024-11-28 01:44:50.461 | INFO | tools.llama.generate:generate_long:840 - GPU Memory used: 1.54 GB
2024-11-28 01:44:50.461 | INFO | tools.llama.generate:generate_long:777 - Generating sentence 3/3 of sample 1/1
2024-11-28 01:44:50.464 | INFO | tools.api:decode_vq_tokens:189 - VQ features: torch.Size([8, 288])
2%|██▊ | 68/3375 [00:03<02:48, 19.60it/s]
2024-11-28 01:44:54.086 | INFO | tools.llama.generate:generate_long:832 - Generated 70 tokens in 3.62 seconds, 19.32 tokens/sec
2024-11-28 01:44:54.086 | INFO | tools.llama.generate:generate_long:835 - Bandwidth achieved: 9.55 GB/s
2024-11-28 01:44:54.087 | INFO | tools.llama.generate:generate_long:840 - GPU Memory used: 1.64 GB
2024-11-28 01:44:54.088 | INFO | tools.api:decode_vq_tokens:189 - VQ features: torch.Size([8, 69])
/myplc/.aigc/miniconda3/envs/fish-speech/lib/python3.11/site-packages/gradio/processing_utils.py:738: UserWarning: Trying to convert audio automatically from float32 to 16-bit int format.
warnings.warn(warning.format(data.dtype))
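The last warning in the log shows Gradio converting the returned waveform from float32 to 16-bit int on the fly. A minimal sketch of doing that conversion explicitly before handing the array to Gradio (assuming the WebUI returns a float32 NumPy array in the range [-1.0, 1.0]; `to_int16` is a hypothetical helper, not part of the fish-speech codebase):

```python
import numpy as np

def to_int16(audio: np.ndarray) -> np.ndarray:
    """Convert float32 audio in [-1.0, 1.0] to 16-bit PCM.

    Converting explicitly avoids Gradio's automatic float32 -> int16
    conversion and the UserWarning seen in the log above.
    """
    # Clip first so out-of-range samples don't wrap around on cast.
    audio = np.clip(audio, -1.0, 1.0)
    return (audio * 32767.0).astype(np.int16)

samples = np.array([0.0, 0.5, -1.0], dtype=np.float32)
print(to_int16(samples))  # 16-bit PCM values
```

This is only a workaround for the dtype warning; it may not be the cause of the playback failure itself (a browser codec or MIME-type issue is also possible).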