How can I integrate a local TTS API and STT API?
### Self Checks
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] Please do not modify this template :) and fill in all the required fields.
### Provide a description of requested docs changes
How can I integrate a local TTS API and a local STT API, such as edge-tts and whisper? Could someone point me in the right direction?
Hey @zhanghuanhuanlive! 👋 I'm here to help you with any bugs, questions, or contributions. I'm a bot and I can assist you while you wait for a human maintainer.
To integrate a local TTS API like edge-tts and a local STT API like whisper in your project, you can follow this example:
### Integrating whisper for Speech-to-Text (STT)
First, create a client class that sends audio to a local whisper-compatible transcription endpoint:
```python
import requests
from yarl import URL


class LocalAISpeech2text:
    def __init__(self, server_url: str):
        self.server_url = server_url

    def transcribe_audio(self, model: str, audio_file_path: str) -> str:
        url = str(URL(self.server_url) / "v1/audio/transcriptions")
        data = {"model": model}
        # Use a context manager so the audio file handle is closed after the request.
        with open(audio_file_path, "rb") as audio_file:
            response = requests.post(url, data=data, files={"file": audio_file})
        response.raise_for_status()
        payload = response.json()
        if "error" in payload:
            raise Exception("Error in transcription: " + payload["error"])
        return payload["text"]


# Example usage
stt = LocalAISpeech2text(server_url="http://localhost:5000")
transcription = stt.transcribe_audio(model="whisper", audio_file_path="path/to/audio/file.wav")
print(transcription)
```
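Note that `LocalAISpeech2text` is only an HTTP client; it assumes a transcription server is already listening at `http://localhost:5000/v1/audio/transcriptions`. If you don't have one yet, a minimal sketch built on Flask and the open-source `openai-whisper` package could look like the following (the route, port, and model size are assumptions, not an official API):
```python
# Hypothetical local transcription server for the client above.
# Assumes `pip install flask openai-whisper`; the endpoint path and port are arbitrary choices.
import tempfile

import whisper  # openai-whisper package
from flask import Flask, jsonify, request

app = Flask(__name__)
model = whisper.load_model("base")  # pick a model size that fits your hardware

@app.route("/v1/audio/transcriptions", methods=["POST"])
def transcribe():
    uploaded = request.files.get("file")
    if uploaded is None:
        return jsonify({"error": "missing 'file' field"}), 400
    # Whisper expects a file path, so persist the upload to a temporary file first.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        uploaded.save(tmp.name)
        result = model.transcribe(tmp.name)
    return jsonify({"text": result["text"]})

if __name__ == "__main__":
    app.run(port=5000)
```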
### Integrating edge-tts for Text-to-Speech (TTS)
Next, create a function to handle the TTS using edge-tts:
```python
import asyncio

import edge_tts


async def text_to_speech(text: str, output_file: str):
    # Synthesize the text with a built-in voice and write the audio to disk.
    communicate = edge_tts.Communicate(text, "en-US-JennyNeural")
    await communicate.save(output_file)


# Example usage
text = "Hello, this is a test."
output_file = "output_audio.mp3"
asyncio.run(text_to_speech(text, output_file))
```
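`en-US-JennyNeural` is just one built-in voice. If your installed edge-tts version exposes the async `list_voices()` helper, you can enumerate alternatives with a small sketch like this (the helper and the result keys are assumptions about your installed version):
```python
# Sketch: list available edge-tts voices, filtered to English locales.
# Assumes edge_tts.list_voices() is available in your installed version.
import asyncio

import edge_tts


async def english_voices():
    voices = await edge_tts.list_voices()
    return [v["ShortName"] for v in voices if v["Locale"].startswith("en-")]


print(asyncio.run(english_voices()))
```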
### Putting It All Together
Create a simple script that uses both the STT and TTS functions:
```python
import asyncio

# STT part
stt = LocalAISpeech2text(server_url="http://localhost:5000")
transcription = stt.transcribe_audio(model="whisper", audio_file_path="path/to/audio/file.wav")
print("Transcription:", transcription)

# TTS part
text = transcription  # or any other text
output_file = "output_audio.mp3"
asyncio.run(text_to_speech(text, output_file))
print(f"Audio saved to {output_file}")
```
This script will transcribe an audio file using whisper and then convert the transcribed text back to speech using edge-tts. Adjust the server URL, model names, and file paths as needed for your specific setup [1][2][3].
To continue talking to Dosu, mention @dosu.
Thanks! Actually, what I want to ask is how to integrate my already developed Whisper and Edge-TTS services into the Dify platform.
To integrate your already developed Whisper (STT) and Edge-TTS (TTS) into the Dify platform, you need to modify the AudioService class to use your implementations. Here’s how you can do it:
### 1. Modify AudioService to Use Whisper for STT
```python
# services/audio_service.py
from whisper import Whisper  # Import your Whisper implementation


class AudioService:
    @staticmethod
    def transcript_asr(app_model, file, end_user):
        # Use Whisper for speech-to-text
        whisper_model = Whisper()
        transcript = whisper_model.transcribe(file)
        return {'transcript': transcript}
```
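The `Whisper` class here stands for your own implementation, as the import comment says; it is not a ready-made class you can construct like this from the open-source package. If your implementation is built on `openai-whisper`, a minimal sketch could look like the following; the class and method names are placeholders chosen to match the call sites above, and the file handling assumes a Flask/Werkzeug upload object:
```python
# Hypothetical wrapper matching the `Whisper().transcribe(file)` call used above.
# Built on the open-source `openai-whisper` package; names are placeholders.
import tempfile

import whisper as openai_whisper


class Whisper:
    def __init__(self, model_size: str = "base"):
        self._model = openai_whisper.load_model(model_size)

    def transcribe(self, file) -> str:
        # `file` is the uploaded FileStorage object; whisper wants a file path.
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            file.save(tmp.name)
            result = self._model.transcribe(tmp.name)
        return result["text"]
```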
### 2. Modify AudioService to Use Edge-TTS for TTS
```python
# services/audio_service.py
from edge_tts import EdgeTTS  # Import your Edge-TTS implementation


class AudioService:
    @staticmethod
    def transcript_tts(app_model, text, voice, end_user, streaming):
        # Use Edge-TTS for text-to-speech
        edge_tts_model = EdgeTTS()
        audio_data = edge_tts_model.synthesize(text, voice)
        return {'data': audio_data}
```
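Similarly, `EdgeTTS` refers to your own wrapper; the `edge-tts` package itself exposes `Communicate` rather than an `EdgeTTS` class. One possible sketch that returns raw audio bytes, matching the `synthesize(text, voice)` call above (the synchronous interface is an assumption):
```python
# Hypothetical wrapper matching the `EdgeTTS().synthesize(text, voice)` call used above.
import asyncio

import edge_tts


class EdgeTTS:
    def synthesize(self, text: str, voice: str = "en-US-JennyNeural") -> bytes:
        return asyncio.run(self._synthesize(text, voice))

    async def _synthesize(self, text: str, voice: str) -> bytes:
        communicate = edge_tts.Communicate(text, voice)
        audio = bytearray()
        # Communicate.stream() yields chunks; only "audio" chunks carry audio bytes.
        async for chunk in communicate.stream():
            if chunk["type"] == "audio":
                audio.extend(chunk["data"])
        return bytes(audio)
```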
### 3. Ensure API Endpoints Use the Modified AudioService
Ensure that the AudioService methods are correctly called in the AudioApi and TextApi classes:
```python
# api/controllers/web/audio.py (excerpt; the module's existing imports are unchanged)
class AudioApi(WebApiResource):
    def post(self, app_model: App, end_user):
        file = request.files['file']
        try:
            response = AudioService.transcript_asr(
                app_model=app_model,
                file=file,
                end_user=end_user
            )
            return response
        except Exception as e:
            logging.exception(f"internal server error: {str(e)}")
            raise InternalServerError()


class TextApi(WebApiResource):
    def post(self, app_model: App, end_user):
        try:
            response = AudioService.transcript_tts(
                app_model=app_model,
                text=request.form['text'],
                end_user=end_user.external_user_id,
                voice=request.form['voice'] if request.form.get('voice') else app_model.app_model_config.text_to_speech_dict.get('voice'),
                streaming=False
            )
            return {'data': response['data'].decode('latin1')}
        except Exception as e:
            logging.exception(f"internal server error: {str(e)}")
            raise InternalServerError()


api.add_resource(AudioApi, '/audio-to-text')
api.add_resource(TextApi, '/text-to-audio')
```
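After wiring up the routes, you can smoke-test them from a short script. Here is a sketch using `requests`; the base URL and any authentication your Dify deployment enforces (for example an app token header) are assumptions you will need to adapt:
```python
# Sketch: exercise the /audio-to-text and /text-to-audio routes.
# The base URL and auth are assumptions; adapt them to your deployment.
import requests

BASE_URL = "http://localhost:5001/api"  # hypothetical API base

# Speech-to-text
with open("path/to/audio/file.wav", "rb") as f:
    stt_resp = requests.post(f"{BASE_URL}/audio-to-text", files={"file": f})
stt_resp.raise_for_status()
print("Transcript:", stt_resp.json())

# Text-to-speech; the handler above returns the audio bytes latin1-decoded in JSON.
tts_resp = requests.post(
    f"{BASE_URL}/text-to-audio",
    data={"text": "Hello from Dify", "voice": "en-US-JennyNeural"},
)
tts_resp.raise_for_status()
with open("reply.mp3", "wb") as f:
    f.write(tts_resp.json()["data"].encode("latin1"))
print("Audio saved to reply.mp3")
```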
### 4. Update TypeScript Service to Handle Requests
```typescript
// web/service/share.ts
export const audioToText = (url: string, isPublicAPI: boolean, body: FormData) => {
  return (getAction('post', !isPublicAPI))(url, { body }, { bodyStringify: false, deleteContentType: true }) as Promise<{ text: string }>
}

export const textToAudio = (url: string, isPublicAPI: boolean, body: FormData) => {
  return (getAction('post', !isPublicAPI))(url, { body }, { bodyStringify: false, deleteContentType: true }) as Promise<{ data: string }>
}
```
### 5. Ensure Configuration Supports Custom Models
Ensure your Dify platform configuration supports the integration of custom models for STT and TTS. Here's an example provider configuration:
```yaml
provider: azure_openai
label:
  en_US: Azure OpenAI Service Model
icon_small:
  en_US: icon_s_en.svg
icon_large:
  en_US: icon_l_en.png
background: "#E3F0FF"
help:
  title:
    en_US: Get your API key from Azure
    zh_Hans: 从 Azure 获取 API Key
  url:
    en_US: https://azure.microsoft.com/en-us/products/ai-services/openai-service
supported_model_types:
  - llm
  - text-embedding
  - speech2text
  - tts
configurate_methods:
  - customizable-model
model_credential_schema:
  model:
    label:
      en_US: Deployment Name
      zh_Hans: 部署名称
    placeholder:
      en_US: Enter your Deployment Name here, matching the Azure deployment name.
      zh_Hans: 在此输入您的部署名称,与 Azure 部署名称匹配。
  credential_form_schemas:
    - variable: openai_api_base
      label:
        en_US: API Endpoint URL
        zh_Hans: API 域名
      type: text-input
      required: true
      placeholder:
        zh_Hans: '在此输入您的 API 域名,如:https://example.com/xxx'
        en_US: 'Enter your API Endpoint, eg: https://example.com/xxx'
    - variable: openai_api_key
      label:
        en_US: API Key
        zh_Hans: API Key
      type: secret-input
      required: true
      placeholder:
        zh_Hans: 在此输入您的 API Key
        en_US: Enter your API key here
    - variable: openai_api_version
      label:
        zh_Hans: API 版本
        en_US: API Version
      type: select
      required: true
      options:
        - label:
            en_US: 2024-02-15-preview
          value: 2024-02-15-preview
        - label:
            en_US: 2023-12-01-preview
          value: 2023-12-01-preview
        - label:
            en_US: '2024-02-01'
          value: '2024-02-01'
      placeholder:
        zh_Hans: 在此选择您的 API 版本
        en_US: Select your API Version here
    - variable: base_model_name
      label:
        en_US: Base Model
        zh_Hans: 基础模型
      type: select
      required: true
      options:
        - label:
            en_US: gpt-35-turbo
          value: gpt-35-turbo
          show_on:
            - variable: __model_type
              value: llm
        - label:
            en_US: whisper-1
          value: whisper-1
          show_on:
            - variable: __model_type
              value: speech2text
        - label:
            en_US: tts-1
          value: tts-1
          show_on:
            - variable: __model_type
              value: tts
```
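The `openai_api_base` field in this configuration simply points Dify at an HTTP endpoint, so one way to reuse your existing Whisper and edge-tts code is to expose them behind an OpenAI-style facade and enter that URL in the provider form. A rough, hypothetical sketch with FastAPI follows; the route paths, field names, and port are assumptions modelled on the OpenAI audio API, not something Dify ships:
```python
# Hypothetical OpenAI-style facade that a custom `openai_api_base` could point to.
# Assumes `pip install fastapi uvicorn edge-tts openai-whisper`.
import io
import tempfile

import edge_tts
import uvicorn
import whisper
from fastapi import FastAPI, File, Form, UploadFile
from fastapi.responses import Response

app = FastAPI()
stt_model = whisper.load_model("base")  # model size is an arbitrary choice

@app.post("/v1/audio/transcriptions")
async def transcriptions(file: UploadFile = File(...), model: str = Form("whisper-1")):
    # Persist the upload so whisper can read it from a path.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        result = stt_model.transcribe(tmp.name)
    return {"text": result["text"]}

@app.post("/v1/audio/speech")
async def speech(input: str = Form(...), voice: str = Form("en-US-JennyNeural")):
    # Stream edge-tts audio chunks into a buffer and return them as MP3 bytes.
    communicate = edge_tts.Communicate(input, voice)
    buf = io.BytesIO()
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            buf.write(chunk["data"])
    return Response(content=buf.getvalue(), media_type="audio/mpeg")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```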
By following these steps, you can integrate your Whisper and Edge-TTS services into the Dify platform [1][2][3][4][5].
To continue talking to Dosu, mention @dosu.