
How can I integrate a local TTS API and STT API?

zhanghuanhuanlive opened this issue 1 year ago · 3 comments

Self Checks

  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] Please do not modify this template :) and fill in all the required fields.

Provide a description of requested docs changes

How can I integrate a local TTS API and STT API, such as edge-tts and whisper? Could somebody show me the way, please?

zhanghuanhuanlive · May 30 '24 09:05

Hey @zhanghuanhuanlive! 👋 I'm here to help you with any bugs, questions, or contributions. I'm a bot and I can assist you while you wait for a human maintainer.

To integrate a local TTS API like edge-tts and a local STT API like whisper in your project, you can follow this example:

Integrating whisper for Speech-to-Text (STT)

First, create a function to handle the STT using whisper:

import requests
from yarl import URL

class LocalAISpeech2text:
    def __init__(self, server_url: str):
        self.server_url = server_url

    def transcribe_audio(self, model: str, audio_file_path: str) -> str:
        # Build the OpenAI-compatible transcription endpoint URL
        url = str(URL(self.server_url) / "v1/audio/transcriptions")
        data = {"model": model}

        # Open the file in a context manager so the handle is always closed
        with open(audio_file_path, "rb") as audio_file:
            response = requests.post(url, data=data, files={"file": audio_file})

        response.raise_for_status()
        result = response.json()

        if "error" in result:
            raise Exception("Error in transcription: " + result["error"])

        return result["text"]

# Example usage
stt = LocalAISpeech2text(server_url="http://localhost:5000")
transcription = stt.transcribe_audio(model="whisper", audio_file_path="path/to/audio/file.wav")
print(transcription)
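
Note that the class above assumes an OpenAI-compatible transcription server (such as LocalAI) is listening at the given URL; nothing in it is whisper-specific. If you would rather run whisper in-process instead of behind an HTTP server, a minimal sketch using the open-source openai-whisper package looks like this (the "base" model size is just an example):

import whisper  # pip install openai-whisper

# Load the model once at startup; larger models are slower but more accurate
model = whisper.load_model("base")

def transcribe_local(audio_file_path: str) -> str:
    # transcribe() accepts a file path and returns a dict containing the recognized text
    result = model.transcribe(audio_file_path)
    return result["text"]

print(transcribe_local("path/to/audio/file.wav"))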

Integrating edge-tts for Text-to-Speech (TTS)

Next, create a function to handle the TTS using edge-tts:

import edge_tts

async def text_to_speech(text: str, output_file: str):
    # Communicate wraps a single synthesis request; the second argument selects the voice
    communicate = edge_tts.Communicate(text, "en-US-JennyNeural")
    # save() performs the request and writes the resulting MP3 audio to disk
    await communicate.save(output_file)

# Example usage
import asyncio

text = "Hello, this is a test."
output_file = "output_audio.mp3"
asyncio.run(text_to_speech(text, output_file))

Putting It All Together

Create a simple script that uses both the STT and TTS functions:

import asyncio

# STT part
stt = LocalAISpeech2text(server_url="http://localhost:5000")
transcription = stt.transcribe_audio(model="whisper", audio_file_path="path/to/audio/file.wav")
print("Transcription:", transcription)

# TTS part
text = transcription  # or any other text
output_file = "output_audio.mp3"
asyncio.run(text_to_speech(text, output_file))
print(f"Audio saved to {output_file}")

This script will transcribe an audio file using whisper and then convert the transcribed text back to speech using edge-tts. Adjust the server URL, model names, and file paths as needed for your specific setup [1][2][3].

To continue talking to Dosu, mention @dosu.

dosubot[bot] · May 30 '24 09:05

Thanks! Actually, what I want to ask is how to integrate my already developed Whisper and Edge-TTS into the Dify platform for use.

zhanghuanhuanlive · May 30 '24 13:05

To integrate your already developed Whisper (STT) and Edge-TTS (TTS) into the Dify platform, you need to modify the AudioService class to use your implementations. Here’s how you can do it:

1. Modify AudioService to Use Whisper for STT

# services/audio_service.py

from whisper import Whisper  # Import your Whisper implementation

class AudioService:
    @staticmethod
    def transcript_asr(app_model, file, end_user):
        # Use Whisper for speech-to-text
        whisper_model = Whisper()
        transcript = whisper_model.transcribe(file)
        return {'transcript': transcript}
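
For reference, if your Whisper implementation is the open-source openai-whisper package rather than a custom wrapper, the uploaded file has to be spooled to disk first, because whisper transcribes from a file path. A sketch under that assumption (the model size and temp-file handling are illustrative only):

# services/audio_service.py — sketch assuming the open-source openai-whisper package
import tempfile

import whisper

# Load once at module import; "base" is an example model size
_whisper_model = whisper.load_model("base")

class AudioService:
    @staticmethod
    def transcript_asr(app_model, file, end_user):
        # whisper transcribes from a path, so write the uploaded bytes to a temp file
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            tmp.write(file.read())
            tmp.flush()
            result = _whisper_model.transcribe(tmp.name)
        return {'transcript': result['text']}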

2. Modify AudioService to Use Edge-TTS for TTS

# services/audio_service.py (add this method to the same AudioService class as above)

from edge_tts import EdgeTTS  # Import your Edge-TTS implementation

class AudioService:
    @staticmethod
    def transcript_tts(app_model, text, voice, end_user, streaming):
        # Use Edge-TTS for text-to-speech
        edge_tts_model = EdgeTTS()
        audio_data = edge_tts_model.synthesize(text, voice)
        return {'data': audio_data}
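
If you are using the real edge-tts package rather than a custom wrapper, keep in mind that its API is async (and that it calls Microsoft's hosted speech service, so it is a local library but not a fully offline engine). A synchronous service method therefore has to bridge into an event loop; a minimal sketch under those assumptions:

# services/audio_service.py — sketch using the edge-tts package directly
import asyncio
import io

import edge_tts

class AudioService:
    @staticmethod
    def transcript_tts(app_model, text, voice, end_user, streaming):
        async def _synthesize() -> bytes:
            # Fall back to a default voice if none was configured
            communicate = edge_tts.Communicate(text, voice or "en-US-JennyNeural")
            buffer = io.BytesIO()
            # stream() yields dict chunks; "audio" chunks carry the MP3 bytes
            async for chunk in communicate.stream():
                if chunk["type"] == "audio":
                    buffer.write(chunk["data"])
            return buffer.getvalue()

        audio_data = asyncio.run(_synthesize())
        return {'data': audio_data}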

3. Ensure API Endpoints Use the Modified AudioService

Ensure that the AudioService methods are correctly called in the AudioApi and TextApi classes.

# api/controllers/web/audio.py

class AudioApi(WebApiResource):
    def post(self, app_model: App, end_user):
        file = request.files['file']

        try:
            response = AudioService.transcript_asr(
                app_model=app_model,
                file=file,
                end_user=end_user
            )
            return response
        except Exception as e:
            logging.exception(f"internal server error: {str(e)}")
            raise InternalServerError()

class TextApi(WebApiResource):
    def post(self, app_model: App, end_user):
        try:
            response = AudioService.transcript_tts(
                app_model=app_model,
                text=request.form['text'],
                end_user=end_user.external_user_id,
                voice=request.form['voice'] if request.form.get('voice') else app_model.app_model_config.text_to_speech_dict.get('voice'),
                streaming=False
            )
            return {'data': response['data'].decode('latin1')}
        except Exception as e:
            logging.exception(f"internal server error: {str(e)}")
            raise InternalServerError()

api.add_resource(AudioApi, '/audio-to-text')
api.add_resource(TextApi, '/text-to-audio')
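
With the resources registered, you can smoke-test both routes over plain HTTP. The base URL below and the absence of authentication are assumptions; a real Dify deployment sits behind its own routing and auth:

import requests

# Hypothetical local endpoint prefix; adjust host, port, and auth for your deployment
BASE = "http://localhost:5001/api"

# Speech-to-text: upload an audio file
with open("path/to/audio/file.wav", "rb") as f:
    resp = requests.post(f"{BASE}/audio-to-text", files={"file": f})
print(resp.json())

# Text-to-speech: send the text (and optionally a voice) as form data
resp = requests.post(f"{BASE}/text-to-audio",
                     data={"text": "Hello from Dify", "voice": "en-US-JennyNeural"})
print(resp.json()["data"][:32])  # latin1-decoded audio bytes, per the handler above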

4. Update TypeScript Service to Handle Requests

// web/service/share.ts

export const audioToText = (url: string, isPublicAPI: boolean, body: FormData) => {
  return (getAction('post', !isPublicAPI))(url, { body }, { bodyStringify: false, deleteContentType: true }) as Promise<{ text: string }>
}

export const textToAudio = (url: string, isPublicAPI: boolean, body: FormData) => {
  return (getAction('post', !isPublicAPI))(url, { body }, { bodyStringify: false, deleteContentType: true }) as Promise<{ data: string }>
}

5. Ensure Configuration Supports Custom Models

Ensure your Dify platform configuration supports the integration of custom models for STT and TTS. Here is an example provider configuration (Azure OpenAI's) that declares speech2text and tts among its supported model types:

provider: azure_openai
label:
  en_US: Azure OpenAI Service Model
icon_small:
  en_US: icon_s_en.svg
icon_large:
  en_US: icon_l_en.png
background: "#E3F0FF"
help:
  title:
    en_US: Get your API key from Azure
    zh_Hans: 从 Azure 获取 API Key
  url:
    en_US: https://azure.microsoft.com/en-us/products/ai-services/openai-service
supported_model_types:
  - llm
  - text-embedding
  - speech2text
  - tts
configurate_methods:
  - customizable-model
model_credential_schema:
  model:
    label:
      en_US: Deployment Name
      zh_Hans: 部署名称
    placeholder:
      en_US: Enter your Deployment Name here, matching the Azure deployment name.
      zh_Hans: 在此输入您的部署名称,与 Azure 部署名称匹配。
  credential_form_schemas:
    - variable: openai_api_base
      label:
        en_US: API Endpoint URL
        zh_Hans: API 域名
      type: text-input
      required: true
      placeholder:
        zh_Hans: '在此输入您的 API 域名,如:https://example.com/xxx'
        en_US: 'Enter your API Endpoint, eg: https://example.com/xxx'
    - variable: openai_api_key
      label:
        en_US: API Key
        zh_Hans: API Key
      type: secret-input
      required: true
      placeholder:
        zh_Hans: 在此输入您的 API Key
        en_US: Enter your API key here
    - variable: openai_api_version
      label:
        zh_Hans: API 版本
        en_US: API Version
      type: select
      required: true
      options:
        - label:
            en_US: 2024-02-15-preview
          value: 2024-02-15-preview
        - label:
            en_US: 2023-12-01-preview
          value: 2023-12-01-preview
        - label:
            en_US: '2024-02-01'
          value: '2024-02-01'
      placeholder:
        zh_Hans: 在此选择您的 API 版本
        en_US: Select your API Version here
    - variable: base_model_name
      label:
        en_US: Base Model
        zh_Hans: 基础模型
      type: select
      required: true
      options:
        - label:
            en_US: gpt-35-turbo
          value: gpt-35-turbo
          show_on:
            - variable: __model_type
              value: llm
        - label:
            en_US: whisper-1
          value: whisper-1
          show_on:
            - variable: __model_type
              value: speech2text
        - label:
            en_US: tts-1
          value: tts-1
          show_on:
            - variable: __model_type
              value: tts

By following these steps, you can integrate your Whisper and Edge-TTS services into the Dify platform for use [1][2][3][4][5].

To continue talking to Dosu, mention @dosu.

dosubot[bot] · May 30 '24 13:05