
How do I add speech-to-text and text-to-speech models when using xinference?

Open andylzming opened this issue 1 year ago • 3 comments

Dify version

0.5.2

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

How do I add speech-to-text and text-to-speech models when using xinference? Xinference already supports image and audio models.

(Screenshots attached.)

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

andylzming avatar Feb 02 '24 08:02 andylzming

🤖

Hey @andylzming! Great to see you back here, diving into more adventures with dify. 🚀

To add speech-to-text and text-to-speech models when using xinference in a self-hosted Docker environment with Dify version 0.5.2, you need to extend the provided TongyiText2SpeechModel class and implement the required methods.

Here is a simplified example of how you can do this:

from typing import Any, Optional

from flask import Response  # assuming a Flask-based deployment; adjust to your framework


class MyText2SpeechModel(TongyiText2SpeechModel):
    def _invoke(self, model: str, credentials: dict, content_text: str, streaming: bool, user: Optional[str] = None) -> Any:
        # Your implementation here
        pass

    def validate_credentials(self, model: str, credentials: dict, user: Optional[str] = None) -> None:
        # Your implementation here
        pass

    def _tts_invoke(self, model: str, credentials: dict, content_text: str, user: Optional[str] = None) -> Response:
        # Your implementation here
        pass

    def _tts_invoke_streaming(self, model: str, credentials: dict, content_text: str, user: Optional[str] = None) -> Any:
        # Your implementation here
        pass

    def _process_sentence(self, sentence: str, model: str, credentials: dict, audio_type: str) -> None:
        # Your implementation here
        pass

You need to replace the # Your implementation here comments with your own code. The _invoke method is the entry point for invoking the text-to-speech model; _tts_invoke and _tts_invoke_streaming convert the text to speech (buffered and streamed, respectively); _process_sentence processes each sentence individually; and validate_credentials validates the credentials for the text-to-speech model.
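To make the division of labor between these methods concrete, here is a hedged, self-contained sketch of how _invoke might dispatch between the buffered and streaming implementations. The class is a toy stand-in (not Dify's actual class), and the byte-string payloads are placeholders for real synthesized audio:

```python
from typing import Iterator, Optional


class SketchText2SpeechModel:
    """Toy stand-in illustrating the dispatch pattern; not Dify's real class."""

    def _invoke(self, model: str, credentials: dict, content_text: str,
                streaming: bool, user: Optional[str] = None):
        # Route to the streaming or buffered implementation based on the flag.
        if streaming:
            return self._tts_invoke_streaming(model, credentials, content_text, user)
        return self._tts_invoke(model, credentials, content_text, user)

    def _tts_invoke(self, model: str, credentials: dict, content_text: str,
                    user: Optional[str] = None) -> bytes:
        # Placeholder: a real implementation would return the full audio payload.
        return b"full-audio"

    def _tts_invoke_streaming(self, model: str, credentials: dict, content_text: str,
                              user: Optional[str] = None) -> Iterator[bytes]:
        # Placeholder: a real implementation would yield audio chunks as produced.
        yield b"chunk-1"
        yield b"chunk-2"


tts = SketchText2SpeechModel()
audio = tts._invoke("my-tts", {}, "hello", streaming=False)       # full payload
chunks = list(tts._invoke("my-tts", {}, "hello", streaming=True))  # chunk list
```

The point of the pattern is that callers only ever use _invoke; whether audio arrives as one payload or as a generator of chunks is decided by the streaming flag.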

Once you have implemented these methods, you can use your new class in your Docker environment. Make sure to provide the necessary credentials and other parameters when invoking the model.
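As a sketch of what "provide the necessary credentials" might look like, here is a minimal, standalone validate_credentials-style check. The credential keys (server_url, model_uid) are assumptions about how a self-hosted Xinference endpoint might be addressed, not a confirmed Dify schema, and the exception class is defined here for illustration:

```python
class CredentialsValidateFailedError(Exception):
    """Raised when required credentials are missing or empty."""


def validate_credentials(credentials: dict) -> None:
    # Check that each assumed key is present and non-empty; a fuller
    # implementation would also attempt a small test call to the endpoint.
    for key in ("server_url", "model_uid"):
        if not credentials.get(key):
            raise CredentialsValidateFailedError(f"missing credential: {key!r}")


# Passes silently with both assumed keys supplied:
validate_credentials({"server_url": "http://localhost:9997", "model_uid": "my-tts-model"})
```

Failing fast on missing credentials keeps configuration errors out of the invocation path, which is why provider classes typically expose a dedicated validation hook.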

You can find more details about the TongyiText2SpeechModel class in the dify codebase.

Please note that I'm unable to view the images you've attached. If you need further assistance, could you please provide more details or context in text form?

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] avatar Feb 02 '24 08:02 dosubot[bot]

See https://docs.dify.ai/tutorials/model-configuration/xinference#starting-xinference for more details. Please use the correct template when filing a bug report.

crazywoola avatar Feb 02 '24 09:02 crazywoola

@crazywoola Is the functionality of adding image models and audio models (including speech-to-text and text-to-speech) through xinference planned for the next version?

andylzming avatar Feb 05 '24 07:02 andylzming

We might integrate these models in our spare time, though we're not sure. If you're interested, you're welcome to contribute, and we would really appreciate it!

takatost avatar Feb 07 '24 08:02 takatost

#3045

leslie2046 avatar Mar 30 '24 14:03 leslie2046