Request to Modify Code to Enable TEXT_SPLITTER_EMBEDDING_MODEL Customization through Configuration File

Open shawn-z11 opened this issue 1 year ago • 1 comment

I am looking to create a Chinese RAG demo service using RetrievalAugmentedGeneration.

However, I encountered an issue where the default SentenceTransformersTokenTextSplitter model used in the RetrievalAugmentedGeneration/common/utils.py file is hardcoded as 'intfloat/e5-large-v2'. This model generates a significant number of [UNK] tokens when processing Chinese text.
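One way to quantify the [UNK] problem described above is to tokenize a sample and count how many tokens come back as the unknown token. This is only a rough sketch: it assumes the splitter's tokenizer can be loaded with `transformers.AutoTokenizer`, and the helper names `unk_ratio` and `unk_ratio_for_model` are made up for illustration.

```python
from typing import Optional, Sequence


def unk_ratio(tokens: Sequence[str], unk_token: Optional[str] = "[UNK]") -> float:
    """Return the fraction of tokens equal to the unknown token (0.0 for empty input)."""
    if not tokens or unk_token is None:
        return 0.0
    unknown = sum(1 for token in tokens if token == unk_token)
    return unknown / len(tokens)


def unk_ratio_for_model(model_name: str, text: str) -> float:
    """Tokenize `text` with the named model's tokenizer and report its unknown-token ratio.

    Hypothetical helper: assumes `transformers` is installed and the model
    can be downloaded or is already cached locally.
    """
    from transformers import AutoTokenizer  # imported lazily; optional dependency

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return unk_ratio(tokenizer.tokenize(text), tokenizer.unk_token)
```

Calling `unk_ratio_for_model("intfloat/e5-large-v2", some_chinese_text)` and comparing the result with a multilingual model would show whether the hardcoded tokenizer is the bottleneck for a given corpus.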

I would like to be able to choose the model used by the text splitter, the same way the embedding model can already be set in the config.yaml file.
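A minimal sketch of what this could look like, not the project's actual code: the model name is resolved from an environment variable, then from the parsed config, and only then falls back to the current hardcoded value. The `text_splitter.model_name` config key, the `APP_TEXT_SPLITTER_MODEL` variable, and both function names are assumptions made for illustration. The import path `langchain.text_splitter` is also an assumption about the installed LangChain version.

```python
import os
from typing import Any, Mapping

# The value the issue reports as hardcoded in common/utils.py.
DEFAULT_SPLITTER_MODEL = "intfloat/e5-large-v2"


def resolve_splitter_model(config: Mapping[str, Any]) -> str:
    """Pick the splitter model: environment variable, then config, then default.

    `config` is the parsed config.yaml as a dict. The key names and the
    environment variable are hypothetical.
    """
    from_env = os.environ.get("APP_TEXT_SPLITTER_MODEL")
    if from_env:
        return from_env
    section = config.get("text_splitter") or {}
    return section.get("model_name") or DEFAULT_SPLITTER_MODEL


def get_text_splitter(config: Mapping[str, Any], chunk_overlap: int = 50):
    """Build the token splitter with the resolved model instead of a fixed one."""
    # Imported lazily so the resolution logic works without LangChain installed.
    from langchain.text_splitter import SentenceTransformersTokenTextSplitter

    return SentenceTransformersTokenTextSplitter(
        model_name=resolve_splitter_model(config),
        chunk_overlap=chunk_overlap,
    )
```

With this in place, a config.yaml entry such as `text_splitter: {model_name: intfloat/multilingual-e5-large}` (one example of a multilingual model) would switch the tokenizer without touching the code, and deployments that set nothing would keep today's behavior.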

Thank you for your assistance and support.

Jan 17 '24 06:01 shawn-z11

Have you tried abstraction or refactoring? Discourse

Feb 17 '24 15:02 SartajHundal