split-lang
✨ Split text by languages (e.g. 你喜欢看アニメ吗 -> 你喜欢看 | アニメ | 吗) for NLP tasks (e.g. parse, TTS). Powered by fasttext and langua
English | 中文简体 | 日本語
Splits text by language: the text is first over-split into small substrings, which are then concatenated based on their detected language. Powered by:
- splitting: budoux and rule-based splitting
- language detection: fast-langdetect and wordfreq
1. 💡How it works
Stage 1: rule-based split (separate character, punctuation and digit)
hello, how are you->hello|,|how are you
Stage 2: over-split the text into substrings, using budoux for mixed Chinese and Japanese text and spaces for languages that are not written in scripta continua
你喜欢看アニメ吗->你|喜欢|看|アニメ|吗
昨天見た映画はとても感動的でした->昨天|見た|映画|は|とても|感動|的|で|した
how are you->how|are|you
Stage 3: concatenate substrings based on their languages using fast-langdetect, wordfreq and regex (rule-based)
你|喜欢|看|アニメ|吗->你喜欢看|アニメ|吗
昨天|見た|映画|は|とても|感動|的|で|した->昨天|見た映画はとても感動的でした
how|are|you->how are you
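Stages 1 and 3 can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the package's implementation: `char_class`, `rule_based_split`, `toy_detect`, and `merge_by_language` are hypothetical names, and `toy_detect` guesses the language from Unicode ranges instead of calling fast-langdetect or wordfreq, so it cannot tell Chinese from Japanese text written only in kanji.

```python
import unicodedata
from itertools import groupby


def char_class(ch: str) -> str:
    """Classify one character as digit, punctuation, or ordinary text."""
    if ch.isdigit():
        return "digit"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    return "text"  # letters, ideographs, and whitespace


def rule_based_split(text: str) -> list[str]:
    """Stage 1 sketch: cut the text wherever the character class changes."""
    pieces = ("".join(group).strip() for _, group in groupby(text, key=char_class))
    return [piece for piece in pieces if piece]


def toy_detect(piece: str) -> str:
    """Hypothetical stand-in for a real detector: looks at Unicode ranges only."""
    if any("\u3040" <= ch <= "\u30ff" for ch in piece):  # hiragana or katakana
        return "ja"
    if any("\u4e00" <= ch <= "\u9fff" for ch in piece):  # CJK unified ideographs
        return "zh"
    return "en"


def merge_by_language(pieces: list[str]) -> list[tuple[str, str]]:
    """Stage 3 sketch: join neighbouring pieces that share a detected language."""
    merged = []
    for lang, group in groupby(pieces, key=toy_detect):
        separator = " " if lang == "en" else ""  # restore spaces for English only
        merged.append((lang, separator.join(group)))
    return merged


print(rule_based_split("hello, how are you"))
# ['hello', ',', 'how are you']
print(merge_by_language(["你", "喜欢", "看", "アニメ", "吗"]))
# [('zh', '你喜欢看'), ('ja', 'アニメ'), ('zh', '吗')]
print(merge_by_language(["how", "are", "you"]))
# [('en', 'how are you')]
```

The second Stage 3 example above (昨天|見た|映画|...) is exactly the case where a range-based guess is not enough, which is why the package relies on real language detection and word frequencies.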
2. 🪨Motivation
- TTS (Text-To-Speech) models often fail at multi-language speech generation. There are two ways to address this:
  - Train a model that can pronounce multiple languages
  - (This package) Separate the sentence by language first, then use a different model for each language
- Existing models in NLP toolkits (e.g. SpaCy, jieba) are usually designed to handle text in ONE language per model, which means multi-language texts like the ones below need pre-processing:
你喜欢看アニメ吗?
Vielen Dank merci beaucoup for your help.
你最近好吗、最近どうですか?요즘 어떻게 지내요?sky is clear and sunny。
- 1. 💡How it works
- 2. 🪨Motivation
- 3. 📕Usage
  - 3.1. 🚀Installation
  - 3.2. Basic
    - 3.2.1. split_by_lang
    - 3.2.2. merge_across_digit
  - 3.3. Advanced
    - 3.3.1. usage of lang_map and default_lang (for your languages)
- 4. Acknowledgement
- 5. ✨Star History
3. 📕Usage
3.1. 🚀Installation
You can install the package using pip:
pip install split-lang
3.2. Basic
3.2.1. split_by_lang
from split_lang import LangSplitter

lang_splitter = LangSplitter()
text = "你喜欢看アニメ吗"
substr = lang_splitter.split_by_lang(
    text=text,
)
# Each item exposes the detected language and the matching substring
for index, item in enumerate(substr):
    print(f"{index}|{item.lang}:{item.text}")
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗
import time

from split_lang import LangSplitter

lang_splitter = LangSplitter(merge_across_punctuation=True)
texts = [
    "你喜欢看アニメ吗?我也喜欢看",
    "Please star this project on GitHub, Thanks you. I love you请加星这个项目,谢谢你。我爱你この項目をスターしてください、ありがとうございます!愛してる",
]
time1 = time.time()
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
    print("----------------------")
time2 = time.time()
# Elapsed time in seconds for splitting both texts
print(time2 - time1)
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗?我也喜欢看
----------------------
0|en:Please star this project on GitHub, Thanks you. I love you
1|zh:请加星这个项目,谢谢你。我爱你
2|ja:この項目をスターしてください、ありがとうございます!愛してる
----------------------
0.007998466491699219
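Because each item carries a `lang` and a `text` attribute, the result can be dispatched to per-language models, which is the TTS use case from the Motivation section. A minimal sketch, assuming hypothetical names throughout: `Piece` is a stand-in for the items returned by `split_by_lang` (so the example runs without the package installed), and `dispatch` and the handler functions are placeholders for real TTS calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Piece:
    """Hypothetical stand-in exposing the two attributes used above."""
    lang: str
    text: str


def dispatch(
    pieces: list[Piece],
    handlers: dict[str, Callable[[str], str]],
    fallback: Callable[[str], str],
) -> list[str]:
    """Send each substring to the handler registered for its language."""
    return [handlers.get(piece.lang, fallback)(piece.text) for piece in pieces]


# Placeholder handlers; a real pipeline would call a TTS model here.
handlers = {
    "zh": lambda text: f"[zh-voice] {text}",
    "ja": lambda text: f"[ja-voice] {text}",
}
pieces = [Piece("zh", "你喜欢看"), Piece("ja", "アニメ"), Piece("zh", "吗")]

for line in dispatch(pieces, handlers, fallback=lambda text: f"[default-voice] {text}"):
    print(line)
# [zh-voice] 你喜欢看
# [ja-voice] アニメ
# [zh-voice] 吗
```

Keeping the handlers in a dictionary keyed by language code means a new language only needs one more entry, and anything unexpected falls through to the fallback.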
3.2.2. merge_across_digit
# Keep digits as their own substrings
lang_splitter.merge_across_digit = False
texts = [
    "衬衫的价格是9.15便士",
]
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
0|zh:衬衫的价格是
1|digit:9.15
2|zh:便士
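The output above shows digits kept as their own `digit` substring when `merge_across_digit` is `False`. The sketch below illustrates what merging across digits could look like. It is an assumption about the idea, not the package's implementation or a record of its actual output: `fold_digits` is a hypothetical helper operating on plain `(lang, text)` tuples, and attaching a digit run to the preceding substring is a choice made for this example.

```python
def fold_digits(pieces: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Fold 'digit' pieces into a neighbour, then rejoin equal-language neighbours."""
    merged: list[tuple[str, str]] = []
    for lang, text in pieces:
        if merged:
            prev_lang, prev_text = merged[-1]
            if lang == "digit" or prev_lang in ("digit", lang):
                # Keep the non-digit language label when there is one.
                new_lang = lang if prev_lang == "digit" else prev_lang
                merged[-1] = (new_lang, prev_text + text)
                continue
        merged.append((lang, text))
    return merged


pieces = [("zh", "衬衫的价格是"), ("digit", "9.15"), ("zh", "便士")]
print(fold_digits(pieces))
# [('zh', '衬衫的价格是9.15便士')]
```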
3.3. Advanced
3.3.1. usage of lang_map and default_lang (for your languages)
[!IMPORTANT] Add language codes for your use case if other languages are needed. See Support Language
- the default lang_map is shown below
  - if langua-py, fasttext, or any other language detector detects a language that is NOT included in lang_map, the language is set to default_lang
  - if you set default_lang, or the value of a key:value pair in lang_map, to x, the substring is merged into a neighboring substring
    - zh|x|ja->zh|ja (x has been merged into one side)
    - In the example below, zh-tw is set to x because characters in zh and ja are sometimes detected as Traditional Chinese
- the default default_lang is x
DEFAULT_LANG_MAP = {
    "zh": "zh",
    "yue": "zh",  # Cantonese
    "wuu": "zh",  # Wu Chinese
    "zh-cn": "zh",
    "zh-tw": "x",
    "ko": "ko",
    "ja": "ja",
    "de": "de",
    "fr": "fr",
    "en": "en",
    "hr": "en",
}
DEFAULT_LANG = "x"
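The rules above can be sketched as two small steps: map the raw detector label through `lang_map` with `default_lang` as the fallback, then fold every `x` substring into a neighbor. This is a minimal illustration, not the package's code. `normalize_lang` and `merge_unknown` are hypothetical helpers, `LANG_MAP` is a shortened copy of the map above, and merging `x` into the preceding substring (or the following one at the start of the text) is an assumption, since the description only says it is merged into one side.

```python
LANG_MAP = {"zh": "zh", "zh-cn": "zh", "zh-tw": "x", "ja": "ja", "en": "en"}
DEFAULT_LANG = "x"


def normalize_lang(detected: str, lang_map=LANG_MAP, default_lang=DEFAULT_LANG) -> str:
    """Map a raw detector label to a configured one, or fall back to the default."""
    return lang_map.get(detected, default_lang)


def merge_unknown(pieces: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Fold every 'x' piece into the previous piece, or the next one at the start."""
    merged: list[tuple[str, str]] = []
    pending = ""  # text of leading 'x' pieces waiting for a labelled neighbour
    for lang, text in pieces:
        if lang == "x":
            if merged:
                prev_lang, prev_text = merged[-1]
                merged[-1] = (prev_lang, prev_text + text)
            else:
                pending += text
        else:
            merged.append((lang, pending + text))
            pending = ""
    if pending:  # every piece was 'x', so there is nothing to merge into
        merged.append(("x", pending))
    return merged


# Raw labels as a detector might report them (made-up input for illustration).
raw = [("zh-cn", "今天"), ("zh-tw", "天氣"), ("ja", "いいですね")]
normalized = [(normalize_lang(lang), text) for lang, text in raw]
print(normalized)
# [('zh', '今天'), ('x', '天氣'), ('ja', 'いいですね')]
print(merge_unknown(normalized))
# [('zh', '今天天氣'), ('ja', 'いいですね')]
```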
4. Acknowledgement
- Inspired by LlmKira/fast-langdetect
- Text segmentation depends on google/budoux
- Language detection depends on zafercavdar/fasttext-langdetect and rspeer/wordfreq