split-lang

English | 中文简体 | 日本語

Split text by language: over-split the text into substrings, then concatenate adjacent substrings that share a language. Powered by:

- splitting: budoux and rule-based splitting
- language detection: fast-langdetect and wordfreq


1. 💡How it works

Stage 1: rule-based split (separate characters, punctuation and digits)

hello, how are you -> hello | , | how are you
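
A minimal sketch of what a rule-based first pass could look like, using only the standard library. The pattern and the name `rule_based_split` are assumptions made for illustration; they are not taken from the package's source.

```python
import re

# Three alternatives that together cover every character:
# a number (optionally with a decimal part), a single punctuation mark,
# or a run of letters and whitespace.
TOKEN_PATTERN = re.compile(
    r"(?P<digit>\d+(?:\.\d+)?)"
    r"|(?P<punctuation>[^\w\s])"
    r"|(?P<text>(?:[^\W\d]|\s)+)"
)


def rule_based_split(text):
    """Split text into (kind, piece) pairs: 'text', 'punctuation' or 'digit'."""
    pieces = []
    for match in TOKEN_PATTERN.finditer(text):
        piece = match.group().strip()
        if piece:  # skip runs that were only whitespace
            pieces.append((match.lastgroup, piece))
    return pieces


for kind, piece in rule_based_split("hello, how are you"):
    print(f"{kind}:{piece}")
```

The three named groups keep the category of each piece, so a later stage can treat punctuation and digits differently from ordinary text.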

Stage 2: over-split text into substrings, using budoux for Chinese mixed with Japanese and whitespace for languages that are not scripta continua

你喜欢看アニメ吗 -> 你 | 喜欢 | 看 | アニメ | 吗
昨天見た映画はとても感動的でした -> 昨天 | 見た | 映画 | は | とても | 感動 | 的 | で | した
how are you -> how | are | you
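
The over-splitting idea can be sketched without any dependency. This is a stand-in only: the package relies on budoux for phrase-level segmentation of Chinese and Japanese, whereas the hypothetical `over_split` below simply cuts CJK text into single characters, so its output is finer-grained than the examples above.

```python
import re

# Rough check for scripts written without spaces (Hiragana, Katakana, Han).
CJK_PATTERN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def over_split(text):
    """Over-split text into small substrings.

    Space-delimited text is split on whitespace. Text containing CJK
    characters is split into single characters, as a crude substitute
    for a real segmenter such as budoux.
    """
    if CJK_PATTERN.search(text):
        return [ch for ch in text if not ch.isspace()]
    return text.split()


print(" | ".join(over_split("how are you")))
print(" | ".join(over_split("你喜欢看アニメ吗")))
```

Splitting too finely is acceptable at this stage, because the next stage concatenates neighbouring substrings again.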

Stage 3: concatenate substrings based on their languages using fast-langdetect, wordfreq and regex (rule-based)

你 | 喜欢 | 看 | アニメ | 吗 -> 你喜欢看 | アニメ | 吗
昨天 | 見た | 映画 | は | とても | 感動 | 的 | で | した -> 昨天 | 見た映画はとても感動的でした
how | are | you -> how are you
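
The concatenation step can be sketched as follows, assuming each substring already carries a language label. In the package the labels come from fast-langdetect, wordfreq and rules; here they are supplied by hand, and `concat_by_lang` is a hypothetical helper name.

```python
from itertools import groupby


def concat_by_lang(labeled, sep=""):
    """Join neighbouring (lang, text) pairs that share the same language."""
    merged = []
    for lang, group in groupby(labeled, key=lambda item: item[0]):
        merged.append((lang, sep.join(text for _, text in group)))
    return merged


labeled = [("zh", "你"), ("zh", "喜欢"), ("zh", "看"), ("ja", "アニメ"), ("zh", "吗")]
for lang, text in concat_by_lang(labeled):
    print(f"{lang}:{text}")

# Space-delimited languages need the separator put back when joining.
print(concat_by_lang([("en", "how"), ("en", "are"), ("en", "you")], sep=" "))
```

`groupby` only groups consecutive items, which is the behaviour wanted here: the two `zh` runs around `アニメ` stay separate.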

2. 🪨Motivation

TTS (Text-To-Speech) models often fail at multi-language speech generation. There are two ways to address this:
- Train a model that can pronounce multiple languages
- (This package) Split the sentence by language first, then use a different model for each language

Existing NLP toolkits (e.g. SpaCy, jieba) are usually helpful for text in ONE language per model, which means multi-language text needs pre-processing, like the texts below:

你喜欢看アニメ吗？
Vielen Dank merci beaucoup for your help.
你最近好吗、最近どうですか？요즘 어떻게 지내요？sky is clear and sunny。

1. 💡How it works
2. 🪨Motivation
3. 📕Usage
- 3.1. 🚀Installation
- 3.2. Basic
  - 3.2.1. split_by_lang
  - 3.2.2. merge_across_digit
- 3.3. Advanced
  - 3.3.1. usage of lang_map and default_lang (for your languages)
4. Acknowledgement
5. ✨Star History

3. 📕Usage

3.1. 🚀Installation

You can install the package using pip:

pip install split-lang

3.2. Basic

3.2.1. `split_by_lang`

from split_lang import LangSplitter
lang_splitter = LangSplitter()
text = "你喜欢看アニメ吗"

substr = lang_splitter.split_by_lang(
    text=text,
)
for index, item in enumerate(substr):
    print(f"{index}|{item.lang}:{item.text}")

0|zh:你喜欢看
1|ja:アニメ
2|zh:吗

import time

from split_lang import LangSplitter

# Allow substrings of the same language to merge across punctuation
lang_splitter = LangSplitter(merge_across_punctuation=True)
texts = [
    "你喜欢看アニメ吗？我也喜欢看",
    "Please star this project on GitHub, Thanks you. I love you请加星这个项目，谢谢你。我爱你この項目をスターしてください、ありがとうございます！愛してる",
]
time1 = time.time()
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
    print("----------------------")
time2 = time.time()
print(time2 - time1)

0|zh:你喜欢看
1|ja:アニメ
2|zh:吗？我也喜欢看
----------------------
0|en:Please star this project on GitHub, Thanks you. I love you
1|zh:请加星这个项目，谢谢你。我爱你
2|ja:この項目をスターしてください、ありがとうございます！愛してる
----------------------
0.007998466491699219

3.2.2. `merge_across_digit`

lang_splitter.merge_across_digit = False
texts = [
    "衬衫的价格是9.15便士",
]
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")

0|zh:衬衫的价格是
1|digit:9.15
2|zh:便士
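
For contrast, here is a rough sketch of what merging across digits could mean when the option is enabled. It is written against plain `(lang, text)` tuples rather than the package's own objects, and both the helper name `merge_digit_segments` and the exact merging rule are assumptions, not the library's implementation.

```python
def merge_digit_segments(segments):
    """Fold 'digit' segments into a neighbouring language segment.

    A digit segment is appended to the segment before it. Digits at the
    very start are prepended to the first language segment instead.
    """
    merged = []
    pending = ""  # digits seen before any language segment
    for lang, text in segments:
        if lang == "digit":
            if merged:
                prev_lang, prev_text = merged[-1]
                merged[-1] = (prev_lang, prev_text + text)
            else:
                pending += text
        elif merged and merged[-1][0] == lang:
            # Same language on both sides of the digits: keep one segment.
            merged[-1] = (lang, merged[-1][1] + text)
        else:
            merged.append((lang, pending + text))
            pending = ""
    if pending:  # the input contained nothing but digits
        merged.append(("digit", pending))
    return merged


segments = [("zh", "衬衫的价格是"), ("digit", "9.15"), ("zh", "便士")]
print(merge_digit_segments(segments))
```

Under this rule the three segments shown above collapse into a single `zh` segment.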

3.3. Advanced

3.3.1. usage of `lang_map` and `default_lang` (for your languages)

[!IMPORTANT] Add the language codes for your use case if other languages are needed. See Support Language

The default lang_map is shown below.
- If lingua-py, fasttext or any other language detector detects a language that is NOT included in lang_map, the substring is assigned default_lang
- If you set default_lang, or the value of a key:value pair in lang_map, to x, the substring is merged into a neighbouring substring
  - zh | x | ja -> zh | ja (x has been merged into one side)
  - In the example below, zh-tw is set to x because characters in zh and ja are sometimes detected as Traditional Chinese

The default default_lang is x.

DEFAULT_LANG_MAP = {
    "zh": "zh",
    "yue": "zh",  # Cantonese
    "wuu": "zh",  # Wu Chinese
    "zh-cn": "zh",
    "zh-tw": "x",
    "ko": "ko",
    "ja": "ja",
    "de": "de",
    "fr": "fr",
    "en": "en",
    "hr": "en",
}
DEFAULT_LANG = "x"
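
The two rules described above can be sketched in plain Python. The helpers `normalize_lang` and `merge_unknown` are hypothetical names used for illustration, the map is a shortened copy of the default one, and merging into the preceding neighbour is an assumption about which side is chosen.

```python
LANG_MAP = {"zh": "zh", "zh-cn": "zh", "zh-tw": "x", "ja": "ja", "en": "en"}
DEFAULT_LANG = "x"


def normalize_lang(detected, lang_map=LANG_MAP, default_lang=DEFAULT_LANG):
    """Map a detector's code to a target code, falling back to default_lang."""
    return lang_map.get(detected, default_lang)


def merge_unknown(segments, unknown="x"):
    """Merge segments labelled `unknown` into a neighbouring segment."""
    merged = []
    pending = ""  # unknown text seen before any labelled segment
    for lang, text in segments:
        if lang == unknown:
            if merged:
                merged[-1] = (merged[-1][0], merged[-1][1] + text)
            else:
                pending += text
        else:
            merged.append((lang, pending + text))
            pending = ""
    if pending:  # nothing but unknown text
        merged.append((unknown, pending))
    return merged


print(normalize_lang("zh-cn"))  # a code listed in the map
print(normalize_lang("pt"))     # a code missing from the map
print(merge_unknown([("zh", "你好"), ("x", "嗎"), ("ja", "です")]))
```

Mapping a code to `x` is therefore a way of saying "do not trust this detection; let the surrounding text decide".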

4. Acknowledgement

Inspired by LlmKira/fast-langdetect
Text segmentation depends on google/budoux
Language detection depends on zafercavdar/fasttext-langdetect and rspeer/wordfreq
