Kyrgyz language support

Open alexeyev opened this issue 1 year ago • 1 comment

Hello, thank you for your fantastic work.

Please add support for the Kyrgyz language. How can I help?

In this pull request I provide a list of characters and a list of words built from the two corpora from here, using this hacky script:

import re

paths = [
    # "data/kir_community_2017/kir_community_2017-words.txt",  # currently disabled
    "data/kir_newscrawl_2016_1M/kir_newscrawl_2016_1M-words.txt",
    "data/kir_wikipedia_2021_300K/kir_wikipedia_2021_300K-words.txt",
]

tokens = []
removable = re.compile(r"(.*[′…ЇЈЎ&')¤/´˅(\"A-Za-z0-9Α-Ωα-ω.úƒƖ½ö+ЄІ,:;?!>< ]+.*|Ё.*|\w-\w+)", re.UNICODE)

for path in paths:
    with open(path, "r", encoding="utf-8") as rf:
        for line in rf:
            line = line.strip()
            if line:
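                # Column 1 holds the token, column 2 its frequency count.
                split_line = line.split("\t")
                count = int(split_line[2])
                # Skip rare tokens (seen fewer than 6 times).
                if count < 6:
                    continue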
                token = split_line[1].strip() \
                    .replace("ɵ", "ө") \
                    .replace("ϴ", "Ө") \
                    .replace("ʏ", "ү")
                token = token.strip("•₣‰ʿ°—‘»²¬/µ«£:;“”„'()´`$%–№.,-")
                if len(token) > 2 and not removable.match(token):
                    tokens.append(token)

tokens = sorted(set(tokens))
tokens_clipped_tail = []

# Keep only the tokens that sort before "өөө"; the rest of the sorted list is dropped.
for token in tokens:
    if token == "өөө":
        break
    tokens_clipped_tail.append(token)

with open("ky.txt", "w", encoding="utf-8") as wf:
    wf.write("\n".join(tokens_clipped_tail))

print(f"A total of {len(tokens_clipped_tail)} tokens.")
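
In case it helps with review, here is a minimal, self-contained sketch of the same cleaning steps that runs on in-memory lines instead of the corpus files. The names `clean_token`, `build_vocab`, `LOOKALIKES`, `REJECT` and `STRIP_CHARS` are made up for this sketch, and the reject pattern and the strip set are deliberately reduced stand-ins for the ones in the script above, so results may differ from the real script on edge cases.

```python
import re
from itertools import takewhile
from typing import Iterable, List, Optional

# Look-alike characters mapped to Kyrgyz Cyrillic letters (same pairs as the script).
LOOKALIKES = {
    "\u0275": "\u04e9",  # Latin barred o       -> Cyrillic barred o
    "\u03f4": "\u04e8",  # Greek theta symbol   -> Cyrillic capital barred O
    "\u028f": "\u04af",  # Latin small capital Y -> Cyrillic straight u
}

# ASSUMPTION: simplified stand-in for the script's `removable` pattern.
# It rejects any token containing a Latin letter, a digit, or a space.
REJECT = re.compile(r".*[A-Za-z0-9 ]")

# ASSUMPTION: reduced punctuation set; the script strips a longer list.
STRIP_CHARS = "«»“”„'().,:;-–—"

MIN_COUNT = 6  # the script skips tokens with count < 6


def clean_token(line: str) -> Optional[str]:
    """Return the cleaned token from one tab-separated line, or None if rejected."""
    fields = line.strip().split("\t")
    if len(fields) < 3:
        return None  # malformed line; the script above would raise IndexError here
    if int(fields[2]) < MIN_COUNT:
        return None
    token = fields[1].strip()
    for src, dst in LOOKALIKES.items():
        token = token.replace(src, dst)
    token = token.strip(STRIP_CHARS)
    if len(token) > 2 and not REJECT.match(token):
        return token
    return None


def build_vocab(lines: Iterable[str], stop_at: str = "\u04e9\u04e9\u04e9") -> List[str]:
    """Deduplicate, sort, and keep only the tokens that sort before `stop_at`."""
    cleaned = {tok for tok in map(clean_token, lines) if tok is not None}
    # Like the `break` in the script: nothing is clipped if `stop_at` is absent.
    return list(takewhile(lambda tok: tok != stop_at, sorted(cleaned)))


if __name__ == "__main__":
    sample = ["1\t«кыргыз»\t10", "2\tbishkek\t100", "3\tкитеп\t3"]
    print(build_vocab(sample))
```

Splitting the per-line logic into a function makes it possible to check individual filtering decisions (count threshold, look-alike replacement, rejection) without downloading the corpora.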

Best regards, Anton.

Dec 05 '24 06:12 alexeyev

Can you help me with the Kazakh language? Please write to me at https://t.me/hellomik

Dec 16 '24 14:12 Hellomik2002