EasyOCR
Kyrgyz language support
Hello, thank you for your fantastic work.
Please add support for the Kyrgyz language. How can I help?
In this pull request I provide a character list and a word list built from the two corpora from here, using this hacky script:
import re

paths = [
    # "data/kir_community_2017/kir_community_2017-words.txt",
    "data/kir_newscrawl_2016_1M/kir_newscrawl_2016_1M-words.txt",
    "data/kir_wikipedia_2021_300K/kir_wikipedia_2021_300K-words.txt",
]

tokens = []
# Tokens matching this pattern (Latin or Greek letters, digits, stray symbols, etc.) are discarded.
removable = re.compile(r"(.*[′…ЇЈЎ&')¤/´˅(\"A-Za-z0-9Α-Ωα-ω.úƒƖ½ö+ЄІ,:;?!>< ]+.*|Ё.*|\w-\w+)", re.UNICODE)

for path in paths:
    with open(path, "r", encoding="utf-8") as rf:
        for line in rf:
            line = line.strip()
            if line:
                # Each line is tab-separated: id, word, frequency count.
                split_line = line.split("\t")
                count = int(split_line[2])
                if count < 6:
                    continue
                # Map look-alike characters to the proper Kyrgyz letters.
                token = split_line[1].strip() \
                    .replace("ɵ", "ө") \
                    .replace("ϴ", "Ө") \
                    .replace("ʏ", "ү")
                token = token.strip("•₣‰ʿ°—‘»²¬/µ«£:;“”„'()´`$%–№.,-")
                if len(token) > 2 and not removable.match(token):
                    tokens.append(token)

tokens = sorted(set(tokens))

# Drop everything from "өөө" onwards in the sorted list.
tokens_clipped_tail = []
for token in tokens:
    if token == "өөө":
        break
    tokens_clipped_tail.append(token)

with open("ky.txt", "w", encoding="utf-8") as wf:
    wf.write("\n".join(tokens_clipped_tail))

print(f"A total of {len(tokens_clipped_tail)} tokens.")
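The script above only writes the word list. A character list could be derived from the same tokens along the following lines. This is a minimal sketch, not the code actually used for this pull request: `collect_characters` is a hypothetical helper name, and the result would still need a manual check against the Kyrgyz alphabet.

```python
def collect_characters(words):
    """Return the distinct characters used in `words`, sorted by code point."""
    chars = set()
    for word in words:
        # Ignore surrounding whitespace so it never ends up in the character list.
        chars.update(word.strip())
    return sorted(chars)


# Small in-memory demo; in practice the input would be the tokens written to ky.txt.
print(collect_characters(["ата", "эне"]))  # ['а', 'е', 'н', 'т', 'э']
```

Sorting by code point puts the Kyrgyz-specific letters (ң, ү, ө) after the basic Cyrillic range, so they appear at the end of the list.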
Best regards, Anton.
Can you help me with the Kazakh language? Please message me at https://t.me/hellomik