ajaykg

Results 2 issues of ajaykg

```
>>> import regex as re
>>> gpt2pat = re.compile(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""")
>>> str = r"""हहिन्दी विकिपीडिया"""
>>> print(re.findall(gpt2pat, str))
['हह', 'िन', '्द', 'ी', ' व', 'िक', 'िप',...
```

Fixes the problem that all tokenizers have with combining marks such as diacritics, Indic matras (vowel signs after consonants), the Indic halant, and their counterparts in Arabic, Hebrew, etc. This was probably breaking most...
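As a rough illustration of why combining marks get split off, the sketch below uses only the standard-library `unicodedata` module. It is not the pattern from the issue and not necessarily the fix proposed there: `split_letter_runs` is a hypothetical helper that mimics a `\p{L}+` run, and the `keep_marks` flag stands in for the assumed remedy of also accepting the mark categories (`Mn`, `Mc`, `Me`) inside a letter run.

```python
import unicodedata


def split_letter_runs(text, keep_marks):
    """Collect runs of letters (category L*), roughly like \\p{L}+.

    Hypothetical helper for illustration only. When keep_marks is True,
    combining marks (category M*) are treated as part of a run, which is
    the assumed fix sketched here.
    """
    runs, current = [], ""
    for ch in text:
        cat = unicodedata.category(ch)
        in_run = cat.startswith("L") or (keep_marks and cat.startswith("M"))
        if in_run:
            current += ch
        else:
            # A character outside the class ends the current run.
            if current:
                runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return runs


# The Hindi word "हिन्दी", written with escapes to avoid encoding problems.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Matras (U+093F, U+0940) and the halant (U+094D) are marks, not letters.
categories = [unicodedata.category(ch) for ch in word]

letters_only = split_letter_runs(word, keep_marks=False)
with_marks = split_letter_runs(word, keep_marks=True)

print(categories)
print(letters_only)
print(with_marks)
```

With letters only, every matra or halant interrupts the run, so the word falls apart into bare consonants; once marks are accepted, the word stays in one piece. In a real pattern the analogous change would presumably involve `\p{M}`, but the exact form is left open here.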