OpusCleaner
The regexes for characters in a language match on only 1 character, not the entire token
https://github.com/hplt-project/OpusCleaner/blob/main/opuscleaner/filters/clean_common.py
Each per-language regex in CHARS is a bare character class, so it matches a single character rather than the entire token:
'ca': r'[a-zÀàÈèÉéÍíÒòÓóÚúÇç]',
'cs': r'[a-zÁáČčĎďÉéěÍíŇňÓóŘřŠšŤťÚúůÝýŽž]',
'da': r'[a-zÆæØøÅå]',
'de': r'[a-zÄäÖöÜüß]',
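As opposed to a pattern anchored to the whole token (the + is needed; without it, ^...$ would only accept one-character tokens):
'ca': r'^[a-zÀàÈèÉéÍíÒòÓóÚúÇç]+$',
'cs': r'^[a-zÁáČčĎďÉéěÍíŇňÓóŘřŠšŤťÚúůÝýŽž]+$',
'da': r'^[a-zÆæØøÅå]+$',
'de': r'^[a-zÄäÖöÜüß]+$',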
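Because re.match only anchors at the start of the string, any token whose first character is in the class counts as a word. For instance, "a0e1" is counted as a word ("0a1e" would be too if re.search were used). I don't know whether this is the intended behavior of this filter.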
Here is how the regex is used:
num_words = sum(
    [1 if re.match(CHARS[src_lang], t, re.IGNORECASE) else 0 for t in src_toks])
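To make the difference concrete, here is a minimal sketch comparing the current first-character check with a stricter whole-token check. The function names, the DE_CHARS constant, and the sample tokens are made up for illustration and are not part of OpusCleaner; only the German character class is copied from the table above. The stricter variant uses re.fullmatch with a trailing +, which is one possible fix, not necessarily the one the maintainers would choose.

```python
import re

# German character class, copied from the CHARS table quoted above.
DE_CHARS = r'[a-zÄäÖöÜüß]'


def count_words_prefix(tokens, char_class=DE_CHARS):
    """Mirror the current usage: re.match only tests the first character."""
    return sum(1 for t in tokens if re.match(char_class, t, re.IGNORECASE))


def count_words_full(tokens, char_class=DE_CHARS):
    """Hypothetical stricter variant: every character must be in the class."""
    return sum(
        1 for t in tokens if re.fullmatch(char_class + '+', t, re.IGNORECASE)
    )


# Illustrative tokens, not taken from any real corpus.
tokens = ['Straße', 'a0e1', '0a1e', '123', 'über']

print(count_words_prefix(tokens))  # 3: 'Straße', 'a0e1', 'über'
print(count_words_full(tokens))    # 2: 'Straße', 'über'
```

With the first-character check, "a0e1" is counted as a word even though half of it is digits; the whole-token check rejects it.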